Provide code or an explanation for the BLOSUM62 encoding

dacarlin commented 4 years ago

Would it be possible to provide code that takes an amino acid sequence as a string and encodes it in the format of the x array in data/targetp_data.npz? Or, if you provide me with the frequency matrix that was used, I would be happy to contribute code that does that. I understand that each amino acid gets mapped to a 20-element vector of the substitution frequencies that are used to create BLOSUM62, but I'm not sure where to find the frequency matrix itself.

JJAlmagro commented 4 years ago

Hello

The encoding simply maps each amino acid to the normalized version of BLOSUM62. You can find this version of the substitution matrix here: http://www.cbs.dtu.dk/courses/27623.algo/exercises/data/Cprog/blosum62.freq_rownorm
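For illustration, the mapping described above could be sketched roughly as follows. This is not the repository's code: the function name `encode_sequence`, the residue order `ARNDCQEGHILKMFPSTWYV` (the conventional BLOSUM ordering), the zero-padding and truncation to 200 positions, and the all-zero handling of unknown residues are all assumptions. The identity matrix is only a stand-in so the example runs; the real values would have to be parsed from the linked file.

```python
import numpy as np

# Assumption: residues follow the conventional BLOSUM ordering.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def encode_sequence(seq, freq, max_len=200, alphabet=ALPHABET):
    """Map each residue to its row of `freq`, zero-padding or truncating to max_len.

    `freq` is a (20, 20) array of row-normalized frequencies whose rows and
    columns follow `alphabet`. Padding, truncation, and the treatment of
    unknown residues are assumptions, not confirmed behavior of TargetP.
    """
    freq = np.asarray(freq, dtype=np.float32)
    n = len(alphabet)
    if freq.shape != (n, n):
        raise ValueError(f"expected a ({n}, {n}) matrix, got {freq.shape}")

    index = {aa: i for i, aa in enumerate(alphabet)}
    encoded = np.zeros((max_len, n), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len].upper()):
        i = index.get(aa)
        if i is not None:  # unknown residues (e.g. X) are left as all-zero rows
            encoded[pos] = freq[i]
    return encoded


# Placeholder matrix only, NOT the BLOSUM62 frequencies.
placeholder = np.eye(len(ALPHABET), dtype=np.float32)
x_example = encode_sequence("MKV", placeholder)
print(x_example.shape)
```

With a real frequency matrix in place of `placeholder`, a batch for inference would be built by stacking the encoded sequences with `np.stack`.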

dacarlin commented 4 years ago

Thank you very much for this information and your quick reply

dacarlin commented 4 years ago

Looking at the table linked above, I think I am still confused. For example, in the array x in data/targetp_data.npz, in the first encoded sequence at x[0] (shape (200, 20)), the first encoded residue (which I assume is Met, x[0][0]) is encoded by the vector:

[0.05220884, 0.03212851, 0.02008032, 0.02008032, 0.01606426,
       0.02811245, 0.02811245, 0.02811245, 0.01606426, 0.10040161,
       0.19678715, 0.03614458, 0.16064256, 0.04819277, 0.01606426,
       0.03614458, 0.04016064, 0.00803213, 0.02409638, 0.09236947]

However, I do not see a row or a column in the linked table that corresponds to this encoded residue. Perhaps I should say what I am trying to do: I have trained the model on the provided data set using train.py, and I would like to use the trained model for inference on some protein sequences from UniProt. I assume the first step is to encode my sequences the way that the training set is encoded.
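One way to check this more systematically is to compare the quoted vector against every row and column of a candidate matrix. The sketch below is only a diagnostic aid: the helper `closest_row_or_column` is hypothetical, and the candidate matrix here is random filler with the quoted vector planted in one row so that the example runs. In practice the candidate would be the table parsed from the linked file.

```python
import numpy as np

# The vector quoted above (x[0][0] from data/targetp_data.npz).
vec = np.array([
    0.05220884, 0.03212851, 0.02008032, 0.02008032, 0.01606426,
    0.02811245, 0.02811245, 0.02811245, 0.01606426, 0.10040161,
    0.19678715, 0.03614458, 0.16064256, 0.04819277, 0.01606426,
    0.03614458, 0.04016064, 0.00803213, 0.02409638, 0.09236947,
])


def closest_row_or_column(vec, matrix):
    """Return (axis, index, max_abs_diff) for the row or column nearest to vec."""
    matrix = np.asarray(matrix, dtype=float)
    row_dist = np.abs(matrix - vec).max(axis=1)    # distance to each row
    col_dist = np.abs(matrix.T - vec).max(axis=1)  # distance to each column
    r, c = int(row_dist.argmin()), int(col_dist.argmin())
    if row_dist[r] <= col_dist[c]:
        return "row", r, float(row_dist[r])
    return "column", c, float(col_dist[c])


# The entries sum to 1, which is what a normalized frequency vector would do.
total = float(vec.sum())
# Positions of the two largest entries.
top_two = np.argsort(vec)[::-1][:2].tolist()

# Hypothetical candidate: random filler with the quoted vector planted in row 12.
rng = np.random.default_rng(0)
candidate = rng.random((20, 20))
candidate[12] = vec
match = closest_row_or_column(vec, candidate)
print(total, top_two, match)
```

If the residues follow the conventional BLOSUM ordering `ARNDCQEGHILKMFPSTWYV` (an assumption), the two largest entries sit at the positions of Leu and Met. A large minimum distance against the real table would suggest that the training data was built from a differently normalized matrix, or that the columns are ordered differently.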

JJAlmagro / TargetP-2.0
