Open dacarlin opened 4 years ago
Hello
The encoding simply maps each amino acid to its entry in the normalized version of BLOSUM62. You can find this version of the substitution matrix here: http://www.cbs.dtu.dk/courses/27623.algo/exercises/data/Cprog/blosum62.freq_rownorm
Thank you very much for this information and your quick reply.
Looking at the table linked above, I think I am still confused. For example, in the array x in data/targetp_data.npz, in the first encoded sequence at x[0] (shape 200, 20), the first encoded residue (which I assume is Met, x[0][0]) is encoded by the vector:
[0.05220884, 0.03212851, 0.02008032, 0.02008032, 0.01606426,
0.02811245, 0.02811245, 0.02811245, 0.01606426, 0.10040161,
0.19678715, 0.03614458, 0.16064256, 0.04819277, 0.01606426,
0.03614458, 0.04016064, 0.00803213, 0.02409638, 0.09236947]
However, I do not see a row or a column in the data set that corresponds to this encoded residue. Perhaps I should say what I am trying to do: I have trained the model on the provided data set using train.py, and I would like to use the trained model for inference on some protein sequences from UniProt, so I assume the first step is to encode my sequences the way that the training set is encoded.
Would it be possible to provide code that takes an amino acid sequence as a string and encodes it in the format of the x array in data/targetp_data.npz? Or, if you provide me with the frequency matrix that was used, I would be happy to contribute code that does that. I understand that each amino acid gets mapped to a 20-element vector of the substitution frequencies that are used to create BLOSUM62, but I'm not sure where to find the frequency matrix itself.
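A minimal sketch of such an encoder, assuming the matrix file is whitespace-separated with `#` comment lines, one header line of residue letters, and one row of frequencies per residue (optionally prefixed by its letter). That layout is an assumption and should be checked against the linked file. The function names `parse_rownorm_matrix` and `encode_sequence`, the zero-padding to 200 positions, and the all-zero vector for unknown residues are hypothetical choices for illustration, not confirmed details of how x was built. The toy matrix below uses made-up numbers, not real BLOSUM62 values.

```python
import numpy as np


def parse_rownorm_matrix(text):
    """Parse a matrix given as text into {residue letter: vector}.

    Assumed layout (not confirmed for the real file): '#' comment lines,
    one header line of residue letters, then one row of floats per residue,
    optionally prefixed by the residue letter.
    """
    lines = [ln.split() for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    header, rows = lines[0], lines[1:]
    table = {}
    for letter, fields in zip(header, rows):
        # If the row carries its own label, trust it over the header position.
        if fields[0].isalpha():
            letter, fields = fields[0], fields[1:]
        values = [float(v) for v in fields[:len(header)]]
        table[letter] = np.array(values, dtype=np.float32)
    return table


def encode_sequence(seq, table, max_len=200):
    """Encode a sequence as a (max_len, n_columns) array.

    Assumptions: sequences longer than max_len are truncated, shorter ones
    are zero-padded, and residues missing from the table stay all-zero.
    """
    dim = len(next(iter(table.values())))
    out = np.zeros((max_len, dim), dtype=np.float32)
    for i, aa in enumerate(seq[:max_len].upper()):
        if aa in table:
            out[i] = table[aa]
    return out


# Toy 3-letter matrix with made-up values, only to show the mechanics.
toy = """# toy matrix, NOT real BLOSUM62 frequencies
A C D
A 0.5 0.25 0.25
C 0.25 0.5 0.25
D 0.2 0.2 0.6
"""
table = parse_rownorm_matrix(toy)
x_toy = encode_sequence("ACDX", table, max_len=6)
print(x_toy.shape)
```

With the real file parsed into `table`, comparing `encode_sequence(seq, table)[0]` against x[0][0] from data/targetp_data.npz would be a quick way to check whether the assumed layout, column order, and padding match the training data.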