Closed Rohit-Satyam closed 2 weeks ago
I went back to the person who ran the tool, and it turns out the run was done by splitting the larger proteins (>1,500 aa) into parts and running the predictions on each part separately. That explains the two different probabilities. Apologies for the confusion.
Glad to hear that the problem was resolved. Just a reminder: in the manuscript, we use only the first 1,500 amino acids of each protein, so larger proteins are truncated to 1,500. For sequences shorter than 1,500, zero-padding on the right of the sequence is necessary only during training (where one pads to either 1,500 or the length of the longest sequence in the batch, whichever is smaller). At inference, inputs may be any length up to 1,500.
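For anyone preparing inputs by hand, here is a minimal sketch of the truncation and padding rule described above. It is not taken from the INTREPPPID code base: the function names `truncate` and `pad_batch` are hypothetical, and it assumes sequences are already encoded as lists of integer tokens with 0 as the padding value.

```python
MAX_LEN = 1500  # truncation length mentioned in this thread


def truncate(tokens, max_len=MAX_LEN):
    """Keep only the first max_len tokens of an encoded sequence."""
    return tokens[:max_len]


def pad_batch(batch, max_len=MAX_LEN, pad_value=0):
    """Truncate each sequence, then right-pad to a common length.

    The common length is the smaller of max_len and the longest
    sequence in the batch (training-time behaviour only).
    """
    batch = [truncate(seq, max_len) for seq in batch]
    target = min(max_len, max(len(seq) for seq in batch))
    return [seq + [pad_value] * (target - len(seq)) for seq in batch]
```

At inference, only `truncate` would be needed under this reading; splitting a long protein into parts and scoring each part separately is a different procedure and can be expected to give different probabilities.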
Dear @jszym
We observed that Intrepppid gives different probabilities for the same pair of sequences when the order of the two input sequences is swapped. An example is given below. The difference between the two predicted probabilities is usually around 10%.
In our experience, D-SCRIPT gives the same probability score regardless of the order in which the proteins are given as input. Can you say why this happens and whether it is preventable? Currently, we are taking the average of the two probabilities to work around this issue.
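For reference, the averaging workaround we use looks roughly like the sketch below. `predict` stands for any callable that takes two sequences and returns a probability; it is a placeholder and not the actual Intrepppid API, and `toy_predict` is a made-up, deliberately order-dependent scorer used only to make the example runnable.

```python
def symmetric_score(predict, seq_a, seq_b):
    """Average the predictions for both input orders.

    The result is the same whichever sequence is passed first.
    """
    p_ab = predict(seq_a, seq_b)
    p_ba = predict(seq_b, seq_a)
    return (p_ab + p_ba) / 2.0


def toy_predict(seq_a, seq_b):
    """Hypothetical order-dependent scorer (illustration only)."""
    return len(seq_a) / (len(seq_a) + 2 * len(seq_b))


a, b = "MKT", "MKTAYI"
print(toy_predict(a, b), toy_predict(b, a))  # the two orders disagree
print(symmetric_score(toy_predict, a, b))    # order-independent average
```

This makes the reported score order-independent by construction, but it does not explain why the two orders differ in the first place.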