Emad-COMBINE-lab / intrepppid

Incorporating Triplet Error for Predicting PPIs using Deep Learning
https://emad-combine-lab.github.io/intrepppid/
GNU Affero General Public License v3.0

Symmetrical interactions give different probabilities for interaction #4

Closed Rohit-Satyam closed 2 weeks ago

Rohit-Satyam commented 2 weeks ago

Dear @jszym

We observed that Intrepppid gives different probabilities for the same pair of sequences when the order of the input sequences is flipped. An example is given below. The difference between the two predicted probabilities is usually around 0.1 (10 percentage points).

PF3D7_xxxx400,PF3D7_xxxx100,0.5979217886924744
PF3D7_xxxx100,PF3D7_xxxx400,0.6844050884246826

## The proteins below are entirely different from the two proteins above. IDs have been masked.
PF3D7_xxxx600,PF3D7_xxxx100,0.7216829061508179
PF3D7_xxxx100,PF3D7_xxxx600,0.627219021320343

In our experience, D-SCRIPT gives the same probability score regardless of the order in which the proteins are given as input. Can you say why this happens and whether it is preventable? Currently, we are taking the average of the two probabilities to work around this issue.
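The averaging workaround described above can be sketched as follows. This is a minimal illustration, not part of Intrepppid: the function name `average_symmetric_scores` and the tuple layout are assumptions, and only the first pair from the example is used because the masked IDs in the second pair refer to different proteins.

```python
from collections import defaultdict


def average_symmetric_scores(rows):
    """Average scores over both orderings of each protein pair.

    rows: iterable of (protein_a, protein_b, probability) tuples.
    Returns a dict mapping a sorted (id_1, id_2) tuple to the mean score.
    (Hypothetical helper for illustration; not an Intrepppid function.)
    """
    grouped = defaultdict(list)
    for protein_a, protein_b, probability in rows:
        # Sorting the IDs makes (A, B) and (B, A) share one key.
        key = tuple(sorted((protein_a, protein_b)))
        grouped[key].append(float(probability))
    return {key: sum(scores) / len(scores) for key, scores in grouped.items()}


rows = [
    ("PF3D7_xxxx400", "PF3D7_xxxx100", 0.5979217886924744),
    ("PF3D7_xxxx100", "PF3D7_xxxx400", 0.6844050884246826),
]
averaged = average_symmetric_scores(rows)
```

Keying on the sorted ID pair means the result no longer depends on input order, at the cost of hiding any real asymmetry in the underlying scores.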

Rohit-Satyam commented 2 weeks ago

I went back to the person who ran the tool, and it turns out the production run split the larger proteins (>1500 aa) into parts and ran the predictions on each part separately. That explains the two different probabilities. Apologies for the confusion.

jszym commented 2 weeks ago

Glad to hear that the problem was resolved. Just a reminder: in the manuscript, we use the first 1,500 amino acids of each protein. Larger proteins are truncated to 1,500. For sequences shorter than 1,500, padding with zeros on the right of the sequence is necessary only during training (where one pads to either 1,500 or the largest sequence in the batch, whichever is smaller). At inference, inputs may be any size under 1,500.
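The truncation and padding rules above could be sketched like this. It is an illustration under stated assumptions, not Intrepppid's actual preprocessing code: sequences are assumed to be already encoded as lists of integer tokens, and the names `MAX_LEN`, `truncate`, and `pad_batch` are hypothetical.

```python
MAX_LEN = 1500  # maximum sequence length stated in the comment above


def truncate(tokens, max_len=MAX_LEN):
    """Keep only the first max_len tokens of an encoded sequence."""
    return list(tokens[:max_len])


def pad_batch(batch, max_len=MAX_LEN, pad_value=0):
    """Truncate, then right-pad with zeros for training.

    The target length is the smaller of max_len and the longest
    sequence in the batch. (Hypothetical helper, not Intrepppid's API.)
    """
    truncated = [truncate(seq, max_len) for seq in batch]
    target = min(max_len, max(len(seq) for seq in truncated))
    return [seq + [pad_value] * (target - len(seq)) for seq in truncated]


# Training: short sequences are padded to the longest one in the batch.
short_batch = pad_batch([[5, 3, 9], [7, 2, 4, 1, 8]])

# A sequence longer than MAX_LEN caps the target length at MAX_LEN.
long_batch = pad_batch([[1] * 2000, [2] * 10])

# Inference: truncation only, no padding.
inference_input = truncate([1] * 2000)
```

Splitting a long protein into parts and scoring each part separately, as happened in the run discussed in this thread, is a different procedure from this truncation and yields one score per part.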