BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction
https://zenodo.org/record/5134732
GNU General Public License v3.0

question - are all protein pairs direct interactors? #24

Open rubenalv opened 2 months ago

rubenalv commented 2 months ago

I was thinking of using DIPS-Plus to train a classifier for protein-protein interaction (I confess I have little experience in ML, so any advice is welcome). The database defines the pairs based on the distance between their atoms, and I saw a few examples of pairs with fewer than 5 atoms within the threshold distance. So two questions:

Perhaps the second question is more suited to, e.g., Stack Exchange, but I'm open to any advice here!

rubenalv commented 2 months ago

I decided to map the chain pairs to IntAct, to check if they were annotated as direct interactors. I used these resources:

Selecting only direct interactors at an IntAct miscore >= 0.45 (the recommended setting), I could map only 4564 of the 42K pairs in DIPS-Plus. At a miscore < 0.45 I collected 1758 pairs.

So the conclusion, at least based on the IntAct data, is that only a fraction of the DIPS-Plus pairs contain chains in direct interaction. Anyone who wants to use this dataset to classify direct protein-protein interactions should curate it first. If the goal is to annotate atoms in proximity between the chains of each pair, or to create point-cloud embeddings like the dMaSIF ones, the 42K pairs make a great dataset.
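
For illustration, here is a minimal sketch of that curation step: keep only pairs annotated as direct interactors and split them at the miscore threshold. It assumes the DIPS-Plus pairs have already been mapped to IntAct and flattened into plain dicts. The field names (`pair`, `interaction_type`, `miscore`), the function name, and the toy records are all hypothetical and are not taken from DIPS-Plus or IntAct.

```python
def split_direct_by_miscore(records, threshold=0.45):
    """Split direct-interaction records into (high, low) confidence pair lists.

    `records` is an iterable of dicts with the hypothetical keys
    'pair', 'interaction_type' and 'miscore'.
    """
    high, low = [], []
    for rec in records:
        # Skip anything not annotated as a direct interaction.
        if rec["interaction_type"] != "direct interaction":
            continue
        # Records at or above the threshold go to the high-confidence bucket.
        if rec["miscore"] >= threshold:
            high.append(rec["pair"])
        else:
            low.append(rec["pair"])
    return high, low


# Made-up records, for demonstration only.
toy_records = [
    {"pair": "PDB1_A_B", "interaction_type": "direct interaction", "miscore": 0.62},
    {"pair": "PDB2_A_C", "interaction_type": "direct interaction", "miscore": 0.31},
    {"pair": "PDB3_B_D", "interaction_type": "physical association", "miscore": 0.80},
    {"pair": "PDB4_A_B", "interaction_type": "direct interaction", "miscore": 0.45},
]

high, low = split_direct_by_miscore(toy_records)
print(len(high), "pairs at or above threshold;", len(low), "below")
```

Pairs that cannot be mapped at all never enter `records`, so they end up in neither bucket.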

@amorehead, I'll leave the issue open in case you would like to comment, otherwise feel free to close it. Thanks for the resource!

rubenalv commented 1 month ago

I noticed that the chain pairs in the .dill files do not account for homomultimers. E.g., 2YKS, which is a pentamer of 5 identical sequences, generates the pairs 2YKS_A_B, 2YKS_A_C, etc., which are identical. This will create some imbalance (and burden) when training. I realised this when running FoldSeek cluster, so I'd encourage using the FoldSeek-based split in the database.
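
As a rough sketch of how such redundant pairs could be spotted without clustering, the snippet below groups pair IDs of the form `PDB_CHAIN_CHAIN` by the unordered pair of chain sequences and keeps one representative per group. It assumes chain IDs contain no underscores and that the chain sequences are already available in a dict. The function name, the `chain_seqs` layout, and the placeholder sequence are hypothetical (the string used here is not the real 2YKS sequence). It only detects exact sequence-level duplicates, so it is no substitute for the FoldSeek-based split.

```python
from collections import defaultdict


def group_pairs_by_sequence(pair_ids, chain_seqs):
    """Group pair IDs that share the same unordered pair of chain sequences.

    pair_ids:   iterable of strings such as '2YKS_A_B'
    chain_seqs: dict mapping (pdb_id, chain_id) -> sequence string
    """
    groups = defaultdict(list)
    for pair_id in pair_ids:
        pdb_id, chain_a, chain_b = pair_id.split("_")
        # Sort the two sequences so that A-B and B-A give the same key.
        seq_key = tuple(sorted((chain_seqs[(pdb_id, chain_a)],
                                chain_seqs[(pdb_id, chain_b)])))
        groups[(pdb_id, seq_key)].append(pair_id)
    return groups


# Placeholder data: three chains sharing one made-up sequence.
chain_seqs = {("2YKS", c): "MKVLA" for c in "ABC"}
pair_ids = ["2YKS_A_B", "2YKS_A_C", "2YKS_B_C"]

groups = group_pairs_by_sequence(pair_ids, chain_seqs)
# Keep the first pair seen in each group as its representative.
representatives = [members[0] for members in groups.values()]
print(representatives)
```

Note that pairs with identical sequences can still differ geometrically (for example, adjacent versus non-adjacent subunits), so whether to drop them depends on the task.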