LirongWu / MAPE-PPI

Code for ICLR 2024 (Spotlight) paper "MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding"
MIT License

About the Training Process #3

Closed AmingWu closed 5 months ago

AmingWu commented 7 months ago

Dear Authors,

Do you release the training code? How long does this method need to train?

AlphaZYL commented 7 months ago

Dear Authors, I ran the code on an NVIDIA GTX 5000 Ada for a long time, but training has not finished. Could you please tell me whether I made a mistake or whether the training process itself simply takes a long time?

LirongWu commented 7 months ago

The training code has been released. The training time depends on the dataset and the hardware platform used. I would suggest that you first train on the smallest (fastest) dataset, SHS27k, which can be done in a couple of hours on an NVIDIA A100. As for the largest dataset, STRING, even though we implemented a lot of speedups, it unfortunately still requires a very long training time to finish.

AlphaZYL commented 7 months ago

Thank you for your response. Additionally, I noticed that the results reported in your article, as well as those obtained from running the public code, are specific numerical values. I am interested in whether your work can be used for protein-protein interaction prediction itself, such as predicting whether two given proteins interact. Can it directly output whether they interact (1 if they do, 0 if they don't)? Considering that your data source is STRING, I would also like to ask whether your method is applicable to predicting protein interactions across species, such as between viruses and human hosts. Thank you!

LirongWu commented 7 months ago

The topic studied in this work is interaction category prediction, i.e. multi-label classification. The numerical values are results on the test datasets. This work can be easily extended to predict whether two proteins interact (0/1), which is a binary classification problem. One possible approach would be to change the output dimension to 2 and train with the corresponding labeled data. Furthermore, as a general method, we believe that it has the potential to be extended to other species, but only if one preprocesses those data and uses them to re-train the model.
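To illustrate the difference between the two settings described above, here is a minimal sketch in plain Python. It is not code from the MAPE-PPI repository: the function names, the 0.5 threshold, and the convention that index 1 of the 2-dimensional output means "interacts" are all assumptions made for illustration. It only shows how predictions would be decoded from raw output scores (logits) in each case, and how a multi-label target could be collapsed into a 0/1 target.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Multi-label decoding: each interaction type is decided independently.

    `threshold` is an assumed cut-off, not a value taken from the paper.
    """
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def decode_binary(logits):
    """Binary decoding for an output of dimension 2.

    Assumed convention: logits[0] = "no interaction", logits[1] = "interaction".
    """
    if len(logits) != 2:
        raise ValueError("binary decoding expects exactly two logits")
    return 1 if logits[1] > logits[0] else 0


def to_binary_label(multi_hot):
    """Collapse a multi-hot interaction-type label into a single 0/1 label."""
    return 1 if any(multi_hot) else 0


if __name__ == "__main__":
    print(decode_multilabel([2.0, -1.0, 0.0]))
    print(decode_binary([0.2, 1.5]))
    print(to_binary_label([0, 0, 1, 0, 0, 0, 0]))
```

Note that training a 0/1 predictor also requires examples of non-interacting pairs; where those come from is outside the scope of this sketch.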

AlphaZYL commented 7 months ago

Dear Author, I would like to use your model to make predictions based on the data I need. Regarding the data preprocessing part, what is the source and purpose of the "all_assign.txt" file in the "raw_data" folder? How were the "{}_ppi.pkl" and "{}_ppi_label.pkl" files obtained in the "processed_data" folder on your GitHub? I speculate that the data is obtained from the PDB file. However, I have noticed that only the data from the "{}_protein_graphs.pkl" file can be obtained from the PDB file. I would like to seek your guidance on the source of data for the other two pkl files. Thank you!

LirongWu commented 7 months ago

“all_assign.txt” is a file describing the physicochemical features of each amino acid, proven valid by a previous work (https://github.com/zqgao22/HIGH-PPI).

“{}_ppi.pkl” and “{}_ppi_label.pkl” are extracted from “protein.actions.STRING” and "protein.STRING.sequences.dictionary" by running "dataloader.py". The PDB files only characterize each individual protein and are not directly related to PPIs.
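As a rough illustration of what such an extraction step could look like, the sketch below turns a list of (protein A, protein B, interaction type) records into a list of unique protein pairs and a matching list of multi-hot labels. This is not the repository's "dataloader.py": the function name, the record layout, the protein IDs, and the list of seven interaction type names are all assumptions made for this example, and the real files may be organized differently.

```python
# Assumed interaction type names; check the actual data files before relying on them.
INTERACTION_TYPES = [
    "reaction", "binding", "ptmod", "activation",
    "inhibition", "catalysis", "expression",
]


def build_ppi_tables(rows, type_names=INTERACTION_TYPES):
    """Build a pair list and multi-hot labels from interaction records.

    rows: iterable of (protein_a, protein_b, interaction_type) tuples
          (a hypothetical, simplified view of an interaction file).
    Returns (ppi_list, ppi_labels), where ppi_labels[i] is the multi-hot
    label vector for the pair ppi_list[i].
    """
    type_index = {name: i for i, name in enumerate(type_names)}
    pair_index = {}   # maps an unordered pair to its position in ppi_list
    ppi_list = []
    ppi_labels = []
    for protein_a, protein_b, mode in rows:
        if mode not in type_index:
            continue  # skip interaction types outside the assumed label set
        # Treat (A, B) and (B, A) as the same pair.
        key = tuple(sorted((protein_a, protein_b)))
        if key not in pair_index:
            pair_index[key] = len(ppi_list)
            ppi_list.append(key)
            ppi_labels.append([0] * len(type_names))
        ppi_labels[pair_index[key]][type_index[mode]] = 1
    return ppi_list, ppi_labels


if __name__ == "__main__":
    records = [
        ("P1", "P2", "binding"),
        ("P2", "P1", "activation"),
        ("P1", "P3", "reaction"),
    ]
    pairs, labels = build_ppi_tables(records)
    print(pairs)
    print(labels)
```

The point of the sketch is only that the pair list and the label list come from interaction records rather than from structure files, which matches the answer above that the PDB files describe individual proteins.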