AmingWu closed this issue 5 months ago
Dear Authors, I ran the code using NVIDIA GTX 5000 Ada for a long time but it has not finished training. Could you please tell me if I made any mistakes or if the training process itself takes a long time?
The training code has been released. Training time depends on the dataset and the hardware platform. I would suggest first training on the smallest (fastest) dataset, SHS27k, which can be done in a couple of hours on an NVIDIA A100. As for the largest dataset, STRING, despite the many speedups we implemented, it unfortunately still requires a very long time to train.
Thank you for your response. Additionally, I noticed that the results provided in your article, as well as those obtained from running public code, are both specific numerical values. I am interested in whether your work can be utilized for protein-protein interaction prediction, such as predicting interactions between two given proteins. Can it directly output whether they have an interaction (outputting 1 if they do, and 0 if they don't)? Considering that your data source is STRING, I would like to inquire whether your method is applicable to predicting protein interactions across species, such as between viruses and human hosts. Thank you!
The topic studied in this work is interaction category prediction, i.e. multi-label classification. The numerical values are results on the test datasets. This work can easily be extended to predict whether two proteins interact (0/1), which is a binary classification problem. One possible approach is to change the output dimension to 2 and train with correspondingly labeled data. Furthermore, as a general method, we believe it has the potential to be extended to other species, but only if one preprocesses those data and uses them to re-train the model.
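As a rough illustration of the label side of that extension, the sketch below collapses multi-label interaction-type vectors into a single 0/1 target. It is not taken from the repository: the function name `to_binary_labels` and the example vectors are hypothetical, and it assumes that non-interacting pairs are present in the data and carry an all-zero vector. If the data contains only interacting pairs, negative pairs would have to be added separately.

```python
def to_binary_labels(multi_labels):
    """Collapse multi-label vectors into 0/1 targets.

    A pair is labeled 1 if at least one interaction type is set,
    and 0 if the whole vector is zero (assumed to mean "no interaction").
    """
    return [1 if any(vector) else 0 for vector in multi_labels]


# Hypothetical label vectors, one per protein pair.
example = [
    [1, 0, 0, 1, 0, 0, 0],  # two interaction types  -> interacts
    [0, 0, 0, 0, 0, 0, 0],  # no interaction type    -> does not interact
    [0, 1, 0, 0, 0, 0, 0],  # one interaction type   -> interacts
]

binary = to_binary_labels(example)
print(binary)
```

The model's output layer would then need to match these targets (for example, two output classes as suggested above) before re-training.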
Dear Author, I would like to use your model to make predictions based on the data I need. Regarding the data preprocessing part, what is the source and purpose of the "all_assign.txt" file in the "raw_data" folder? How were the "{}_ppi.pkl" and "{}_ppi_label.pkl" files obtained in the "processed_data" folder on your GitHub? I speculate that the data is obtained from the PDB file. However, I have noticed that only the data from the "{}_protein_graphs.pkl" file can be obtained from the PDB file. I would like to seek your guidance on the source of data for the other two pkl files. Thank you!
“all_assign.txt” is a file describing the physicochemical features of each amino acid, which were shown to be effective in previous work (https://github.com/zqgao22/HIGH-PPI).
“{}_ppi.pkl” and “{}_ppi_label.pkl” are extracted from “protein.actions.STRING” and “protein.STRING.sequences.dictionary” by running “dataloader.py”. The PDB files only characterize each protein and are not directly related to PPIs.
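To see what the processed files contain before preparing your own data, a small inspection helper along these lines may be useful. This is only a sketch: `describe_pickle` is a hypothetical helper, not part of the repository, and it makes no assumption about the internal layout of the files. Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a source you trust.

```python
import pickle


def describe_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    # Not every pickled object has a length, so fall back to None.
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


# Example usage (replace the path with a real file from "processed_data"):
# print(describe_pickle("path/to/some_file.pkl"))
```

Running this on the PPI file and the label file should show whether the two have matching lengths, which is a quick sanity check after re-running the preprocessing on new data.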
Dear Authors,
Have you released the training code? How long does this method take to train?