FreyrS / dMaSIF


Running your example does not give the same ROC-AUC as claimed in the paper #37

Open wjs20 opened 2 years ago

wjs20 commented 2 years ago

Hi

I ran the command below (I had to fetch the pretrained model from the dMaSIF_colab repo) to test the site prediction accuracy, and the ROC-AUC averaged 0.62, not the 0.87 you state in the paper. How did you get this figure?

python --experiment_name dMaSIF_site_3layer_16dims_9A_100sup_epoch64 --batch_size 64 --embedding_layer dMaSIF --site True --emb_dims 16 --device cuda:0 --radius 9.0 --n_layers 3

I also ran the same command but for search prediction and I got an average ROC-AUC of ~0.5 (so essentially random).

It would be really helpful if you could explain how to use this tool to get the results you say you obtained in the paper.

Thanks

rubenalv commented 1 year ago

@wjs20, did you get any further, by any chance?

wjs20 commented 1 year ago

Unfortunately not. Did you have a stab at it?

rubenalv commented 1 year ago

@wjs20, a full broadsword thrust... I had to correct probably all the modules, e.g. the boolean arguments in Arguments.py, which are wrongly coded (very fun to ask for True and get a False). I realised that it was not only hard-coded, but that the code also failed with new training data. After a long while and lots of edits for the memory leak (which I could not fix in the end, but it is in the training loop), I trained the model with the DIPS-plus dataset and got above 0.8 AUROC. If you are after finding the PPI surface, give pesto a go (https://pesto.epfl.ch/). The website will take you to the paper, and the paper to the GitHub. The GitHub requires work to figure out, but nowhere near as much as dmasif.

wjs20 commented 1 year ago

Yeah, I wasted a lot of time on this just trying to fix all the code that was broken. Thanks for the tip, will try pesto.

orange2350 commented 10 months ago

@rubenalv Hi, may I ask where exactly you found the problems? I can run dmasif_search, but the mean AUROC of the predictions is only about 0.55. Have you found out what's wrong? Thank you very much.

rubenalv commented 10 months ago

Hi, I do not have access to my computer right now, but I can say that there are tons of problems. For one, check your Arguments.py, because you may not even be running the search: the argument is wrongly defined as a boolean (check the docs for the arguments library), so the value you get is not always what you expect. You can insert a print(params) in the main script to see if search==True and check the other booleans. But I only recommend using dmasif if you have time to troubleshoot it; otherwise the masif software is better documented and mainstream.
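
The boolean problem described above can be illustrated with a minimal sketch. It assumes the flags are declared with argparse and `type=bool`, which this thread does not confirm for Arguments.py; the `str2bool` helper is hypothetical and not part of the repository. With `type=bool`, argparse calls `bool()` on the raw string, so any non-empty value, including "False", parses as True.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical helper: parse common true/false spellings, reject the rest."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Pitfall (assumed declaration style): type=bool applies bool() to the raw
# string, and every non-empty string is truthy.
naive = argparse.ArgumentParser()
naive.add_argument("--search", type=bool, default=False)

# One possible fix: an explicit string-to-bool converter.
fixed = argparse.ArgumentParser()
fixed.add_argument("--search", type=str2bool, default=False)

naive_args = naive.parse_args(["--search", "False"])
fixed_args = fixed.parse_args(["--search", "False"])

print(naive_args.search)  # True, even though "False" was passed
print(fixed_args.search)  # False
```

Printing the parsed namespace at the top of the main script, as suggested above, is the quickest way to see which of the two behaviours a given run actually has.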

orange2350 commented 10 months ago

@rubenalv Thank you very much for your reply. I checked the Arguments file, reran the main_training and inference .py files, and printed the args, which show that search is True. But I found an error during the run: after inference, the script saves the prediction results as predction = P['iface_preds']..... , but the key 'iface_preds' is evidently not generated when dMaSIF runs inference with search set to True, so I think there might be a problem here. I am still looking into it. I would be honored if you have time to communicate; my email is chenzhiyi22@mails.ucas.ac.cn. All the best!
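
A small guard makes this kind of mismatch visible when saving results. The sketch below is an assumption-laden illustration, not the repository's code: `get_prediction` is a hypothetical helper, and the `embedding_1` key in the toy dictionary is a placeholder for whatever the model returns in search mode.

```python
def get_prediction(outputs: dict, key: str = "iface_preds"):
    """Return outputs[key], or fail with a message listing the keys that exist."""
    if key not in outputs:
        raise KeyError(
            f"{key!r} is missing from the model output; "
            f"available keys: {sorted(outputs)}"
        )
    return outputs[key]


# Toy stand-in for the output dictionary P (contents are made up).
site_output = {"iface_preds": [0.91, 0.12, 0.47]}
search_output = {"embedding_1": [[0.1, 0.3], [0.2, -0.4]]}

print(get_prediction(site_output))

try:
    get_prediction(search_output)
except KeyError as err:
    # The error names the keys that are present, which shows what the
    # saving step should be reading instead.
    print("cannot save predictions:", err)
```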

orange2350 commented 10 months ago

@wjs20 Hi, have you managed to get the ROC-AUC any higher? Thank you.

wjs20 commented 10 months ago

Hi orange2350. I haven't touched this repo in over a year. The code is full of bugs and very poorly written; I wouldn't trust any results that come out of it. The author seems to have abandoned it. I would give up on it if I were you. As @rubenalv said, Masif looks to me like it is better maintained. It is your best bet, I think.

orange2350 commented 10 months ago

@wjs20 Hi wjs20, thank you for your reply :)

rubenalv commented 10 months ago

@orange2350, on top of that, the output of dmasif --search is embeddings, not a probability of binding like with --site, so you still need to figure out how to use those embeddings. If you have the time and skill, and really want dmasif over masif or newer ones, expect 2 weeks to 2 months to sort out all the problems, and then you'll probably want to retrain the network with a better dataset (https://www.nature.com/articles/s41597-023-02409-3). Don't take it as discouragement, not at all, but I think it's a realistic assessment of the state of the GitHub repo.

orange2350 commented 10 months ago

@rubenalv Hi, my work may be better carried out with dmasif, so I am prepared to spend a long time on it. Thank you, also for the dataset you recommend! You may have forgotten, but you have helped me before with a dataset recommendation. Anyway, your reply makes me feel very warm on the difficult road of scientific research and exploration. Have a nice day!! :)

rubenalv commented 5 months ago

@orange2350, have you made any progress that you could share, by any chance? I am stuck at the point that I have the protein embeddings for protein pairs, but do not know how to implement the MLP (or something more updated) they suggest they use in the paper for assessing the interaction...

orange2350 commented 4 months ago

@rubenalv Hello, I have previously replicated dmasif's work and trained and evaluated it on my own dataset, but I have other work going on now, so I haven't touched this for a while. For the protein embeddings you mentioned, you take the dot product of the two embeddings, and the result of the dot product is the prediction for their interaction interface (the closer to 1, the better), but I have forgotten some of the specific details.
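
The dot-product scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's or the repository's implementation: the embeddings are made-up toy vectors, the sigmoid that squashes the dot product into (0, 1) is an added assumption, and `roc_auc` is a plain pairwise-ranking computation written here to avoid extra dependencies.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Squash a raw score into (0, 1); an assumption, not confirmed by the thread."""
    return 1.0 / (1.0 + math.exp(-x))


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): (embedding from protein A, embedding from protein B, label),
# where label 1 marks an interacting pair of surface points.
pairs = [
    ([1.0, 0.5], [0.9, 0.6], 1),
    ([0.2, -1.0], [0.1, -0.8], 1),
    ([1.0, 0.0], [-1.0, 0.2], 0),
    ([0.3, 0.9], [0.5, -0.7], 0),
]

scores = [sigmoid(dot(a, b)) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]

print(scores)
print(roc_auc(labels, scores))
```

Because ROC-AUC depends only on the ranking of the scores, the sigmoid does not change the AUC; it only maps the raw dot products onto a 0-to-1 scale.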