Roestlab / massformer

Tandem Mass Spectrum Prediction with Graph Transformers
BSD 2-Clause "Simplified" License

WARNING about input molecules #6

Open CesareWang opened 1 month ago

CesareWang commented 1 month ago

Thank you for your outstanding work and for sharing it! This WARNING occurred while predicting mass spectra for some SMILES strings with your pre-trained model: [image] The model then stops predicting. This suggests a problem with the input SMILES strings, but I checked the 1840th SMILES string entered: 'CC1=CC2Cc3nc4cc(Cl)cccc4c(NCCCNCCCNc4c5c(nc6ccccc46)CCCC5)c3C(C1)C2'. Locally, RDKit parses this SMILES string without this WARNING. What could be causing this, and how can I change the code or check in advance whether an input SMILES string can be used for prediction?

Thanks again for your outstanding work. I appreciate all your help!

adamoyoung commented 1 month ago

Hi CesareWang,

Thanks for your interest and kind words!

I tried parsing the compound that you provided with the version of RDKit used for the project (2021.03.3), and it could not be parsed. Can you confirm that you are using the correct RDKit version?
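
A quick way to confirm the installed version is sketched below. This is a minimal illustration, not part of the massformer code: the helper names `installed_rdkit_version` and `version_matches` are hypothetical, and the only RDKit feature assumed is the `rdkit.__version__` attribute.

```python
# Sketch: compare the locally installed RDKit version with the one
# quoted in this thread (2021.03.3). Helper names are hypothetical.
EXPECTED_RDKIT_VERSION = "2021.03.3"


def installed_rdkit_version():
    """Return the installed RDKit version string, or None if RDKit is missing."""
    try:
        import rdkit  # imported lazily so the helper works without RDKit
    except ImportError:
        return None
    return rdkit.__version__


def version_matches(installed, expected=EXPECTED_RDKIT_VERSION):
    """True only when an installed version exists and equals the expected one."""
    return installed is not None and installed.strip() == expected


found = installed_rdkit_version()
if version_matches(found):
    print("RDKit version matches:", found)
else:
    print("RDKit version mismatch: found", found, "expected", EXPECTED_RDKIT_VERSION)
```

A mismatch does not prove that the version is the cause, but different RDKit releases can disagree on whether a given SMILES string is parsable, so it is worth ruling out first.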

In any case, you could simply modify the code (or your input file) to skip this compound.
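
One way to skip such compounds is to pre-filter the input list before prediction. The sketch below is an assumption about how this could be done, not the project's own preprocessing: `split_parsable` and `rdkit_parse` are hypothetical names, and the only RDKit call used is `Chem.MolFromSmiles`, which returns `None` when a SMILES string cannot be parsed. Run it in the same environment (and RDKit version) that the model uses, since a string that parses elsewhere may fail there.

```python
# Sketch: separate SMILES strings that parse from those that do not.
# Function names are hypothetical; only Chem.MolFromSmiles comes from RDKit.
def rdkit_parse(smiles):
    """Parse with RDKit; returns None when the SMILES string is invalid."""
    from rdkit import Chem  # lazy import, only needed for the default parser
    return Chem.MolFromSmiles(smiles)


def split_parsable(smiles_list, parse=rdkit_parse):
    """Return (good, bad): parsable SMILES, and (index, entry) pairs to skip."""
    good, bad = [], []
    for idx, entry in enumerate(smiles_list):
        mol = None
        # Non-string and empty entries are rejected without calling the parser.
        if isinstance(entry, str) and entry.strip():
            try:
                mol = parse(entry.strip())
            except Exception:
                mol = None  # treat any parser error as "cannot be used"
        if mol is None:
            bad.append((idx, entry))
        else:
            good.append(entry.strip())
    return good, bad


# Example usage (requires RDKit):
#   good, bad = split_parsable(my_smiles)
#   for idx, entry in bad:
#       print("skipping entry", idx, repr(entry))
```

The indices in `bad` are 0-based positions in the input list, which makes it easy to locate the offending rows in the original file.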

CesareWang commented 1 month ago

Thank you very much for your reply and help!

Following your valuable suggestion, I have filtered out the SMILES strings in the input data that cannot be processed properly. The remaining SMILES strings pass through the steps from "mol_from_smiles" to "init_from_smiles" normally. However, when executing "inference", a new issue appeared. What could be the reason for this issue, and how can I resolve the error? [image]

Sincerely thank you for your help!

adamoyoung commented 1 month ago

It seems like the preprocessing failed, since there is an instance of a molecule that is not a SMILES string or an RDKit mol object. Maybe you should take a look at that one?
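
A simple way to locate such an entry is to scan the inputs for anything that is not a string before calling inference. The sketch below uses only the standard library and is an illustration under assumptions: `find_non_molecule_entries` is a hypothetical helper, and it assumes the inputs are held in a Python list. If RDKit mol objects are also expected, their type can be added to `allowed_types`.

```python
# Sketch: report entries that are neither a non-empty string nor another
# allowed type. The helper name and the list-based input are assumptions.
def find_non_molecule_entries(entries, allowed_types=(str,)):
    """Return (index, type name, repr) for every entry that looks invalid."""
    problems = []
    for idx, entry in enumerate(entries):
        if not isinstance(entry, allowed_types):
            # e.g. None, or a float NaN left behind by an empty table cell
            problems.append((idx, type(entry).__name__, repr(entry)))
        elif isinstance(entry, str) and not entry.strip():
            # blank strings cannot be valid SMILES either
            problems.append((idx, "str", repr(entry)))
    return problems


entries = ["CCO", float("nan"), None, "  ", "c1ccccc1"]
for idx, type_name, shown in find_non_molecule_entries(entries):
    print(f"entry {idx}: unexpected {type_name} value {shown}")
```

Missing values are a common source of such entries: a filtered or empty cell in a table often ends up as `None` or a float NaN rather than a string.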