deepmodeling / Uni-Mol

Official Repository for the Uni-Mol Series Methods
MIT License

Unrecognized atom type #255

Open CLG68 opened 1 month ago

CLG68 commented 1 month ago

Hi,

With some molecules, Uni-Mol Docking V2 prints the following:

/media/christian/VS1/VS/Results_Unimol/MC4R_protein/Poses/Sublibrary_05/CHEMBL-3740791-1.sdf-Cc1ccnc(N(CCC(=O)[O-])C(=O)c2ccc3c(c2)nc(CNc2ccc(C(N)=[NH2+])cc2F)n3C)c1-RMSD:173.775
[02:07:56] UFFTYPER: Unrecognized atom type: S_6+6 (0)
/media/christian/VS1/VS/Results_Unimol/MC4R_protein/Poses/Sublibrary_05/Enamine-Z3019139935-2.sdf-Cc1cc(N2CCC(O)(C[NH+]3CCOCC3)CC2)nc(N(C)c2ccccc2)[nH+]1-RMSD:171.117
3%|█▎ | 63/1959 [01:50<50:00, 1.58s/it][02:07:56] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:57] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:58] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:58] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:58] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:59] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:59] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:59] UFFTYPER: Unrecognized atom type: S_6+6 (0)
/media/christian/VS1/VS/Results_Unimol/MC4R_protein/Poses/Sublibrary_05/ChemDiv-V014-0652-1.sdf-CC(C)CCN(CC(=O)Nc1cc(C(C)(C)C)nn1-c1ccc(Cl)cc1)C(=O)C(C)(C)CCl-RMSD:173.7905
[02:07:59] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:07:59] UFFTYPER: Unrecognized atom type: S_6+6 (0)
3%|█▎ | 64/1959 [01:54<1:02:15, 1.97s/it][02:08:00] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:08:00] UFFTYPER: Unrecognized atom type: S_6+6 (0)
[02:08:00] UFFTYPER: Unrecognized atom type: S_6+6 (0)

This happens even with the latest version of RDKit.

CLG68 commented 1 month ago

Maybe it is related to this: https://github.com/rdkit/rdkit/issues/6365. However, I'm currently using the latest RDKit, so it should have been fixed.
I also get: UFFTYPER: Unrecognized atom type: S_5+6
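
To see which atom types are affected and how often, the warnings can be tallied from a saved log. The following is a minimal sketch, assuming the console output was redirected to a text file. The function name and the regular expression are illustrative and not part of Uni-Mol or RDKit; the number in parentheses is presumably the atom index and is matched but not used here.

```python
import re
from collections import Counter

# Matches lines such as "[02:07:56] UFFTYPER: Unrecognized atom type: S_6+6 (0)".
# The pattern is searched anywhere in the text, so a progress bar printed on
# the same line does not get in the way.
UFF_WARNING = re.compile(r"UFFTYPER: Unrecognized atom type: (\S+) \((\d+)\)")


def count_unrecognized_types(log_text):
    """Return a Counter mapping each unrecognized atom type to its frequency."""
    return Counter(match.group(1) for match in UFF_WARNING.finditer(log_text))


if __name__ == "__main__":
    # "docking.log" is a hypothetical file name for the redirected output.
    with open("docking.log", encoding="utf-8") as handle:
        for atom_type, count in count_unrecognized_types(handle.read()).most_common():
            print(atom_type, count)
```

A tally like this only shows which atom types occur; it does not tell which input molecule produced each warning, since the warning lines carry no file name.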

CLG68 commented 1 month ago

I screened 100,000 structures from a focussed library (from a Panther/ShaEP VS) with Uni-Mol Docking V2. I had a hard time rescoring the results because 650 poses either had "nan" coordinates or were outside the binding pocket, so I wrote a script to clean the results before rescoring. Could this be caused by the problem I reported (UFFTYPER: Unrecognized atom type: S_5+6)? Do you know how to fix it?

Thanks, Christian
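
A cleaning step of the kind described in this comment could be sketched as follows. This is not the script used in this thread. It assumes the pose coordinates have already been read into a list of (x, y, z) tuples, and that the docking grid JSON uses `center_x`/`center_y`/`center_z` and `size_x`/`size_y`/`size_z` keys, which should be checked against the actual file. The function names and the centroid-in-box criterion are illustrative choices.

```python
import json
import math


def load_box(grid_json_text):
    """Parse a docking grid; the key names here are an assumption."""
    grid = json.loads(grid_json_text)
    center = tuple(grid[f"center_{axis}"] for axis in "xyz")
    size = tuple(grid[f"size_{axis}"] for axis in "xyz")
    return center, size


def classify_pose(coords, center, size):
    """Label a pose as 'invalid', 'outside', or 'ok'.

    'invalid' means no atoms or at least one non-finite coordinate (nan/inf).
    'outside' means the centroid of the atoms lies outside the box.
    """
    coords = list(coords)
    if not coords or any(not math.isfinite(v) for xyz in coords for v in xyz):
        return "invalid"
    centroid = [sum(xyz[i] for xyz in coords) / len(coords) for i in range(3)]
    inside = all(abs(centroid[i] - center[i]) <= size[i] / 2 for i in range(3))
    return "ok" if inside else "outside"
```

Testing the centroid rather than every atom is a deliberately loose criterion: it flags poses placed far from the pocket without rejecting ligands that merely extend slightly past the box edge.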

ZhouGengmo commented 1 month ago

[02:07:59] UFFTYPER: Unrecognized atom type: S_6+6 (0)

It looks like there is an issue with RDKit when loading the file. Could you provide a file that produces this error? We can test it further.

CLG68 commented 1 month ago

Thank you very much for helping with this. I have attached the target, the JSON file, the reference ligand used to generate the JSON file, and examples of structures that give me errors or problematic results. The source structures are extracted from my library; the generated poses are from Uni-Mol Docking V2. The structures with the valence problem do not produce a binding pose. I had to write a script to clean the docking results because poses with no coordinates or outside the binding pocket were causing problems with scoring during training with Brutenib... ShaEP just kept running forever.

The library consists of the top 1% of scores from a Panther/ShaEP VS. My cleaning script flagged 670 poses out of around 100k, not counting the poses that were never generated because of the valence problem.

For RDKit, I tried the version suggested in your README as well as the latest version. Updating to the latest version did not solve the problem.

Best, Christian

Attachment: Unimol-Docking-V2_clg68.zip

CLG68 commented 3 weeks ago

Hi, was the zip archive OK, or would individual files be better? Thank you, Christian

ZhouGengmo commented 2 weeks ago

Sorry for the delayed response.

It seems the RDKit bug described in the linked issue still exists. I am using a nearly up-to-date version (2024.3.1, installed via pip), but when I run the example code from the issue:

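from rdkit import Chem
from rdkit.Chem import AllChem

# Add explicit hydrogens, then generate a 3D conformer
mol = Chem.MolFromSmiles("S(F)(F)(F)(F)F")
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=42)
conf = mol.GetConformer()
print(conf)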

The output is:

<rdkit.Chem.rdchem.Conformer object at 0x7fc17b931b60>
[09:45:04] UFFTYPER: Unrecognized atom type: S_6+6 (0)

I also ran the example file you provided. The command I used is as follows:

python demo.py --mode single --conf-size 10 --cluster \
    --input-protein Unimol-Docking-V2_clg68/MC4R_protein.pdb \
    --input-ligand Unimol-Docking-V2_clg68/MC4R_ref-ligand.sdf \
    --input-docking-grid Unimol-Docking-V2_clg68/docking_grid.json \
    --output-ligand-name ligand_predict \
    --output-ligand-dir predict_sdf \
    --steric-clash-fix \
    --model-dir unimol_docking_v2_240517.pt

There was no Unrecognized atom type: S_6+6 (0) error, and the script ran as expected. Part of the output message is:

[09:55:28] Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D.
predict_sdf/ligand_predict.sdf-Cn1nnc(CC2(C3CCCCC3)CCN(C(=O)C(Cc3ccc(Cl)cc3)NC(=O)C3Cc4ccccc4CN3)CC2)n1-RMSD:4.5583

CLG68 commented 2 weeks ago

Thank you very much for running some tests with my files. Many docking poses are missing or rejected from the screen because of the "Unrecognized atom type" error, poses without coordinates, and molecules docked outside the binding pocket, so I'm really interested in resolving this problem. I'll try RDKit 2024.3.1 and investigate the "is tagged as 2D" message. Hopefully that will solve the "Unrecognized atom type: S_6+6 (0)" problem.

Best, Christian