MobleyLab / Lomap

Alchemical mutation scoring map
MIT License
37 stars 17 forks source link

rdkit cannot kekulize mol to remvoe hydrogens #26

Closed shuail closed 2 years ago

shuail commented 7 years ago

Here is an example of the rdkit cannot kekulize the molecule when it try to remove the hydrogen atoms. To simplify the problem, I just copy and paste the molecule and two line of codes for reading in the molecule as an rdkit mol object and remove the hydrogens. The codes are borrowed from the LOMAP code, so if running using the LOMAP code, it will have the same warning message.

The example.mol2 file looks like below:

@MOLECULE example 46 49 0 0 0 SMALL GASTEIGER

@ATOM 1 C -4.5556 -0.2844 1.1718 C.3 1 LIG1 -0.0109 2 C -6.0291 -0.7271 1.2334 C.3 1 LIG1 0.0493 3 C -6.4413 -0.5958 -1.0493 C.3 1 LIG1 0.0493 4 C -5.1977 0.3130 -1.1927 C.3 1 LIG1 -0.0109 5 C 5.5992 -2.5640 -0.8780 C.ar 1 LIG1 -0.0253 6 O -6.3822 -1.4588 0.0764 O.3 1 LIG1 -0.3796 7 C 2.8943 1.6722 0.9911 C.ar 1 LIG1 0.2664 8 C 5.1745 -2.0407 0.3480 C.ar 1 LIG1 0.1371 9 C -1.6179 0.4017 0.1577 C.ar 1 LIG1 0.2173 10 C -4.0573 -0.1702 -0.2838 C.3 1 LIG1 0.0275 11 C 0.8767 -0.2307 1.1489 C.ar 1 LIG1 0.0370 12 C 2.1438 -0.5325 1.6439 C.ar 1 LIG1 -0.0306 13 C 6.1958 -1.7294 -1.8279 C.ar 1 LIG1 -0.0590 14 C 6.3717 -0.3702 -1.5525 C.ar 1 LIG1 -0.0605 15 C 5.9487 0.1564 -0.3282 C.ar 1 LIG1 -0.0452 16 C 0.6358 1.0320 0.5744 C.ar 1 LIG1 0.1483 17 C -0.1716 -1.1537 1.2042 C.ar 1 LIG1 0.0418 18 C 3.1618 0.4153 1.5592 C.ar 1 LIG1 0.0780 19 C 5.3424 -0.6749 0.6231 C.ar 1 LIG1 0.0480 20 C 1.3530 3.2786 -0.1013 C.3 1 LIG1 0.0167 21 F 4.6032 -2.8623 1.2640 F 1 LIG1 -0.2043 22 S 4.7969 0.0115 2.1898 S.3 1 LIG1 -0.0812 23 N -1.3906 -0.8211 0.7091 N.ar 1 LIG1 -0.2222 24 O 3.8206 2.5277 0.9363 O.2 1 LIG1 -0.2664 25 N 1.6412 1.9659 0.5033 N.ar 1 LIG1 -0.2949 26 N -0.6088 1.3106 0.0937 N.ar 1 LIG1 -0.1964 27 N -2.9091 0.7394 -0.3655 N.pl3 1 LIG1 -0.3104 28 H -3.9262 -1.0225 1.7144 H 1 LIG1 0.0305 29 H -4.4544 0.6942 1.6907 H 1 LIG1 0.0305 30 H -6.1785 -1.3738 2.1237 H 1 LIG1 0.0560 31 H -6.6965 0.1565 1.3647 H 1 LIG1 0.0560 32 H -7.3658 0.0220 -1.0063 H 1 LIG1 0.0560 33 H -6.5227 -1.2302 -1.9574 H 1 LIG1 0.0560 34 H -4.8575 0.3261 -2.2513 H 1 LIG1 0.0305 35 H -5.4753 1.3532 -0.9112 H 1 LIG1 0.0305 36 H 5.4676 -3.6168 -1.0922 H 1 LIG1 0.0646 37 H -3.7461 -1.1771 -0.6436 H 1 LIG1 0.0500 38 H 2.3428 -1.4998 2.0895 H 1 LIG1 0.0638 39 H 6.5237 -2.1362 -2.7758 H 1 LIG1 0.0618 40 H 6.8363 0.2748 -2.2870 H 1 LIG1 0.0618 41 H 6.0904 1.2094 -0.1219 H 1 LIG1 0.0630 42 H -0.0243 -2.1352 1.6372 H 1 LIG1 0.0838 43 H 2.2342 3.9528 -0.1073 H 1 LIG1 0.0457 44 H 0.5450 3.7853 0.4685 H 1 LIG1 0.0457 45 H 1.0258 3.1432 -1.1544 H 1 LIG1 0.0457 46 H -3.0166 1.6655 -0.8392 H 1 LIG1 0.1492 @BOND 1 1 2 1 2 1 10 1 3 2 6 1 4 3 4 1 5 3 6 1 6 4 10 1 7 5 8 ar 8 5 13 ar 9 7 18 ar 10 7 24 2 11 7 25 ar 12 8 19 ar 13 8 21 1 14 9 23 ar 15 9 26 ar 16 9 27 1 17 10 27 1 18 11 12 ar 19 11 16 ar 20 11 17 ar 21 12 18 ar 22 13 14 ar 23 14 15 ar 24 15 19 ar 25 16 25 ar 26 16 26 ar 27 17 23 ar 28 18 22 1 29 19 22 1 30 20 25 1 31 1 28 1 32 1 29 1 33 2 30 1 34 2 31 1 35 3 32 1 36 3 33 1 37 4 34 1 38 4 35 1 39 5 36 1 40 10 37 1 41 12 38 1 42 13 39 1 43 14 40 1 44 15 41 1 45 17 42 1 46 20 43 1 47 20 44 1 48 20 45 1 49 27 46 1

And the example.py code looks like

from rdkit.Chem import AllChem from rdkit import Chem

rdkit_mol = Chem.MolFromMol2File("example.mol2", sanitize=False, removeHs=False) mol = AllChem.RemoveHs(rdkit_mol)

If running the example.py, it will return an error as below:

ValueError: Sanitization error: Can't kekulize mol. Unkekulized atoms: 8 10 11 15 16 17 22 24 25

It seems rdkit cannot understand the molecules when it try to remove the hydrogens, probably related to the format of the mol2 file? I use openbabel to convert the mol2 file from an sdf file.

davidlmobley commented 7 years ago

@shuail - I think the key step is to send this to the RDKit guys. Have you done so?

shuail commented 7 years ago

@davidlmobley I posted this issue on rdkit mail list and the answer from Greg is

"The RDKit Mol2 parser is really only validated for the atom types generated by corina. I'm not surprised that the ouput from open babel would not be understood. This is documented: http://rdkit.org/docs/api/rdkit.Chem.rdmolfiles-module.html#MolFromMol2File"

And after, I ask if there is a plan for rdkit to writeout mol2 format directly, and it seems that someone already worked on that code which is almost ready to merge into the main branch of rdkit. (still need some testing). I will keep posted here if the rdkit implement the mol2 file writer. If that's the case, I could generate the mol2 files directly from rdkit using smile string and feed that mol2 file into LOMAP.

davidlmobley commented 7 years ago

@shuail - why don't you just generate mol2 files directly from SMILES via OEChem?

Though, it's not actually clear to me that this would solve the problem generally, as I just checked the OpenEye docs and I don't find any mention of Corina atom types, so it's not at all clear that RDKit could handle mol2 files written by OEChem either. But maybe it is worth checking this for one of your problem cases.

In my opinion RDKit's position is a little crazy and very frustrating. The mol2 format was invented by TRIPOS and the spec indicates, as I understand it, that the atom types are SYBYL atom types. What RDKit is saying is that they're not going to support SYBYL atom types, so can this format even be legitimately called ".mol2" if it insists on using atom types which are not SYBYL types?

I suppose there's another possibility which actually may be better. Is there a mechanism to BUILD molecules IN RDKit rather than reading them? Something like looping over atoms, creating a bunch of atoms, and connecting them by bonds of specified order? If so then one very general purpose approach would be to:

If someone wanted to put molecules into LOMAP in some other way (such as without OEChem) then it would be their job to figure out how to turn them into RDKit molecules.

This would fix the issue since OEChem's mol2 reader is very general/robust and has basically handled anything I've ever thrown at it (and it would also be able to handle sdf or well-formed PDB files as input, broadening the range of input possibilities), and then once we have an OEMol it should be easy (I think) to convert it to an RDKit molecule by looping over the relevant info. (If this appeals, I can show an example of doing something similar -- turnign and OEMol into an OpenMM Topology and back).

shuail commented 7 years ago

@davidlmobley here is link about the rdkit mol2 file writer, https://github.com/rdkit/rdkit/pull/415, it looks the rdkit developer still plan to use the Sybyl atom types and try to do some final step testing comments by UnixJunkie at the end though I don't know how fast it could finish. Yes, I think I could try to use openeye tools to construct the molecule instead of using openbabel. Hope rdkit will like openeye mol2 file format in general. For the option of building molecule inside RDKit using OEMol objects, that sounds interesting but maybe it's time consuming for me to explore how RDKit build topologies from scratch, but I will keep this suggestion in mind. Thx!

UnixJunkie commented 7 years ago

you should use sdf is you want rdkit to eat your molecules

davidlmobley commented 7 years ago

@UnixJunkie - the issue with SDF is that it doesn't carry partial charges, and LOMAP needs partial charges.

I suppose it could be redesigned to avoid this -- @shuail , we could work based on formal charges at the planning stage if needed. Still, it seems silly to have to go to a lot of effort to avoid mol2 format just because RDKit doesn't like it when it is a perfectly good format for the rest of what we want to do (probably the best present-day format for getting small molecules into molecular simulations).

davidlmobley commented 7 years ago

@shuail - it looks to me like it's rather trivial to create molecules and atoms with RDKit; Googling for 30 seconds leads to this: http://www.rdkit.org/Python_Docs/rdkit.Chem.rdchem.EditableMol-class.html

It looks like once you have a molecule, you can easily AddAtom, AddBond, and perform the equivalent removal operations -- e.g. see http://www.rdkit.org/Python_Docs/rdkit.Chem.rdchem.EditableMol-class.html . There's even an example of this usage in the "getting started" info (http://www.rdkit.org/docs/GettingStartedInPython.html):

More complex transformations can be carried out using the RWMol class:

>>> m = Chem.MolFromSmiles('CC(=O)C=CC=C')
>>> mw = Chem.RWMol(m)
>>> mw.ReplaceAtom(4,Chem.Atom(7))
>>> mw.AddAtom(Chem.Atom(6))
7
>>> mw.AddAtom(Chem.Atom(6))
8
>>> mw.AddBond(6,7,Chem.BondType.SINGLE)
7
>>> mw.AddBond(7,8,Chem.BondType.DOUBLE)
8
>>> mw.AddBond(8,3,Chem.BondType.SINGLE)
9
>>> mw.RemoveAtom(0)
>>> mw.GetNumAtoms()
8

So, it looks like you should be able to generate an RDKit molecule from another source in a very similar way to how we generate an OpenMM topology here (https://github.com/open-forcefield-group/openforcefield/blob/master/openforcefield/utils/utils.py#L157) from an OEMol. Instead you could go from OEMol to RDKit molecule.

UnixJunkie commented 7 years ago

I also want mol2 support in rdkit ...

On 19 Aug 2017, at 01:24, David L. Mobley notifications@github.com wrote:

@shuail - it looks to me like it's rather trivial to create molecules and atoms with RDKit; Googling for 30 seconds leads to this: http://www.rdkit.org/Python_Docs/rdkit.Chem.rdchem.EditableMol-class.html

It looks like once you have a molecule, you can easily AddAtom, AddBond, and perform the equivalent removal operations -- e.g. see http://www.rdkit.org/Python_Docs/rdkit.Chem.rdchem.EditableMol-class.html . There's even an example of this usage in the "getting started" info (http://www.rdkit.org/docs/GettingStartedInPython.html):

More complex transformations can be carried out using the RWMol class:

m = Chem.MolFromSmiles('CC(=O)C=CC=C') mw = Chem.RWMol(m) mw.ReplaceAtom(4,Chem.Atom(7)) mw.AddAtom(Chem.Atom(6)) 7 mw.AddAtom(Chem.Atom(6)) 8 mw.AddBond(6,7,Chem.BondType.SINGLE) 7 mw.AddBond(7,8,Chem.BondType.DOUBLE) 8 mw.AddBond(8,3,Chem.BondType.SINGLE) 9 mw.RemoveAtom(0) mw.GetNumAtoms() 8 ― You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

shuail commented 7 years ago

@davidlmobley I just installed the OEChem and test all my existing systems, the mol2 files generated by OEChem work perfectly with the rdkit, it will not report the "cannot kekulize errors". I will then continue to work on the more general approach which use the rdkit constructor to create the rdmol object to avoid using the rdkit mol2 file reader.

davidlmobley commented 7 years ago

Ah, that's great that you have an interim fix, then. :)

maxjump commented 5 years ago

I have also been running into this error message when using mol2 files generated with openbabel 2.3.2. However I find it somewhat irritating that the "kekulization" works fine for most of the mol2 files generated this way. Do you have an idea whether this error is caused by certain atom names from the Sybyl atom types?

davidlmobley commented 5 years ago

I'd guess it's that RDKit strictly only supports Corina atom types when using its mol2 reader (which is rather limited) and many other tools like Sybyl atom types. So you probably have atom types RDKit can't interpret coming in on occasion.

A better option may be to shift (in a few months) to SDF format; RDKit is shifting to an SDF format which supports partial charges via a standard tag that OEChem will also support, and it has better SDF support than mol2 support.