Molecule RDkit sanitization

nividic commented 8 years ago

As discussed by emails currently the code requires the sanitization of the loaded molecules. This is necessary and used in different part of the current code (mainly in the mcs.py file). So far the code requires information about rings and ring aromaticity (the code is following the original paper and the original code approach). I can show examples where without the proper sanitization RDkit is not able to extract ring info. Therefore code like this for example:

ring_info = mol.GetRingInfo()
rings = ring_info.AtomRings()

for ring in rings:
 if idx in ring:
   found = True
   break

is not always going to work without a proper sanitization. I’m going to double check this to see if we can use some other Rdkit functionality. In addition apparently we would like to skip the ring aromaticity part (to allow transformations flat to non flat rings (please correct me if I’m wrong)) keeping just the extraction of the ring info. This will require many changes in the code because the ring aromaticity was a key feature and need to be expensively discussed here before I’m going to modify this part.

davidlmobley commented 8 years ago

I think there are three issues here:

How to perceive rings
Whether and how to sanitize molecules (and if this is necessary for item 1)
Whether we need to do aromaticity checking to decide what transformations to do

Apparently, you and @halx have problem cases relating to items 1 and 2. Particularly, you report that without sanitization, sometimes ring perception with RDKit fails. And Hannes reports that sanitization sometimes fails. So one would suggest that we should sanitize, and the other would suggest that we shouldn't. Obviously we have to resolve this somehow.

Can you guys provide minimal examples to reproduce those issues? It seems a key question is are these input bugs or RDKit bugs/limitations? From a "chemical graph" perspective, one should be able to perceive rings without perceiving aromaticity, though possibly RDKit has done something strange here.

Ideally, I would think we'd be doing some (but not that much) sanitization on input user structures, just to minimize dependence of the results on whether they do things like kekulize before input or not, and we should be able to perceive rings in all well-formed cases. I am not that concerned about giving good behavior if users provide garbage structures as input (we should just raise some sensible exception).

But, issue 3 should be entirely separate. Originally, we included aromaticity as a criteria in deciding whether we could break rings when necessary but our subsequent work shows aromaticity is not a significant determinant of accuracy in these cases, so we don't need to keep using it in that way.

So overall my answers are:

We need to always be able to perceive rings, so if RDKit is broken we need to get it fixed or perceive in a different way.
We should probably do SOME sanitization just to ensure we don't have substantial dependence on the precise form of the input structures; if there are well-formed cases RDKit can't sanitize (i.e. RDKit bugs) then we either (a) need to get RDKit to fix them (likely by providing examples of what is broken), or (b) implement some sort of careful checking of the input structures to ensure we tell the user when they need to sanitize better first.

nividic commented 8 years ago

I checked all the old lomap test systems. Effectively if you want just ring info we DO NOT need to sanitize the molecule. However if you want the aromaticity info you have to sanitize. I got confused. The implemented code is correct because we were considering the aromaticity

davidlmobley commented 8 years ago

There may still be a case to sanitize sometimes as it would allow us to handle more cases of "weird user input", but we shouldn't do that if sanitization fails. @halx , can you give us an example of something where sanitization fails?

halx commented 8 years ago

Just to have this noted here. The problem with kekulizing was triggered as follows.

  File "/usr/local/lib/python2.7/dist-packages/lomap-0.0.0-py2.7.egg/lomap/graphgen.py", line 840, in draw
    mol = AllChem.RemoveHs(self.dbase[id_mol].getMolecule())
ValueError: Sanitization error: Can't kekulize mol

A short Python example to demonstrate this is:

import sys
import rdkit.Chem as rd
mol = rd.MolFromMol2File(sys.argv[1], sanitize=True, removeHs=False)

on 2-methylindole (see below) and is eventually a problem of the BOND section. Omitting it is fine which may suggest that RDKit has problems with how bonds are pre-defined. I suggest to have a look into Chem.SanitizeMol for fine-tuning.

@<TRIPOS>MOLECULE
2-methylindole
 19 20 0 0 0
SMALL
GASTEIGER

@<TRIPOS>ATOM
      1  C          1.2700   -0.1550    0.1940 C.3     1  LIG1       -0.0244
      2  C          2.7030   -0.0190   -0.1620 C.ar    1  LIG1        0.0139
      3  N          3.5000   -1.1160   -0.3830 N.ar    1  LIG1       -0.3581
      4  C          4.7770   -0.7250   -0.7190 C.ar    1  LIG1        0.0467
      5  C          5.9020   -1.4920   -1.0380 C.ar    1  LIG1       -0.0378
      6  C          7.0830   -0.8100   -1.3480 C.ar    1  LIG1       -0.0598
      7  C          7.1330    0.5820   -1.3340 C.ar    1  LIG1       -0.0611
      8  C          5.9980    1.3340   -1.0090 C.ar    1  LIG1       -0.0526
      9  C          4.7970    0.6680   -0.6970 C.ar    1  LIG1       -0.0001
     10  C          3.4890    1.1000   -0.3410 C.ar    1  LIG1       -0.0337
     11  H          0.7990    0.8270    0.3040 H       1  LIG1        0.0292
     12  H          1.1570   -0.6950    1.1380 H       1  LIG1        0.0292
     13  H          0.7310   -0.7020   -0.5860 H       1  LIG1        0.0292
     14  H          3.1970   -2.0780   -0.3210 H       1  LIG1        0.1656
     15  H          5.8670   -2.5770   -1.0500 H       1  LIG1        0.0638
     16  H          7.9760   -1.3760   -1.6050 H       1  LIG1        0.0618
     17  H          8.0610    1.0920   -1.5800 H       1  LIG1        0.0618
     18  H          6.0430    2.4190   -1.0010 H       1  LIG1        0.0624
     19  H          3.1600    2.1260   -0.2370 H       1  LIG1        0.0641
@<TRIPOS>BOND
     1     1     2    1
     2     1    11    1
     3     1    12    1
     4     1    13    1
     5     2    10   ar
     6     2     3   ar
     7     3     4   ar
     8     3    14    1
     9     4     9   ar
    10     4     5   ar
    11     5     6   ar
    12     5    15    1
    13     6     7   ar
    14     6    16    1
    15     7     8   ar
    16     7    17    1
    17     8     9   ar
    18     8    18    1
    19     9    10   ar
    20    10    19    1

halx commented 8 years ago

One more note: I have checked-in a new molecule reader which uses OpenBabel as pre-processor. This seems to help too.

EDIT: The conversion is done through the MDL mol format. Openbabel reads the file format it understands and creates a string in the MDL format from it, RDKIt parses the molecule from the string.

EDIT 2: The MDL format does not seem to have an aromatic bond type but uses '2' instead.

davidlmobley commented 8 years ago

Hmm. @halx - does the MDL format have charges? I'm a bit hesitant to pass through another format as it might result in information loss. (Not that we use that info at present, but sometimes there is an expectation that if you bring certain information in, you should be able to retain it on output...)

Anyway, I think the take-aways here probably are: 1) We should stop perceiving aromaticity since we no longer need it for scoring 2) We can optionally sanitize input molecules to allow a broader range of user inputs to result in well-behaved output, but this should be optional

Thanks.

halx commented 8 years ago

Charge is a bit of a problem in any case...

The MDL format can store total charge and/or integer atom charges but I do not see that it could store partial charges. The Tripos mol2 format has atom charges and, I believe, the total charge in @UNITY_ATOM_ATTR. But none of the programs I know seem to read the latter. (Do you happen to have Sybyl or know someone who does to find out how a mol2 looks like for a charged molecule?). A mol2 file can also come with no charges so relying on the partial charges alone is not reliable either unless you rely on the charge derivation Openbabel or RDKit have. This of course only works when perception has worked... (chicken or egg?)

RDKit appears to have problems with kekulization with at least some heterocycles. But even if we are interested in aromaticity (and that should reasonable work with any chemoinfromatics code anyways), why would we need to know the Kekulé structure (which effectively is to "unaromatimize" the molecule)?

http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops.SanitizeFlags-class.html summarises what sanitation actual does.

davidlmobley commented 8 years ago

Yes, those are good points. I think the more I think about it I think we should be assuming that this is being run on molecules which are "already prepped" for molecular simulations, which means perception has already been done, the molecules are clean, and (at least in the AMBER world) partial charges already assigned via something like AM1-BCC or RESP and are in the .mol2 files, as that would I think be the most common use case (i.e. I'm ready to proceed from doing this directly to topology generation without doing more processing on the molecules first).

Does that make sense to you too, @halx ?

halx commented 8 years ago

Well, in my workflow charge derivation would come after Lomap. FESetup does have some routines to handle this.

I have just checked the mol2 reader from RDKIt and it actually does read @UNITY_ATOM_ATTR. RDKit may actually rely on it because the problem with pyrrole, for instance, is that RDKit thinks that N carries a positive charge when all heavy atom bonds are tagged as aromatic in the mol2 file and it has no further information about the total charge. That may be the reason why kekulizing doesn't work. The reason why I chose the Openbabel mol2 reader is because OB can read all molecules. RDKit is a bit weak on mol2 reading it seems but actually does read atom charges what OB does not do.

So an option would be do get either RDKit or OB to enhance their codes. As far as OB goes I'd better do that myself and offer a patch to them (and finally get Noel O'Boyle to implement another patch I have proposed). For RDKit it may be more work to implement a Mol2Supplier class (in analogy to the SDFSupplier). I seem to sense though that mol2 is not a particularly favourite of the chemoinformatics people. Looks like they prefer more the SDF container format (based on the MDL mol format).

davidlmobley commented 8 years ago

So an option would be do get either RDKit or OB to enhance their codes. As far as OB goes I'd better do that myself and offer a patch to them (and finally get Noel O'Boyle to implement another patch I have proposed). For RDKit it may be more work to implement a Mol2Supplier class (in analogy to the SDFSupplier). I seem to sense though that mol2 is not a particularly favourite of the chemoinformatics people. Looks like they prefer more the SDF container format (based on the MDL mol format).

Yes, definitely they ought to support what the formats support with respect to charge. Definitely LOMAP will need to know the net charges of the molecules involved.

I'm not opposed to LOMAP supporting input from SDF as well as MOL2 (I somewhat like the SDF format myself), though as you note SDF does not have partial charges. We've put partial charges in SDF's before, but they have to be stored as custom tags, which is not really a super general way to keep information hanging around in an SDF (i.e. there's a risk that tools might strip out tags they don't understand).

davidlmobley commented 8 years ago

Well, ok, so to try and wrap up this issue, I think we can probably close this if we agree on the conclusions, which I think are:

1) We'll remove aromaticity checking 2) Sanitization should at most be optional 3) We need charges on our molecules, so @halx will make sure OpenBabel and RDKit do read charge information in .mol2 files when provided (?)

Does that cover it?

halx commented 8 years ago

We need the molecule's total charge but why do we need partial atom charges?

davidlmobley commented 8 years ago

Sorry, I meant total charge, @halx . The prior comment relating to partial charge was really just about what info can go into what file format, and what file format I prefer. :)

MobleyLab / Lomap

Molecule RDkit sanitization #4