Closed nividic closed 2 years ago
I think there are three issues here:
Apparently, you and @halx have problem cases relating to items 1 and 2. Particularly, you report that without sanitization, sometimes ring perception with RDKit fails. And Hannes reports that sanitization sometimes fails. So one would suggest that we should sanitize, and the other would suggest that we shouldn't. Obviously we have to resolve this somehow.
Can you guys provide minimal examples to reproduce those issues? It seems a key question is are these input bugs or RDKit bugs/limitations? From a "chemical graph" perspective, one should be able to perceive rings without perceiving aromaticity, though possibly RDKit has done something strange here.
Ideally, I would think we'd be doing some (but not that much) sanitization on input user structures, just to minimize dependence of the results on whether they do things like kekulize before input or not, and we should be able to perceive rings in all well-formed cases. I am not that concerned about giving good behavior if users provide garbage structures as input (we should just raise some sensible exception).
But, issue 3 should be entirely separate. Originally, we included aromaticity as a criteria in deciding whether we could break rings when necessary but our subsequent work shows aromaticity is not a significant determinant of accuracy in these cases, so we don't need to keep using it in that way.
So overall my answers are:
I checked all the old lomap test systems. Effectively if you want just ring info we DO NOT need to sanitize the molecule. However if you want the aromaticity info you have to sanitize. I got confused. The implemented code is correct because we were considering the aromaticity
There may still be a case to sanitize sometimes as it would allow us to handle more cases of "weird user input", but we shouldn't do that if sanitization fails. @halx , can you give us an example of something where sanitization fails?
Just to have this noted here. The problem with kekulizing was triggered as follows.
File "/usr/local/lib/python2.7/dist-packages/lomap-0.0.0-py2.7.egg/lomap/graphgen.py", line 840, in draw
mol = AllChem.RemoveHs(self.dbase[id_mol].getMolecule())
ValueError: Sanitization error: Can't kekulize mol
A short Python example to demonstrate this is:
import sys
import rdkit.Chem as rd
mol = rd.MolFromMol2File(sys.argv[1], sanitize=True, removeHs=False)
on 2-methylindole (see below) and is eventually a problem of the BOND section. Omitting it is fine which may suggest that RDKit has problems with how bonds are pre-defined. I suggest to have a look into Chem.SanitizeMol for fine-tuning.
@<TRIPOS>MOLECULE
2-methylindole
19 20 0 0 0
SMALL
GASTEIGER
@<TRIPOS>ATOM
1 C 1.2700 -0.1550 0.1940 C.3 1 LIG1 -0.0244
2 C 2.7030 -0.0190 -0.1620 C.ar 1 LIG1 0.0139
3 N 3.5000 -1.1160 -0.3830 N.ar 1 LIG1 -0.3581
4 C 4.7770 -0.7250 -0.7190 C.ar 1 LIG1 0.0467
5 C 5.9020 -1.4920 -1.0380 C.ar 1 LIG1 -0.0378
6 C 7.0830 -0.8100 -1.3480 C.ar 1 LIG1 -0.0598
7 C 7.1330 0.5820 -1.3340 C.ar 1 LIG1 -0.0611
8 C 5.9980 1.3340 -1.0090 C.ar 1 LIG1 -0.0526
9 C 4.7970 0.6680 -0.6970 C.ar 1 LIG1 -0.0001
10 C 3.4890 1.1000 -0.3410 C.ar 1 LIG1 -0.0337
11 H 0.7990 0.8270 0.3040 H 1 LIG1 0.0292
12 H 1.1570 -0.6950 1.1380 H 1 LIG1 0.0292
13 H 0.7310 -0.7020 -0.5860 H 1 LIG1 0.0292
14 H 3.1970 -2.0780 -0.3210 H 1 LIG1 0.1656
15 H 5.8670 -2.5770 -1.0500 H 1 LIG1 0.0638
16 H 7.9760 -1.3760 -1.6050 H 1 LIG1 0.0618
17 H 8.0610 1.0920 -1.5800 H 1 LIG1 0.0618
18 H 6.0430 2.4190 -1.0010 H 1 LIG1 0.0624
19 H 3.1600 2.1260 -0.2370 H 1 LIG1 0.0641
@<TRIPOS>BOND
1 1 2 1
2 1 11 1
3 1 12 1
4 1 13 1
5 2 10 ar
6 2 3 ar
7 3 4 ar
8 3 14 1
9 4 9 ar
10 4 5 ar
11 5 6 ar
12 5 15 1
13 6 7 ar
14 6 16 1
15 7 8 ar
16 7 17 1
17 8 9 ar
18 8 18 1
19 9 10 ar
20 10 19 1
One more note: I have checked-in a new molecule reader which uses OpenBabel as pre-processor. This seems to help too.
EDIT: The conversion is done through the MDL mol format. Openbabel reads the file format it understands and creates a string in the MDL format from it, RDKIt parses the molecule from the string.
EDIT 2: The MDL format does not seem to have an aromatic bond type but uses '2' instead.
Hmm. @halx - does the MDL format have charges? I'm a bit hesitant to pass through another format as it might result in information loss. (Not that we use that info at present, but sometimes there is an expectation that if you bring certain information in, you should be able to retain it on output...)
Anyway, I think the take-aways here probably are: 1) We should stop perceiving aromaticity since we no longer need it for scoring 2) We can optionally sanitize input molecules to allow a broader range of user inputs to result in well-behaved output, but this should be optional
Thanks.
Charge is a bit of a problem in any case...
The MDL format can store total charge and/or integer atom charges but I do not see that it could store partial charges. The Tripos mol2 format has atom charges and, I believe, the total charge in @
RDKit appears to have problems with kekulization with at least some heterocycles. But even if we are interested in aromaticity (and that should reasonable work with any chemoinfromatics code anyways), why would we need to know the Kekulé structure (which effectively is to "unaromatimize" the molecule)?
http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops.SanitizeFlags-class.html summarises what sanitation actual does.
Yes, those are good points. I think the more I think about it I think we should be assuming that this is being run on molecules which are "already prepped" for molecular simulations, which means perception has already been done, the molecules are clean, and (at least in the AMBER world) partial charges already assigned via something like AM1-BCC or RESP and are in the .mol2 files, as that would I think be the most common use case (i.e. I'm ready to proceed from doing this directly to topology generation without doing more processing on the molecules first).
Does that make sense to you too, @halx ?
Well, in my workflow charge derivation would come after Lomap. FESetup does have some routines to handle this.
I have just checked the mol2 reader from RDKIt and it actually does read @
So an option would be do get either RDKit or OB to enhance their codes. As far as OB goes I'd better do that myself and offer a patch to them (and finally get Noel O'Boyle to implement another patch I have proposed). For RDKit it may be more work to implement a Mol2Supplier class (in analogy to the SDFSupplier). I seem to sense though that mol2 is not a particularly favourite of the chemoinformatics people. Looks like they prefer more the SDF container format (based on the MDL mol format).
So an option would be do get either RDKit or OB to enhance their codes. As far as OB goes I'd better do that myself and offer a patch to them (and finally get Noel O'Boyle to implement another patch I have proposed). For RDKit it may be more work to implement a Mol2Supplier class (in analogy to the SDFSupplier). I seem to sense though that mol2 is not a particularly favourite of the chemoinformatics people. Looks like they prefer more the SDF container format (based on the MDL mol format).
Yes, definitely they ought to support what the formats support with respect to charge. Definitely LOMAP will need to know the net charges of the molecules involved.
I'm not opposed to LOMAP supporting input from SDF as well as MOL2 (I somewhat like the SDF format myself), though as you note SDF does not have partial charges. We've put partial charges in SDF's before, but they have to be stored as custom tags, which is not really a super general way to keep information hanging around in an SDF (i.e. there's a risk that tools might strip out tags they don't understand).
Well, ok, so to try and wrap up this issue, I think we can probably close this if we agree on the conclusions, which I think are:
1) We'll remove aromaticity checking 2) Sanitization should at most be optional 3) We need charges on our molecules, so @halx will make sure OpenBabel and RDKit do read charge information in .mol2 files when provided (?)
Does that cover it?
We need the molecule's total charge but why do we need partial atom charges?
Sorry, I meant total charge, @halx . The prior comment relating to partial charge was really just about what info can go into what file format, and what file format I prefer. :)
As discussed by emails currently the code requires the sanitization of the loaded molecules. This is necessary and used in different part of the current code (mainly in the mcs.py file). So far the code requires information about rings and ring aromaticity (the code is following the original paper and the original code approach). I can show examples where without the proper sanitization RDkit is not able to extract ring info. Therefore code like this for example:
is not always going to work without a proper sanitization. I’m going to double check this to see if we can use some other Rdkit functionality. In addition apparently we would like to skip the ring aromaticity part (to allow transformations flat to non flat rings (please correct me if I’m wrong)) keeping just the extraction of the ring info. This will require many changes in the code because the ring aromaticity was a key feature and need to be expensively discussed here before I’m going to modify this part.