Closed bannanc closed 7 years ago
Oh dear.
I asked @bannanc to post this. I've previously filtered for duplicates, so I'll have to dig in and figure out what's different about my prior duplicate filtering versus her duplicate filtering. Though, er, it does look somewhat like it could be an issue of chirality for three of these, since the compound names clearly indicate chirality not reflected in the SMILES Caitlin has. It's not clear WHY this would be, though, as she does appear to be using isomeric SMILES strings. But, well, (2R)-1,1,1-trifluoropropan-2-ol vs (2S)-1,1,1-trifluoropropan-2-ol...
But 2-acetoxyethyl acetate vs 2-acetoxyethyl acetate?
Sounds like an erratum may be in our future.
Aren't there ways to indicate the chirality in the SMILES? I thought isomeric SMILES were supposed to include chirality. Is there a different OE function I should use to get the SMILES?
I thought it was just OEMolToSmiles.
How do we prevent these issues from happening in the future? What can we do to refine the process to avoid these kinds of mistakes?
Ok, if I use OEMolToSmiles the only duplicate I see is the 2-acetoxyethyl acetate
@jchodera :
How do we prevent these issues from happening in the future? What can we do to refine the process to avoid these kinds of mistakes?
These are all results of the original "data archeology" process we're not able to curate. Specifically, for every single data point here, at some point in the past, some human (not necessarily us -- many of these come from Rizzo's compilations or even earlier compilations) took a structure in a table in a paper and did SOMETHING to it to get a name and a molecular structure which ended up in a mol2 file. Because this was a time consuming, human-intensive process, it was error prone (names and structures not matching, duplicate molecules, names and structures being consistent but not matching the intended molecule, etc.). I've been able to detect and remove many of the errors over the years through the various curation steps I did, but it seems like each new time I/we come up with a slightly different way of processing the whole thing we come up with one or two new issues. I'm quite confident that there is NO way of making sure the whole thing is perfect. (Even if you got a magical robot which could redo all of the experiments in an automated way, generate all of the experimental data from scratch, and re-create all of the structures/IUPAC names/SMILES all in one go, you'd STILL have the problem that some of the compound vendors will have sent you the wrong compounds, etc.)
One could in principle go back to the original literature and pull all of the data again and cross-check against what we have here, but that would be equally time-consuming, human-intensive, and error-prone, not to mention the fact that some of what is here actually represents CORRECTIONS to the literature (finding mistakes in literature tables, etc.).
The idea of an erratum reminds me of one mistake I made in the latest FreeSolv update paper. In the PREVIOUS paper, I had planned that I would not issue errata unless the corrections would significantly affect our conclusions, so I indicated clearly that all further updates to the database would be made on the FreeSolv repo itself. I forgot to do that in this paper, so we may need to do an erratum that (a) adds any corrections resulting from this issue, and (b) makes clear that all further updates will be made on the GitHub repo rather than via erratum.
(Errata are a terrible place for corrections to databases since one potentially might need to make many such corrections, such as if new experimental values become available or existing ones are better curated.)
@bannanc :
I'd always used code more like yours:
oechem.OECreateIsoSmiString(mol)
So I'm curious to understand the difference between these.
@davidlmobley That was my impression as well. I had an issue with smirky, where I use a molecule's SMILES string as a dictionary key. Using OEMolToSmiles didn't work: it would regenerate a SMILES string for a molecule and then wouldn't be able to find it in the dictionary. If I used OECreateIsoSmiString, it always created the same SMILES string. However it looks like OECreateIsoSmiString doesn't include the characters to indicate chirality/isomers.
However it looks like OECreateIsoSmiString doesn't include the characters to indicate chirality/isomers.
Hmm, that seems very odd, as I've used it for this many times in the past. I'm thinking there's something specific in how you're using it here (perhaps what processing you have or have not done on the molecule first) that is making it not provide this info. I'll have to dig in.
OK, so to update on this:
OEMolToSmiles is what we want; I'm checking with Support on why OECreateIsoSmiString sometimes leads to different behavior (the docs leave it unclear and suggest BOTH for generating canonical isomeric SMILES).
While having duplicates is bad, this is about as benign a duplicate as could possibly happen, in that the experimental value reported in both cases was identical, and the calculated values are within uncertainty of one another, so the overall effect is minor.
For the record, this is info from James Haigh at OpenEye support:
It looks like we need to update the glossary part of the documentation to use OEMolToSmiles rather than OECreateIsoSmiString. OECreateIsoSmiString absolutely creates a canonical isomeric SMILES but only of the exact molecule that is present. OEMolToSmiles performs several perception calls on the molecule to ensure more consistency in the SMILES output.
Basically if you are reading molecules from different input sources they may be perceived in multiple different ways depending on the input file format or the method used to read. There are also multiple aromaticity models. OEMolToSmiles does perception to ensure consistency e.g. applying the OpenEye aromaticity model, perceiving stereochemistry etc.
A simple case is the Kekulé form of benzene. If I read that using OEParseSmiles and then generate a SMILES using OECreateIsoSmiString, I get C1=CC=CC=C1 out. But with either OEReadMolecule to read it, or OEMolToSmiles to generate a SMILES, I get c1ccccc1. See attached example.
Please let me know if you have any questions.
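To make the point above concrete, here is a minimal sketch of why duplicate detection depends on normalizing before keying. It is not the snippet used in this thread. The entry IDs and the `find_duplicates` and `toy_canonicalize` helpers are hypothetical, and `toy_canonicalize` is only a stand-in that knows the benzene case quoted above; in real use one would presumably build the key with something like OEMolToSmiles on a molecule object.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def find_duplicates(
    entries: Iterable[Tuple[str, str]],
    canonicalize: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group entry IDs by canonical key and keep keys shared by 2+ entries."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry_id, smiles in entries:
        groups[canonicalize(smiles)].append(entry_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# ASSUMPTION: toy lookup table standing in for a real canonicalizer.
# It only maps the Kekule benzene string to the aromatic form.
_TOY_CANONICAL = {"C1=CC=CC=C1": "c1ccccc1"}


def toy_canonicalize(smiles: str) -> str:
    """Return the normalized string if known, else the input unchanged."""
    return _TOY_CANONICAL.get(smiles, smiles)


# Hypothetical entries: the same molecule written two ways, plus ethanol.
entries = [
    ("entry_a", "C1=CC=CC=C1"),  # Kekule benzene
    ("entry_b", "c1ccccc1"),     # aromatic benzene
    ("entry_c", "CCO"),          # ethanol
]

# Keying on the raw strings misses the duplicate.
raw_duplicates = find_duplicates(entries, lambda s: s)

# Keying on the normalized strings finds it.
normalized_duplicates = find_duplicates(entries, toy_canonicalize)

print(raw_duplicates)
print(normalized_duplicates)
```

The same reasoning applies when SMILES strings are used as dictionary keys: lookups are only reliable if every key is generated through the same perception and canonicalization path.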
While typing FreeSolv molecules with smirnoff99Frosst, I found 4 molecules that are potentially duplicated in the FreeSolv set. Below is the code snippet I used that found the duplicates: