Potential duplicate molecules in FreeSolv Set

bannanc commented 7 years ago

While typing FreeSolv molecules with smirnoff99Frosst, I found 4 molecules that are potentially duplicated in the FreeSolv set. Below is the code snippet I used that found the duplicates:

import glob
from openforcefield.utils import read_molecules
from openeye import oechem

# untarred mol2files_sybyl.tar.gz
DBpath = "/FreeSolv/mol2files_sybyl/*.mol2"
for file in glob.glob(DBpath):
    mol = read_molecules(file, verbose = False)[0]
    f = file.split('/')[-1]
    c_mol = oechem.OEMol(mol)
    oechem.OEAddExplicitHydrogens(c_mol)
        smi = oechem.OECreateIsoSmiString(mol)
    f = file.split('/')[-1]
    if smi in isosmiles_to_mol:
        print("File:   %35s %35s" % (f, smi_to_file[smi]))
        print("Title:  %35s %35s" % (c_mol.GetTitle(), isosmiles_to_mol[smi].GetTitle()))
        print("SMILES: %35s %35s" % (smi, oechem.OECreateIsoSmiString(isosmiles_to_mol[smi])))
        print('\n')

    isosmiles_to_mol[smi] = c_mol
    smi_to_file[smi] = f

# OUTPUT: 

#File:                   mobley_4689084.mol2                  mobley_352111.mol2
#Title:               2-acetoxyethyl acetate              2-acetoxyethyl acetate
#SMILES:                    CC(=O)OCCOC(=O)C                    CC(=O)OCCOC(=O)C
#
#
#File:                   mobley_9897248.mol2                  mobley_819018.mol2
#Title:  (2Z)-3,7-dimethylocta-2,6-dien-1-ol (2E)-3,7-dimethylocta-2,6-dien-1-ol
#SMILES:                   CC(=CCCC(=CCO)C)C                   CC(=CCCC(=CCO)C)C
#
#
#File:                   mobley_9913368.mol2                 mobley_4465023.mol2
#Title:             (E)-1,2-dichloroethylene            (Z)-1,2-dichloroethylene
#SMILES:                           C(=CCl)Cl                           C(=CCl)Cl
#
#
#File:                   mobley_9979854.mol2                  mobley_628086.mol2
#Title:      (2R)-1,1,1-trifluoropropan-2-ol     (2S)-1,1,1-trifluoropropan-2-ol
#SMILES:                       CC(C(F)(F)F)O                       CC(C(F)(F)F)O

jchodera commented 7 years ago

Oh dear.

davidlmobley commented 7 years ago

I asked @bannanc to post this. I've previously filtered for duplicates, so I'll have to dig in and figure out what's different about my prior duplicate filtering versus her duplicate filtering. Though, er, it does look somewhat like it could be an issue of chirality for three of these since the compound names clearly indicate chirality not reflected in the SMILES Caitlin has. It's not clear WHY this would be, though, as she does appear to be using isomeric SMILES strings. But, well, (2R)-1,1,1-trifluoropropan-2-ol vs (2S)-1,1,1-trifluoropropan-2-ol...

jchodera commented 7 years ago

But 2-acetoxyethyl acetate vs 2-acetoxyethyl acetate?

Sounds like an erratum may be in our future.

bannanc commented 7 years ago

Aren't there ways to indicate the chirality in the SMILES? I thought isomeric SMILES were supposed to include chirality, is there a different OE function I should use to get the SMILES?

jchodera commented 7 years ago

I thought it was just OEMolToSmiles.

How do we prevent these issues from happening in the future? What can we do to refine the process to avoid these kinds of mistakes?

bannanc commented 7 years ago

Ok, if I use OEMolToSmiles the only duplicate I see is the 2-acetoxyethyl acetate

davidlmobley commented 7 years ago

@jchodera :

How do we prevent these issues from happening in the future? What can we do to refine the process to avoid these kinds of mistakes?

These are all results of the original "data archeology" process we're not able to curate. Specifically, for every single data point here, at some point in the past, some human (not necessarily us -- many of these come from Rizzo's compilations or even earlier compilations) took a structure in a table in a paper and did SOMETHING to it to get a name and a molecular structure which ended up in a mol2 file. Because this was a time consuming, human-intensive process, it was error prone (names and structures not matching, duplicate molecules, names and structures being consistent but not matching the intended molecule, etc.). I've been able to detect and remove many of the errors over the years through the various curation steps I did, but it seems like each new time I/we come up with a slightly different way of processing the whole thing we come up with one or two new issues. I'm quite confident that there is NO way of making sure the whole thing is perfect. (Even if you got a magical robot which could redo all of the experiments in an automated way, generate all of the experimental data from scratch, and re-create all of the structures/IUPAC names/SMILES all in one go, you'd STILL have the problem that some of the compound vendors will have sent you the wrong compounds, etc.)

One could in principle go back to the original literature and pull all of the data again and cross-check against what we have here, but that would be equally time-consuming, human-intensive, and error-prone, not to mention the fact that some of what is here actually represents CORRECTIONS to the literature (finding mistakes in literature tables, etc.).

The idea of an erratum reminds me of one mistake I made in the latest FreeSolv update paper. In the PREVIOUS paper, I had planned that I would not do erratums unless they would significantly affect our conclusions, so I indicated clearly that all further updates to the database would be made on the FreeSolv repo itself. I forgot to do that in this paper, so we may need to do an erratum that (a) adds any corrections resulting from this issue, and (b) makes clear that all further updates will be made on the GitHub repo rather than via erratum.

(Errata are a terrible place for corrections to databases since one potentially might need to make many such corrections, such as if new experimental values become available or existing ones are better curated.)

davidlmobley commented 7 years ago

@bannanc :

I'd always used code more like yours:

oechem.OECreateIsoSmiString(mol)

So I'm curious to understand the differnece between these.

bannanc commented 7 years ago

@davidlmobley That was the impression as well. I had an issue with smirky where I use a molecule's SMILES string as a dictionary key. When using OEMolToSmiles didn't work, it would regenerate a SMILES string for a molecule and wouldn't be able to find it in the dictionary, but if I used OECreateIsoSmiString it always creates the same SMILES string. However it looks like OECreateIsoSmiString doesn't include the characters to indicate chirality/isomers.

davidlmobley commented 7 years ago

However it looks like OECreateIsoSmiString doesn't include the characters to indicate chirality/isomers.

Hmm, that seems very odd, as I've used it for this many times in the past. I'm thinking there's something specific in how you're using it here (perhaps what processing you have or have not done on the molecule first) that is making it not provide this info. I'll have to dig in.

davidlmobley commented 7 years ago

OK, so to update on this:

It looks like OEMolToSmiles is what we want; I'm checking with Support on why OECreateIsoSmiString sometimes leads to different behavior (the docs leave it unclear and suggest BOTH for generating canonical isomeric SMILES)
There IS one duplicate pair here, mobley_4689084.mol2 vs mobley_352111.mol2
I think they sneaked in because at one point they had different names AND different SMILES (due to non-canonical SMILES) so my checks (checking the SMILES string for each molecule against the canonical SMILES for all other molecules) never caught the duplicate
The duplicate sneaked in because of an error by whoever was running SAMPL1 (OpenEye/Guthrie) -- they apparently included this in the prediction challenge (under a different name) even though it was already included in public databases (such as Rizzo's and mine) under a different name, and this mistake was never caught.

While having duplicates is bad, this is about as benign a duplicate as could possibly happen, in that the experimental value reported in both cases was identical, and the calculated values are within uncertainty of one another, so the overall effect is minor.

davidlmobley commented 7 years ago

For the record, this is info from James Haigh at OpenEye support:

It looks like we need to update the glossary part of the documentation to use OEMolToSmiles rather than OECreateIsoSmiString. OECreateIsoSmiString absolutely creates a canonical isomeric SMILES but only of the exact molecule that is present. OEMolToSmiles performs several perception calls on the molecule to ensure more consistency in the SMILES output.

Basically if you are reading molecules from different input sources they may be perceived in multiple different ways depending on the input file format or the method used to read. There are also multiple aromaticity models. OEMolToSmiles does perception to ensure consistency e.g. applying the OpenEye aromaticity model, perceiving stereochemistry etc.

A simple case is the Kekulé form of benzene. If I read that using OEParseSmiles and the generate a SMILES using OECreateIsoSmiString then I get C1=CC=CC=C1 out. But with either OEReadMolecule to read it, or OEMolToSmiles to generate a SMILES, I get c1ccccc1. See attached example.

Please let me know if you have any questions.

MobleyLab / FreeSolv

Potential duplicate molecules in FreeSolv Set #40