Fixing issues with two SMILES strings for nitro-containing compounds

davidlmobley commented 9 years ago

Due to an earlier issue (possibly with how Antechamber parsed the files), the .mol2 files for two compounds (mobley_3802803 and mobley_9741965) had incorrect bond orders in their nitro groups, which led to incorrect SMILES strings being generated when FreeSolv was created. When these molecules were regenerated from SMILES, each molecule ended up with a net charge of -1 per nitro group (since the groups were treated as N([O-])[O-] rather than [N+](=O)[O-]). The SMILES strings are now corrected, as are the bond orders in the .mol2 files, though the .mol2 files retain the charges that were used in the hydration free energy calculations.
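The symptom described above (a supposedly neutral molecule picking up a net charge when rebuilt from SMILES) suggests a cheap screen that could be run over a set of SMILES strings. The following is a minimal sketch, assuming only the Python standard library; the function name `net_formal_charge` and the nitromethane example strings are illustrative and not part of FreeSolv. It only sums the explicit charges written on bracket atoms and does no real SMILES parsing or valence checking.

```python
import re

# Bracket atoms such as [N+], [O-], [NH4+], [Fe+2]; charges never appear outside brackets.
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
# A charge is a run of '+' or '-' optionally followed by a count, e.g. '-', '++', '+2'.
CHARGE = re.compile(r"(\++|-+)(\d*)")


def net_formal_charge(smiles: str) -> int:
    """Sum the explicit formal charges written on bracket atoms in a SMILES string."""
    total = 0
    for atom in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(atom)
        if match is None:
            continue  # bracket atom without a charge, e.g. [C@@H]
        signs, digits = match.groups()
        sign = 1 if signs[0] == "+" else -1
        magnitude = int(digits) if digits else len(signs)
        total += sign * magnitude
    return total


# Illustrative strings only (nitromethane), not the actual FreeSolv entries.
for smiles in ("C[N+](=O)[O-]", "CN([O-])[O-]"):
    charge = net_formal_charge(smiles)
    status = "ok" if charge == 0 else "CHECK: nonzero net charge"
    print(f"{smiles}: {status}")
```

Any nonzero result for a compound that is expected to be neutral flags that string for manual review. For anything beyond this quick screen, a cheminformatics toolkit would be the more robust choice.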

A more complete fix, which will also correct the topology files and calculated values, will come when we finish the calculations that regenerate everything from the primary data. These will likely be complete in the near future.

jchodera commented 9 years ago

Aren't the isomeric SMILES strings the 'canonical source data' for each compound? If so, why are we deriving SMILES strings from mol2 files?

What actually is the canonical source data for the compounds here?

davidlmobley commented 9 years ago

Currently, the isomeric SMILES are the canonical source data. On an ongoing basis, we're not deriving SMILES from .mol2 files. However, this was done several times in the past. For example, some of the data comes from this type of scenario: someone provides us with a compound name and a .mol2 file which is supposed to contain that molecule. We generate isomeric SMILES from the compound name and isomeric SMILES from the .mol2 file, cross-check these to confirm that they match, and then retain the isomeric SMILES.
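The cross-check described above can be sketched roughly as follows. This is a hypothetical illustration, not the code used for FreeSolv: the function `cross_check` and its `canonicalize` parameter are assumptions, with `canonicalize` standing in for whatever toolkit routine turns a SMILES string into a canonical isomeric form. Comparing raw SMILES text without canonicalizing first would be unreliable, since one molecule can be written many ways.

```python
from typing import Callable, Optional


def cross_check(
    name_smiles: str,
    mol2_smiles: str,
    canonicalize: Callable[[str], str],
) -> Optional[str]:
    """Return the agreed canonical SMILES, or None if the two sources disagree.

    name_smiles  -- SMILES generated from the compound name
    mol2_smiles  -- SMILES generated from the supplied .mol2 file
    canonicalize -- hypothetical toolkit-provided canonicalization function
    """
    from_name = canonicalize(name_smiles)
    from_mol2 = canonicalize(mol2_smiles)
    if from_name != from_mol2:
        return None  # mismatch: needs manual inspection before anything is retained
    return from_name
```

Returning `None` on a mismatch, rather than silently preferring one source, keeps the disagreement visible so a person can decide which input was wrong.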

Since (until very recently) papers did not typically contain SMILES strings for compounds, you can really think of every compound in the set as originating from some experimental data which listed a compound name or a 2D structure or both. So in some sense the original data at some point was a compound name. But these names were not always machine-parseable, were sometimes ambiguous (so the intended compound had to be determined from context), and so on. Also, we've made no effort to retain original compound names over the years.

So, long story short - the isomeric SMILES are the data from which we ought to be able to re-generate the whole set now and going forward. But because the original construction of the set involved someone in the past manually pulling things out of papers, there can be errors in every piece of data we have - compound names, SMILES, .mol2 files, etc. That's precisely what makes this task hard. Generating the canonical source data in the first place required some manual aspects and hence is susceptible to error. The SMILES strings in my view will have fewer issues than any of the other data we have - but still can have issues.

MobleyLab / FreeSolv
