MobleyLab / FreeSolv

Experimental and calculated small molecule hydration free energies
http://www.escholarship.org/uc/item/6sd403pz

mobley_3323117 (sulfolane) has non-standard SMILES #51

Open jchodera opened 1 year ago

jchodera commented 1 year ago

Molecule mobley_3323117 (sulfolane) is written with the non-standard SMILES C1CC[S+2](C1)([O-])[O-], rather than the more standard C1CCS(=O)(=O)C1.

Although both forms have the same total charge, they are not equivalent: the FreeSolv SMILES assigns explicit formal charges (+2 on S, -1 on each O), whereas the standard SMILES gives every atom a formal charge of 0. The OpenFF toolkit (with the OpenEye backend) therefore treats the two molecular representations as different:

>>> from openff.toolkit.topology import Molecule
>>> freesolv_molecule = Molecule.from_smiles('C1CC[S+2](C1)([O-])[O-]')
>>> standard_molecule = Molecule.from_smiles('C1CCS(=O)(=O)C1')
>>> freesolv_molecule.generate_unique_atom_names()
>>> standard_molecule.generate_unique_atom_names()
>>> [(atom.name, atom.formal_charge.m) for atom in freesolv_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 2), ('C4x', 0), ('O1x', -1), ('O2x', -1), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
>>> [(atom.name, atom.formal_charge.m) for atom in standard_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 0), ('O1x', 0), ('O2x', 0), ('C4x', 0), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
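
The same difference can be seen without any cheminformatics toolkit by reading the explicit charges straight out of the SMILES strings. The following is only a minimal sketch: the helper `formal_charges` and its regular expressions are assumptions made for illustration (they are not part of the OpenFF toolkit), and they only handle bracket atoms with simple charge notation such as `+2`, `-` or `--`.

```python
import re

# Hypothetical helper, for illustration only: it is not a full SMILES parser.
BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
CHARGE = re.compile(r"(\++|-+)(\d*)$")
SYMBOL = re.compile(r"\d*(\*|[A-Z][a-z]?|[a-z]{1,2})")


def formal_charges(smiles):
    """Return (element symbol, formal charge) for every bracket atom in a SMILES.

    Atoms written outside brackets carry no explicit charge and are skipped.
    """
    charges = []
    for content in BRACKET_ATOM.findall(smiles):
        content = content.split(":")[0]  # drop an optional atom-class label
        symbol = SYMBOL.match(content).group(1)
        match = CHARGE.search(content)
        if match is None:
            charge = 0
        else:
            signs, digits = match.groups()
            # "+2" -> 2, "-" -> -1, "--" -> -2
            magnitude = int(digits) if digits else len(signs)
            charge = magnitude if signs[0] == "+" else -magnitude
        charges.append((symbol, charge))
    return charges


freesolv = formal_charges("C1CC[S+2](C1)([O-])[O-]")
standard = formal_charges("C1CCS(=O)(=O)C1")

print(freesolv)                       # [('S', 2), ('O', -1), ('O', -1)]
print(standard)                       # []
print(sum(q for _, q in freesolv))    # 0
print(sum(q for _, q in standard))    # 0
```

Both strings sum to a net charge of zero, but only the FreeSolv form places non-zero formal charges on individual atoms, which is what the toolkit comparison above picks up.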

Would it be reasonable to correct the non-standard SMILES string and re-generate the database? Or are there ways to automatically standardize the formal charges?

davidlmobley commented 1 year ago

This SMILES (or all of them) could be canonicalized. I think the best option, however, is to fix only THIS SMILES. Otherwise we run into the problem of "what should be considered the authoritative identifier for a molecule, from which everything else can be generated?" We've moved to treating SMILES as the authoritative source data, so re-generating all SMILES from the existing SMILES is probably unwise: we would be overwriting our authoritative source data.

So, perhaps we should correct only this one?
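
A targeted fix of this kind could be sketched as an explicit override table, so that every other SMILES stays untouched and remains authoritative. This is only an illustration under stated assumptions: the function `apply_smiles_overrides`, the `smiles` key, the dictionary layout, and the placeholder second entry are all hypothetical and may not match how the FreeSolv database is actually stored or regenerated.

```python
# Hypothetical override table: compound ID -> (expected old SMILES, corrected SMILES).
SMILES_OVERRIDES = {
    "mobley_3323117": ("C1CC[S+2](C1)([O-])[O-]", "C1CCS(=O)(=O)C1"),
}


def apply_smiles_overrides(database, overrides=SMILES_OVERRIDES):
    """Return a corrected copy of `database` plus the list of changed IDs.

    Only entries named in `overrides` are modified. The stored SMILES must
    match the expected old value, so a stale override fails loudly instead
    of silently overwriting source data.
    """
    corrected = {}
    changed = []
    for compound_id, entry in database.items():
        new_entry = dict(entry)  # copy so the input is left unmodified
        if compound_id in overrides:
            old, new = overrides[compound_id]
            if entry["smiles"] != old:
                raise ValueError(
                    f"{compound_id}: expected {old!r}, found {entry['smiles']!r}"
                )
            new_entry["smiles"] = new
            changed.append(compound_id)
        corrected[compound_id] = new_entry
    return corrected, changed


# Toy data: the second ID is a placeholder, not a real FreeSolv entry.
toy_database = {
    "mobley_3323117": {"smiles": "C1CC[S+2](C1)([O-])[O-]"},
    "placeholder_id": {"smiles": "CO"},
}

corrected, changed = apply_smiles_overrides(toy_database)
print(changed)                                  # ['mobley_3323117']
print(corrected["mobley_3323117"]["smiles"])    # C1CCS(=O)(=O)C1
print(corrected["placeholder_id"]["smiles"])    # CO
```

Checking the old value before replacing it keeps the correction auditable: if the source SMILES ever changes upstream, the override raises an error rather than being applied blindly.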