Open jchodera opened 1 year ago
This SMILES (or all of them) could be canonicalized. I think the best option, however, is to fix only THIS SMILES. Otherwise we run into the problem of "what should be considered the authoritative identifier for a molecule, from which everything else can be generated?" We've moved to treating SMILES as the authoritative source data, so re-generating all SMILES from the existing SMILES is probably unwise: we would be overwriting our authoritative source data.
So, perhaps we should correct only this one?
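A minimal sketch of what "correct only this one" could look like. The database layout (a dict keyed by compound ID with a `smiles` field), the `apply_corrections` helper, and the `example_entry` record are assumptions made for illustration, not the project's actual code. The old value is checked before it is replaced, so no other authoritative SMILES gets overwritten by accident.

```python
import copy

# Hypothetical correction table: compound ID -> (expected old SMILES, corrected SMILES).
CORRECTIONS = {
    "mobley_3323117": ("C1CC[S+2](C1)([O-])[O-]", "C1CCS(=O)(=O)C1"),
}


def apply_corrections(database, corrections):
    """Return a copy of `database` in which only the listed entries are changed."""
    fixed = copy.deepcopy(database)
    for compound_id, (old_smiles, new_smiles) in corrections.items():
        current = fixed[compound_id]["smiles"]
        if current != old_smiles:
            # Refuse to touch an entry that does not match what we expect to fix.
            raise ValueError(
                f"{compound_id}: expected {old_smiles!r}, found {current!r}"
            )
        fixed[compound_id]["smiles"] = new_smiles
    return fixed


# Assumed layout; "example_entry" is a made-up record that must stay untouched.
database = {
    "mobley_3323117": {"smiles": "C1CC[S+2](C1)([O-])[O-]"},
    "example_entry": {"smiles": "CCO"},
}
fixed = apply_corrections(database, CORRECTIONS)
print(fixed["mobley_3323117"]["smiles"])
```

Every entry not named in the correction table keeps its original SMILES, which preserves the "SMILES is authoritative" convention for the rest of the set.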
Molecule mobley_3323117 (sulfolane) is written with the non-standard SMILES `C1CC[S+2](C1)([O-])[O-]`, rather than the more standard `C1CCS(=O)(=O)C1`.
Although the two forms have the same total charge, they differ in their formal charges (+2 on S and -1 on each O in the provided SMILES, versus 0 on every atom in the standard SMILES), so they yield inequivalent molecular representations in the OpenFF toolkit (with the OpenEye backend).
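The difference can be made concrete without any cheminformatics toolkit. The sketch below uses only the Python standard library to read the formal charges out of the bracket atoms in each SMILES string. `formal_charges` is a hypothetical helper and a deliberately simplified parser, not a general SMILES reader, and it does not reproduce the OpenFF or OpenEye comparison itself.

```python
import re


def formal_charges(smiles):
    """Return (element, charge) for each bracket atom in a SMILES string.

    Atoms written outside brackets carry no formal charge in SMILES,
    so they are skipped. Simplified: handles forms like [S+2], [O-], [Fe+++].
    """
    charges = []
    for body in re.findall(r"\[([^\]]+)\]", smiles):
        # Element symbol, after an optional isotope number.
        element = re.match(r"\d*([A-Z][a-z]?|[a-z]{1,2})", body).group(1)
        match = re.search(r"([+-]+)(\d*)", body)
        if match is None:
            charge = 0
        else:
            signs, digits = match.groups()
            sign = 1 if signs[0] == "+" else -1
            # "+2" gives a magnitude of 2; "++" gives a magnitude of 2 as well.
            charge = sign * (int(digits) if digits else len(signs))
        charges.append((element, charge))
    return charges


charged = "C1CC[S+2](C1)([O-])[O-]"
standard = "C1CCS(=O)(=O)C1"

for smiles in (charged, standard):
    per_atom = formal_charges(smiles)
    total = sum(charge for _, charge in per_atom)
    print(smiles, per_atom, "total charge:", total)
```

Both strings sum to a total charge of 0, but only the first assigns non-zero formal charges to individual atoms, which is why a comparison that includes per-atom formal charges treats them as different molecules.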
Would it be reasonable to correct the non-standard SMILES string and re-generate the database? Or are there ways to automatically standardize the formal charges?