Open · cyrusmaher opened 3 years ago
Thanks for raising this! A couple of weeks ago I changed the photoswitch dataset to canonicalise all SMILES as a preprocessing step; I will make sure this is implemented for the other datasets too!
It's possible that canonicalization doesn't fully eliminate the bias (e.g. if one set computes SMILES at a different pH or is more likely to include salt forms). You can see that in the enrichment of "." and "+" characters in positive versus negative examples in ClinTox.
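A rough way to check for this kind of character enrichment is sketched below. The function names and the toy SMILES strings are made up for illustration (they are not taken from ClinTox); the idea is just to compare the fraction of strings containing "." or "+" in each class.

```python
def char_fraction(smiles_list, chars=(".", "+")):
    """Fraction of SMILES strings that contain each character at least once."""
    n = len(smiles_list)
    return {c: (sum(c in s for s in smiles_list) / n if n else 0.0) for c in chars}


def char_enrichment(positives, negatives, chars=(".", "+")):
    """Map each character to (fraction in positives, fraction in negatives)."""
    pos = char_fraction(positives, chars)
    neg = char_fraction(negatives, chars)
    return {c: (pos[c], neg[c]) for c in chars}


# Toy, hypothetical examples only: two salt/charged forms among the positives.
positives = ["CC(=O)[O-].[Na+]", "C[NH3+].[Cl-]", "CCO"]
negatives = ["CCO", "c1ccccc1", "CC(=O)O"]

result = char_enrichment(positives, negatives)
for char, (pos_frac, neg_frac) in result.items():
    print(f"{char!r}: positives={pos_frac:.2f}, negatives={neg_frac:.2f}")
```

A large gap between the two fractions would suggest that a string-based model can separate the classes from salt or charge notation alone, which canonicalization by itself would not remove.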
Interesting! @henrymoss and I will keep track of this conversation you guys are having in DeepChem!
This is interesting indeed!
I wonder if this aligns with the lack of improvement we saw when augmenting the data with extra (non-canonical) SMILES. Basically, we could only learn from training data in canonical form, as our test data was also canonical. Even after increasing the data fivefold through augmentation, we couldn't improve performance.
Just a heads up on this issue: https://github.com/deepchem/moleculenet/issues/15
I propose that string-based classifiers canonicalize SMILES prior to processing, to prevent confounded estimates of performance, CI, etc.
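For what it's worth, a minimal sketch of such a preprocessing step is below. It assumes RDKit is available for the actual canonicalization (`Chem.MolFromSmiles` and `Chem.MolToSmiles`); the helper names `canonicalize_smiles` and `rdkit_canonicalize` are hypothetical and not part of any dataset loader mentioned here.

```python
def canonicalize_smiles(smiles_list, canonicalize):
    """Apply `canonicalize` to every SMILES string.

    `canonicalize` is any callable returning a canonical string, or None
    when the input cannot be parsed. Returns (canonical, failed).
    """
    canonical, failed = [], []
    for smiles in smiles_list:
        result = canonicalize(smiles)
        if result is None:
            failed.append(smiles)  # keep track of unparseable inputs
        else:
            canonical.append(result)
    return canonical, failed


def rdkit_canonicalize(smiles):
    """Canonical SMILES via RDKit, or None if RDKit cannot parse the input."""
    # Imported lazily so the helper above has no hard RDKit dependency.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None
```

Usage would look like `train, bad = canonicalize_smiles(raw_train, rdkit_canonicalize)`, applied identically to the train and test splits. Note that this only normalizes how each molecule is written; salt forms and charge states are kept as they are.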