String kernels can exploit biases in SMILES string format, skewing performance metrics

Ryan-Rhys / FlowMO

Library for training Gaussian Processes on Molecules

MIT License


String kernels can exploit biases in SMILES string format, skewing performance metrics #26

Open cyrusmaher opened 3 years ago

cyrusmaher commented 3 years ago

Just a heads up on this issue: https://github.com/deepchem/moleculenet/issues/15

I propose that string-based classifiers canonicalize SMILES before processing, to avoid confounded estimates of performance, CIs, etc.
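A minimal sketch of what such a preprocessing step could look like, assuming RDKit is the canonicalizer (the thread does not say which toolkit is used). The helper names `rdkit_canonical` and `canonicalize_all` are hypothetical and are not part of FlowMO.

```python
from typing import Callable, Iterable, List, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    # Assumption: RDKit is installed. It is imported lazily so the rest of
    # the module can be used with a different canonicalizer.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


def canonicalize_all(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> List[Optional[str]]:
    """Canonicalize every SMILES string before it reaches a string kernel.

    Unparseable entries are kept as None so the output stays aligned
    with the label array.
    """
    return [canonicalize(s) for s in smiles_list]
```

Applying the same function to both the training and the test split should mean that two different spellings of one molecule (for example `OCC` and `C(O)C` for ethanol) end up as the same string.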

Ryan-Rhys commented 3 years ago

Thanks for raising this! A couple of weeks ago I changed the photoswitch dataset to canonicalise all SMILES as a preprocessing step. I will make sure this is implemented for the other datasets too!

cyrusmaher commented 3 years ago

It's possible that canonicalization doesn't fully eliminate the bias (e.g. if one set generates its SMILES at a different pH or is more likely to include salt forms). You can see this in ClinTox, where the "." and "+" characters are enriched in one class relative to the other when comparing positive and negative examples.
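One way to check for this kind of enrichment is to compare how often each character appears in the positive and negative classes. The sketch below uses only the standard library. The function names and the toy molecules are made up for illustration and are not taken from ClinTox.

```python
from typing import Dict, Iterable, Sequence


def char_presence_rate(smiles: Iterable[str], chars: Sequence[str]) -> Dict[str, float]:
    """Fraction of SMILES strings that contain each character at least once."""
    items = list(smiles)
    if not items:
        return {c: 0.0 for c in chars}
    return {c: sum(c in s for s in items) / len(items) for c in chars}


def presence_gap(
    positives: Iterable[str],
    negatives: Iterable[str],
    chars: Sequence[str] = (".", "+"),
) -> Dict[str, float]:
    """Presence rate in the positive class minus the rate in the negative class."""
    pos = char_presence_rate(positives, chars)
    neg = char_presence_rate(negatives, chars)
    return {c: pos[c] - neg[c] for c in chars}


# Toy data (hypothetical): salts and charged forms are over-represented
# among the positives.
toy_positives = ["CC(=O)[O-].[Na+]", "C[NH3+].[Cl-]", "CCO"]
toy_negatives = ["CCO", "c1ccccc1", "CCN", "C[NH3+]"]
gap = presence_gap(toy_positives, toy_negatives)
```

A gap far from zero for "." (fragment separator, typical of salts) or "+" (formal charge) suggests a string-based model could separate the classes from formatting alone, which canonicalization would not remove.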

Ryan-Rhys commented 3 years ago

Interesting! @henrymoss and I will keep track of this conversation you guys are having in DeepChem!

henrymoss commented 3 years ago

This is interesting indeed!

I wonder if this aligns with the lack of improvement we observed when augmenting the data with extra (non-canonical) SMILES. Basically, we could only learn from training data in canonical form, as our test data was also canonical. Even a 5x increase in training data (through augmentation) didn't improve performance.