Non-canonical smiles confound string-based classifiers

cyrusmaher commented 3 years ago

Running a string kernel classifier on the ClinTox dataset, I obtain an AUROC of 0.96. When I canonicalize the SMILES, the AUROC drops to 0.69. This implies that the SMILES formatting differs systematically between positive and negative examples, and that string-based classifiers can exploit this bias to obtain unrealistically high performance, thereby tainting downstream benchmarks.

One solution would be to update the dataset so that it includes only canonicalized SMILES.

rbharath commented 3 years ago

Oh wow, that's quite the find! Yes, this would definitely need to be fixed as we're overhauling MoleculeNet for the next v2 release. I'll mark this as a bug.

gabegrand commented 3 years ago

@cyrusmaher do you have any insight as to what the bias in the SMILES is?

rbharath commented 3 years ago

Thinking about this some more, I believe we canonicalize SMILES before computing descriptors in DeepChem, which should handle this (but I'm not sure).

@cyrusmaher Would it be possible to provide a brief reproducing code snippet? That would help us figure out what's happening :)

cjmielke commented 3 years ago

One consideration: maybe compare character frequency counts between the raw and canonicalized forms? Maybe there are extra parentheses or aromatic operators added? (:)
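A rough sketch of the character-frequency comparison suggested above, in plain Python. The function names and the two toy input lists are hypothetical and hand-written for illustration; they are not taken from ClinTox, and in practice the canonical list would come from a toolkit such as RDKit.

```python
from collections import Counter


def char_frequencies(smiles_list):
    """Relative frequency of each character across a list of SMILES strings."""
    counts = Counter()
    for smi in smiles_list:
        counts.update(smi)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def frequency_shift(raw, canonical):
    """Per-character change in relative frequency (canonical minus raw),
    sorted so the largest absolute shifts come first."""
    raw_freq = char_frequencies(raw)
    can_freq = char_frequencies(canonical)
    chars = set(raw_freq) | set(can_freq)
    diffs = {ch: can_freq.get(ch, 0.0) - raw_freq.get(ch, 0.0) for ch in chars}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy inputs for illustration only (hypothetical, not ClinTox rows).
raw = ["C1=CC=CC=C1", "OCC"]
canonical = ["c1ccccc1", "CCO"]
shift = frequency_shift(raw, canonical)

for ch, delta in shift:
    print(f"{ch!r}: {delta:+.3f}")
```

Running the same comparison separately on the positive and negative examples would show whether the formatting differences line up with the labels.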

cyrusmaher commented 3 years ago

@rbharath The easiest way to reproduce this is to run a string model on ClinTox with and without SMILES canonicalization. Here is an example of the canonicalization step:

from rdkit import Chem
# smi is a single raw SMILES string from the dataset
canonical_smi = Chem.MolToSmiles(Chem.MolFromSmiles(smi), canonical=True)

Without bloating this with the helper code for the string kernel, etc., here's what I ran:

@gabegrand I'm not sure precisely, but delocalization, tautomers, salts, etc. can all be handled differently in systematic ways.

cyrusmaher commented 3 years ago

@rbharath It's worth considering that canonicalization would not entirely eliminate this bias, e.g. if one source database is more likely to include charged species (assuming a different pH or preparation). You can see evidence for this in the different levels of "+" and "." characters between positive and negative examples in ClinTox:

Edit: it appears that much of this significance is driven by SMILES that turn out to be duplicates once they're canonicalized.
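For anyone who wants to check both observations, here is a minimal sketch in plain Python. It assumes the data is available as (SMILES, label) pairs; the function names, the toy records, and the lookup-table canonicalizer are all hypothetical stand-ins, not ClinTox data or a DeepChem API.

```python
from collections import defaultdict


def char_prevalence_by_label(records, chars=("+", ".")):
    """Fraction of SMILES in each label group that contain each character."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for smi, label in records:
        totals[label] += 1
        for ch in chars:
            if ch in smi:
                hits[label][ch] += 1
    return {
        label: {ch: hits[label][ch] / n for ch in chars}
        for label, n in totals.items()
    }


def group_duplicates(smiles_list, canonicalize):
    """Group input strings by canonical form; keep groups with more than one member."""
    groups = defaultdict(list)
    for smi in smiles_list:
        groups[canonicalize(smi)].append(smi)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Toy records with made-up labels, for illustration only.
records = [("[NH4+].[Cl-]", 1), ("CCO", 0), ("C[NH3+]", 1), ("CC", 0)]
prevalence = char_prevalence_by_label(records)

# Hypothetical stand-in canonicalizer: a lookup table instead of a real toolkit.
toy_canonical = {"OCC": "CCO", "C(O)C": "CCO", "CCO": "CCO"}
dupes = group_duplicates(
    ["OCC", "C(O)C", "CCO", "CC"],
    canonicalize=lambda s: toy_canonical.get(s, s),
)
```

With RDKit installed, the canonicalize argument could instead wrap the Chem.MolToSmiles(Chem.MolFromSmiles(smi)) call shown earlier in the thread. Chem.MolFromSmiles returns None for strings it cannot parse, so those would need to be filtered out first.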

TWRogers commented 3 years ago

Firstly, thanks to the DeepChem & MoleculeNet contributors. It is a great library and a great benchmark!

However, I think this issue really needs to be fixed before people publish papers (if they haven't already), and potentially the ClinTox dataset should be dropped altogether.

I was reproducing the textcnn result on the ClinTox dataset. I was very pleased to reproduce the benchmark AUC of ~0.995!

However, when I examined the underlying dataset I found severe biases, which should be fixed in the overall benchmark for ClinTox. The benchmark shows textcnn winning by around 11%, which is very unlikely to be true.

I observed the following in my experiments.

Training textcnn on SMILES as given by the DeepChem data loader: Train: 0.991, Val: 0.995, Test: 0.994
Training on only the first and last 2 characters of the SMILES: Train: 0.850, Val: 0.700, Test: 0.956
Training on RDKit canonical SMILES: Train: 0.916, Val: 0.837, Test: 0.905

I think the most surprising result came from using only the first and last 2 characters of the SMILES. You can still achieve a very high test AUC of 0.956 (apparently +7% on SOTA), but would you really trust such a classifier to detect toxic molecules?!
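The exact truncation used in that second experiment isn't shown here, so the following is only one plausible reading of it: keep the first 2 and last 2 characters of each string and discard the rest. The helper name ends_only is hypothetical.

```python
def ends_only(smiles, n=2):
    """Keep only the first n and last n characters of a SMILES string.

    Strings of length 2 * n or shorter are returned unchanged, so no
    character is duplicated in the output.
    """
    if len(smiles) <= 2 * n:
        return smiles
    return smiles[:n] + smiles[-n:]


# Toy inputs for illustration only.
truncated = [ends_only(s) for s in ["c1ccccc1", "CCO", "[NH4+].[Cl-]"]]
print(truncated)
```

Feeding strings like these to the same model is a cheap sanity check: if the score stays high, the model is probably picking up on formatting rather than chemistry.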

The third experiment, on canonical SMILES, agrees with the findings of @cyrusmaher.

The model dc.models.TextCNNModel seems to use the SMILES in dataset.ids, which are not canonical. My suggestion would be to canonicalise them in the dataset loader by default for all datasets.

rbharath commented 3 years ago

Thanks for the detailed analysis! We're working towards a MoleculeNet 2.0 paper. We will update the recommendations and benchmark analysis for ClinTox as part of this release.

TWRogers commented 3 years ago

Thanks for your reply, that's good to know 😄

cmahervir commented 1 year ago

Any updates on this? Thanks!

deepchem / moleculenet

Non-canonical smiles confound string-based classifiers #15