ATOMScience-org / AMPL

The ATOM Modeling PipeLine (AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.
MIT License

ECFP4 fingerprints fail to distinguish some very different molecules #311

Open mcloughlin2 opened 1 week ago

mcloughlin2 commented 1 week ago

In MultitaskScaffoldSplitter, with certain datasets, you often see warning messages saying "two scaffolds match exactly?!?". This happens when the minimum Tanimoto distance between pairs of compounds from the two scaffolds is zero. One would think this means that the same compound somehow wound up in two different scaffold sets, but it doesn't.

The actual issue is that we generally compute Tanimoto distances with radius 2 (ECFP4) fingerprints, and two compounds with very different scaffolds can have the same ECFP4 fingerprint. Here are some examples of pairs of compounds that led to the above warning message:

SMILES 1: CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O
SMILES 2: CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O

which have structures: [images of the two structures omitted]

SMILES 3: CC(CN1CCCCCC1)NC(=O)/C=N/O
SMILES 4: CC(CN1CCCCC1)NC(=O)/C=N/O

with structures: [images of the two structures omitted]

SMILES 3 has a 7- rather than a 6-membered ring, and therefore a completely different scaffold. However, because an ECFP fingerprint simply represents a bag of chemical substructures found within a specified radius of each atom, the two molecules in each pair have exactly the same fingerprint at radius 2. You have to increase the radius to 3 to get different fingerprints for both pairs of examples.
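To illustrate why a bag of radius-limited substructures cannot see ring size, here is a toy sketch in plain Python. It is not RDKit's or AMPL's algorithm: the graph builder, the identifier scheme (nested tuples instead of hashes) and all function names are made up for illustration, and the molecules are simplified to an N-methyl ring with hydrogens and bond orders ignored.

```python
def ring_with_substituent(ring_size):
    """Toy graph: a saturated ring with one N atom that carries a methyl group.

    Returns (labels, neighbors): labels[i] is the element of atom i and
    neighbors[i] lists the atoms bonded to atom i.
    """
    labels = ["N"] + ["C"] * (ring_size - 1) + ["C"]  # last atom is the methyl carbon
    methyl = ring_size
    neighbors = {i: [(i - 1) % ring_size, (i + 1) % ring_size] for i in range(ring_size)}
    neighbors[0].append(methyl)
    neighbors[methyl] = [0]
    return labels, neighbors


def substructure_bag(labels, neighbors, radius):
    """Set of atom-environment identifiers for every radius from 0 to `radius`."""
    # Radius 0: an atom is described by its element and heavy-atom degree.
    ids = {a: (labels[a], len(neighbors[a])) for a in neighbors}
    bag = set(ids.values())
    for _ in range(radius):
        # Each round grows every environment by one bond: the new identifier
        # combines the atom's previous identifier with those of its neighbors.
        ids = {a: (ids[a], tuple(sorted(ids[n] for n in neighbors[a]))) for a in neighbors}
        bag.update(ids.values())
    return bag


def tanimoto_distance(bag_a, bag_b):
    """1 - |A and B| / |A or B| on two sets of identifiers."""
    return 1.0 - len(bag_a & bag_b) / len(bag_a | bag_b)


six_ring = ring_with_substituent(6)    # stands in for the piperidine ring
seven_ring = ring_with_substituent(7)  # stands in for the azepane ring
for radius in (2, 3):
    dist = tanimoto_distance(
        substructure_bag(*six_ring, radius),
        substructure_bag(*seven_ring, radius),
    )
    print(f"radius {radius}: distance {dist:.3f}")
```

In this toy model the two bags are identical at radius 2, so the distance is zero. At radius 3 the carbon opposite the nitrogen in the 6-membered ring reaches the nitrogen along both directions, which no atom in the 7-membered ring does, so the bags differ.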

The solution for MultitaskScaffoldSplitter is simply to increase the radius used for computing the scaffold-scaffold distance matrices. We may want to do likewise in the other AMPL modules where fingerprints are commonly used for measuring and visualizing chemical diversity: chem_diversity, diversity_plots, compare_splits_plots and rdkit_easy. This is also something to think about when using ECFP features in AMPL models.
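A quick way to check the effect of the radius on the example pairs above is to compute the distances directly with RDKit. This is only a sketch, not the code AMPL uses: the helper name `ecfp_distance`, the 2048-bit folded bit vector and the default (non-chiral) Morgan settings are assumptions.

```python
from rdkit import Chem
from rdkit.Chem import DataStructs, rdFingerprintGenerator

# Example pairs quoted in this issue.
PAIRS = {
    "peptide length": (
        "CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O",
        "CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)"
        "C(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O",
    ),
    "ring size": (
        "CC(CN1CCCCCC1)NC(=O)/C=N/O",
        "CC(CN1CCCCC1)NC(=O)/C=N/O",
    ),
}


def ecfp_distance(smiles_a, smiles_b, radius, n_bits=2048):
    """Tanimoto distance between folded Morgan bit vectors (hypothetical helper)."""
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    fp_a = gen.GetFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = gen.GetFingerprint(Chem.MolFromSmiles(smiles_b))
    return 1.0 - DataStructs.TanimotoSimilarity(fp_a, fp_b)


for name, (smiles_a, smiles_b) in PAIRS.items():
    for radius in (2, 3):
        dist = ecfp_distance(smiles_a, smiles_b, radius)
        print(f"{name}, radius {radius}: distance {dist:.3f}")
```

Note that this compares bit vectors, which record only whether a substructure is present. Count-based fingerprints would also separate these pairs, but that is a different change from the radius increase proposed here.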

paulsonak commented 1 week ago

Perhaps we should change the default to radius 3 in all the places it's used?