deepchem / moleculenet

Moleculenet.ai Datasets And Splits
MIT License

Filtering False Positives from MoleculeNet datasets #13

Open rbharath opened 4 years ago

rbharath commented 4 years ago

A number of MoleculeNet datasets contain obvious false positives. For example, the HIV dataset includes a number of cytotoxic compounds labeled as hits. There are likely a number of PAINS (pan-assay interference) compounds scattered throughout as well. These should be filtered out for the next version of MoleculeNet. (H/T @PatWalters for bringing this up).
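For discussion, here is a rough sketch of what a PAINS pass over a dataset could look like. It assumes records are `(SMILES, label)` pairs with `1` meaning active, and that only flagged actives get pulled out (inactives cannot be false positives). `make_pains_checker` and `split_flagged` are hypothetical names, not existing DeepChem functions. The checker uses RDKit's built-in PAINS filter catalog, and RDKit is imported lazily so the splitting logic can be tried with any other predicate. This does not address the cytotoxicity problem, which would need assay data rather than substructure alerts.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, int]  # (SMILES, label), label 1 = active (assumed layout)


def make_pains_checker() -> Callable[[str], bool]:
    """Build a SMILES -> bool predicate backed by RDKit's PAINS catalog."""
    # Imported here so the rest of the module works without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
    catalog = FilterCatalog(params)

    def is_pains(smiles: str) -> bool:
        mol = Chem.MolFromSmiles(smiles)
        # Unparseable SMILES are not flagged here; they need separate handling.
        return mol is not None and catalog.HasMatch(mol)

    return is_pains


def split_flagged(
    records: Iterable[Record],
    is_flagged: Callable[[str], bool],
) -> Tuple[List[Record], List[Record]]:
    """Separate actives matched by `is_flagged` from everything else."""
    kept: List[Record] = []
    flagged: List[Record] = []
    for smiles, label in records:
        # Only actives are candidates for being false positives.
        if label == 1 and is_flagged(smiles):
            flagged.append((smiles, label))
        else:
            kept.append((smiles, label))
    return kept, flagged
```

With RDKit available, usage would be `kept, flagged = split_flagged(records, make_pains_checker())`. Returning the flagged records instead of silently dropping them makes it possible to review what was removed before committing to a v2 dataset.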

Let's use this issue to coordinate cleanup efforts and discussion. As a first point, it would probably be useful to keep the v1 MoleculeNet datasets around as records. We could then add a version flag to the dc.molnet.load_X functions that returns either the v2 dataset or the v1 dataset.
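To make the version-flag idea concrete, here is a minimal sketch of one way it could be wired up. Everything in it is hypothetical: the registry, the `register` decorator, `load_dataset`, and the placeholder return values are illustrations only, and none of this reflects the actual signatures of the `dc.molnet` loaders.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping (dataset name, version) to a loader function.
_LOADERS: Dict[Tuple[str, str], Callable[[], object]] = {}


def register(name: str, version: str):
    """Decorator that records a loader under a (name, version) key."""
    def decorator(fn: Callable[[], object]) -> Callable[[], object]:
        _LOADERS[(name, version)] = fn
        return fn
    return decorator


@register("hiv", "v1")
def _load_hiv_v1() -> object:
    # Placeholder standing in for the original, unfiltered dataset.
    return "hiv-v1 (original)"


@register("hiv", "v2")
def _load_hiv_v2() -> object:
    # Placeholder standing in for the filtered dataset.
    return "hiv-v2 (filtered)"


def load_dataset(name: str, version: str = "v2") -> object:
    """Return the requested version of a dataset, defaulting to v2."""
    try:
        loader = _LOADERS[(name, version)]
    except KeyError:
        available = sorted(v for (n, v) in _LOADERS if n == name)
        raise ValueError(
            f"No version {version!r} of {name!r}; available: {available}"
        ) from None
    return loader()
```

Here `load_dataset("hiv")` would return the filtered data and `load_dataset("hiv", version="v1")` the original. Whether the default should be v2 or v1 is an open question: defaulting to v2 changes results for existing users, while defaulting to v1 keeps the known false positives in the default path.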

rbharath commented 4 years ago

As a useful link, here's @PatWalters's blog post about filtering chemical libraries: http://practicalcheminformatics.blogspot.com/2018/08/filtering-chemical-libraries.html