deepchem / deepchem

Democratizing Deep-Learning for Drug Discovery, Quantum Chemistry, Materials Science and Biology
https://deepchem.io/
MIT License

Duplicate/Incorrect Molecules in BBBP Dataset #3122

Open HenryJia opened 2 years ago

HenryJia commented 2 years ago

🐛 Bug

In addition to https://github.com/deepchem/deepchem/issues/2336, there appear to be 2 entries named Bacitracin in the BBBP dataset, with different SMILES. One is labelled BBBP+ and the other BBBP-:

55,Bacitracin,1,c1(c(cc(NC(CCC)=O)cc1)C(C)=O)OCC(CNC(C)C)O

and

202,Bacitracin,0,CCC(C)C(N)C1=NC(CS1)C(=O)NC@@HC(=O)NC@HC(=O)NC@@HC(=O)NCCCC[C@@H]2NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@@HNC(=O)C@@HC@@HCC

Clearly, only one of these can be Bacitracin, as the two structures are very different.
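For anyone who wants to look for more cases like this, here is a minimal sketch that flags names appearing with more than one label. It assumes the BBBP CSV has the columns num, name, p_np, smiles (the header line below is my assumption, and `conflicting_names` is just a helper name made up for illustration). The two data rows are the ones quoted above, with the SMILES copied as displayed in this thread.

```python
import csv
import io
from collections import defaultdict

# Header line is an assumption about the BBBP CSV layout.
# The two rows are copied from this issue as displayed.
SAMPLE = """num,name,p_np,smiles
55,Bacitracin,1,c1(c(cc(NC(CCC)=O)cc1)C(C)=O)OCC(CNC(C)C)O
202,Bacitracin,0,CCC(C)C(N)C1=NC(CS1)C(=O)NC@@HC(=O)NC@HC(=O)NC@@HC(=O)NCCCC[C@@H]2NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@@HNC(=O)C@@HC@@HCC
"""


def conflicting_names(csv_text):
    """Return the (lower-cased) names that occur with more than one p_np label."""
    labels = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalise case and whitespace so "Bacitracin" and "bacitracin" collide.
        labels[row["name"].strip().lower()].add(row["p_np"].strip())
    return sorted(name for name, seen in labels.items() if len(seen) > 1)


print(conflicting_names(SAMPLE))  # ['bacitracin']
```

This only catches duplicates that share a name. Entries stored under a different name or an ID would need a structure-based comparison instead.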

HenryJia commented 1 year ago

Did a bit more digging: it seems that the first molecule is actually acebutolol, according to ChemSpider http://www.chemspider.com/Chemical-Structure.1901.html?rid=c3544756-c17f-41b3-b7a6-9b1ec7325512

HenryJia commented 1 year ago

Something else to add to this: it seems like there are 2 instances of Tiotidine.

62,22767,1,c1(nc(NC(N)=[NH2])sc1)CSCCNC(=[NH]C#N)NC

and

384,Tiotidine,1,CN=C(NCCSCc1csc(N=C(N)N)n1)NC#N

Both are the same compound according to ChemSpider http://www.chemspider.com/Chemical-Structure.45601.html?rid=4afdeaf4-b290-4375-b966-11b94813582e

HenryJia commented 1 year ago

There are also 2 instances of Homophenazine/Homofenazine. These molecules are indeed identical when viewed using RDKit and VMD.

1864,homofenazine(homophenazine),1,C1=C(C(F)(F)F)C=CC3=C1N(C2=C(C=CC=C2)S3)CCCN4CCN(CCO)CCC4

660,homophenazine,1,[H+].[H+].[Cl-].[Cl-].OCCN1CCCN(CCCN2c3ccccc3Sc4cc(ccc24)C(F)(F)F)CC1
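Note that entry 660 is written as a salt ([H+] and [Cl-] fragments), so a plain string comparison of the two SMILES will never match. Below is a rough sketch of stripping the counterions before comparing. It uses string length as a crude stand-in for "largest fragment", which is my assumption and not how a cheminformatics toolkit would do it, and `largest_component` is a made-up helper name.

```python
def largest_component(smiles: str) -> str:
    """Return the longest dot-separated component of a SMILES string.

    Crude heuristic: string length stands in for fragment size.
    """
    return max(smiles.split("."), key=len)


# SMILES copied from entries 660 and 1864 above.
salt = "[H+].[H+].[Cl-].[Cl-].OCCN1CCCN(CCCN2c3ccccc3Sc4cc(ccc24)C(F)(F)F)CC1"
free = "C1=C(C(F)(F)F)C=CC3=C1N(C2=C(C=CC=C2)S3)CCCN4CCN(CCO)CCC4"

print(largest_component(salt))          # parent structure without counterions
print(largest_component(free) == free)  # True: nothing to strip here
```

The two parent strings still differ in atom ordering and aromatic notation, so they would have to be canonicalised in a toolkit such as RDKit before testing them for equality.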