deepchem / moleculenet

Moleculenet.ai Datasets And Splits
MIT License

Potential Processing Error in the Original QM8 Dataset on Some Tasks #41

Open rbharath opened 2 years ago

rbharath commented 2 years ago

The QM8 dataset from the original MoleculeNet paper may contain an error: some task columns appear to be duplicated, possibly because of a pandas data processing mistake.

https://github.com/deepchem/deepchem/issues/2747

We are still working to verify the error, but in the meantime there is a fix PR under review that you can use:

https://github.com/deepchem/deepchem/pull/2756

Assuming the error is indeed present, the benchmarking numbers for QM8 may need to be rerun. The duplicated columns belong to two very similar tasks, though: both predict DFT results on the same molecule, computed with the same functional but different basis sets. I therefore suspect that the qualitative changes will be relatively minimal. In effect, models have been predicting one DFT run twice instead of two slightly different DFT runs.
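
For anyone who wants to check a local copy of the data, here is a minimal sketch of how duplicated task columns could be detected with pandas. It is not the fix from the PR above. The helper name `find_duplicate_columns` and the column names in the toy frame are hypothetical and stand in for whatever the real QM8 CSV contains.

```python
import pandas as pd


def find_duplicate_columns(df: pd.DataFrame) -> list[tuple[str, str]]:
    """Return (first, second) name pairs for columns holding identical values.

    Hypothetical helper for illustration; not part of DeepChem or pandas.
    """
    pairs = []
    n_cols = df.shape[1]
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # Positional access keeps this safe even if two columns share a name.
            if df.iloc[:, i].equals(df.iloc[:, j]):
                pairs.append((df.columns[i], df.columns[j]))
    return pairs


# Toy data with made-up column names; the values are NOT taken from QM8.
toy = pd.DataFrame(
    {
        "smiles": ["C", "CC", "CCO"],
        "task_basis_a": [0.10, 0.20, 0.30],
        "task_basis_b": [0.10, 0.20, 0.30],  # exact copy of task_basis_a
        "other_task": [0.11, 0.19, 0.33],
    }
)

duplicate_pairs = find_duplicate_columns(toy)
print(duplicate_pairs)
```

On the toy frame only the two identical task columns should be reported. Running the same check on the real dataset requires loading the QM8 CSV yourself, for example with `pd.read_csv`. The pairwise comparison is quadratic in the number of columns, which is fine for a dataset with a small number of tasks.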