deepchem / moleculenet

Moleculenet.ai Datasets And Splits

Processing of QM7/QM9 Targets #42

Open FelixKatz77 opened 2 years ago

FelixKatz77 commented 2 years ago

Hi, I wanted to use the splits of the QM7 and QM8 datasets for benchmarking when I noticed a discrepancy between the targets accessible via the load_qm7()/load_qm8() functions and the original targets of these datasets (http://quantum-machine.org/datasets/). I could not find any information on any processing of the targets. Could you clarify if any normalisation or rescaling was done?

I was also wondering how the benchmark performance was determined in the case of multitask datasets. In these cases, was a single task taken into account or the performance on all tasks? Thanks!

rbharath commented 2 years ago

I believe that outputs are normalized (see https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#qm7-datasets, and linked source). The discrepancy between the load functions and the original datasets is a little disconcerting and something we should investigate.

For benchmark performance, I believe it is mean performance across all tasks, but I'm going from memory and may be wrong.
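
As an illustration of what "mean performance across all tasks" could look like, here is a minimal sketch: compute the metric once per task, then average the per-task scores. The metric (mean absolute error), the toy numbers, and the helper names per_task_mae and mean_over_tasks are all assumptions made for this example. It is not taken from the MoleculeNet benchmark code, and the thread does not confirm that the benchmark works this way.

```python
from statistics import mean


def per_task_mae(y_true, y_pred):
    """Return one mean absolute error per task (one task per column)."""
    n_tasks = len(y_true[0])
    return [
        mean(abs(t[k] - p[k]) for t, p in zip(y_true, y_pred))
        for k in range(n_tasks)
    ]


def mean_over_tasks(y_true, y_pred):
    """Average the per-task scores into a single number (assumed aggregation)."""
    return mean(per_task_mae(y_true, y_pred))


# Toy data (hypothetical): 3 molecules, 2 regression tasks.
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y_pred = [[1.5, 10.0], [2.5, 22.0], [3.5, 34.0]]

task_scores = per_task_mae(y_true, y_pred)
overall = mean_over_tasks(y_true, y_pred)
print(task_scores, overall)
```

In this toy example the second task has larger errors simply because its values are on a larger scale, so it dominates the average. That is one reason the target normalization discussed in this thread matters when a single number is reported for a multitask dataset.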

FelixKatz77 commented 2 years ago

I think the target processing is relevant to all the regression tasks. I tried to figure out the mapping between the targets in the datasets downloaded from https://moleculenet.org/datasets-1 and the targets you can access via the 'y' label after loading the datasets via dc.molnet.load_dataset() but could not figure it out. Would be great if you could comment on this.

FelixKatz77 commented 2 years ago

I figured out the normalization using the 'transformers' argument in dc.molnet.load_dataset().
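
For readers with the same question, the sketch below shows the kind of mapping involved, assuming the processing is a plain z-score normalization of the targets (subtract the mean, divide by the standard deviation). It uses only the Python standard library and toy numbers. The helper names, the data, and the use of the population standard deviation are assumptions for illustration, not DeepChem's actual implementation.

```python
from statistics import mean, pstdev


def fit_zscore(values):
    """Return (mean, std) of the raw targets; population std is an assumption."""
    return mean(values), pstdev(values)


def normalize(values, mu, sigma):
    """Map raw targets to zero mean and unit variance."""
    return [(v - mu) / sigma for v in values]


def denormalize(values, mu, sigma):
    """Invert the normalization to recover targets in the original units."""
    return [v * sigma + mu for v in values]


# Toy targets (hypothetical), standing in for one regression task.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu, sigma = fit_zscore(raw)
scaled = normalize(raw, mu, sigma)          # what a normalized 'y' would hold
restored = denormalize(scaled, mu, sigma)   # back to the original values
print(mu, sigma, scaled[:3], restored == raw)
```

In DeepChem itself, the MoleculeNet loaders return the fitted transformers together with the task names and the datasets, so the inverse mapping can be taken from those objects instead of being recomputed by hand. Passing an empty list as the transformers argument should leave the targets unprocessed, though that is worth checking against the installed version.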

FelixKatz77 commented 2 years ago

If you have any more insights on the benchmarking for multitask datasets, I would still be happy to hear them. Thanks!