dengjianyuan / Respite_MPP


Issue with ESOL dataset #2

Open csnbritt opened 1 year ago

csnbritt commented 1 year ago

Hi. I enjoyed this paper, but I have some concerns about the source data. The ESOL data appears to be the same as the data in the Grover GitHub repo. If you compare it to the original Delaney paper on aqueous solubility (https://pubs.acs.org/doi/10.1021/ci034243x, see the supporting information), you can see that the Grover authors mistakenly used the values Delaney predicted from his own QSAR model as the labels, rather than the experimental data. Delaney's predictions are not the intended target for this task; it should be the measured values instead.

Edit: This probably also explains why Grover + RDKit descriptors does so well on this task in particular: some of the RDKit descriptors are the same as those used to make the predictions in the original paper.
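For anyone who wants to reproduce the comparison, here is a minimal sketch of one way to check which column a label set was taken from. It assumes you have already loaded three aligned lists of log-solubility values: the labels from the dataset in question, plus the measured and predicted values from Delaney's supporting information. The function names and the toy numbers are hypothetical and only there to make the example runnable; they are not taken from either repo or from the paper.

```python
from statistics import mean


def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return mean(abs(x - y) for x, y in zip(a, b))


def closer_source(labels, measured, predicted):
    """Report whether `labels` sits closer to the measured or the predicted values.

    All three inputs must be aligned row by row (same molecules, same order).
    """
    d_measured = mean_abs_diff(labels, measured)
    d_predicted = mean_abs_diff(labels, predicted)
    if d_measured == d_predicted:
        return "tie"
    return "measured" if d_measured < d_predicted else "predicted"


# Toy, made-up values (NOT real ESOL data), for illustration only.
measured = [-2.18, -3.30, -0.77, -4.10]
predicted = [-2.79, -3.09, -1.20, -3.65]

# Hypothetical label column that was copied from the predictions.
labels = list(predicted)

print(closer_source(labels, measured, predicted))
```

On the real files, a mean absolute difference of essentially zero against the predicted column would support the concern above, while a near-zero difference against the measured column would refute it.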

dengjianyuan commented 1 year ago

Hey Carson, thanks a lot for pointing this out. I will check the datasets asap and update you soon.