Dataset in the Code - GitHub Issues

klee5264 commented 3 years ago

Hello there,

Thanks for sharing such a nice idea and the code. It is motivating!

I am just beginning to reproduce your code and have run into an issue. Please correct me if I am wrong. According to the README, the file named 'chembl27_preprocessed_filtered_act_inact_comps_10.0_20.0_blast_comp_0.2.txt' should be the training dataset that you obtained by filtering the ChEMBL v23 data (about 15M data points), right?

So I expected the file to contain 769,935 data points, matching the number in the paper, but I found 2,292,989 target-ligand pairs in it, which is nearly three times as many. Did you update the file with additional data, or do I need to do some further processing to get down to 769,935 pairs? I am a little confused.

I'd appreciate it if you could help me with this.

Thanks

tuncadogan commented 3 years ago

Hi,

Thank you for your interest in DEEPScreen. "chembl27_preprocessed_filtered_act_inact_comps_10.0_20.0_blast_comp_0.2.txt" is the updated version of the ChEMBL v23 dataset described in our paper. It was constructed from ChEMBL database version 27, whereas the older dataset with 769,935 data points was constructed from v23. That is why it contains many more data points than the old one. You do not need to apply any further filtering to this dataset. If you wish to train/test a model, please follow the instructions in the README file in the PyTorch branch of our repo (which is the new main/default branch).
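
For anyone who wants to check the pair count themselves, here is a minimal sketch. It assumes each line of the file holds a target label of the form `<target_id>_act` or `<target_id>_inact`, then a tab, then a comma-separated list of compound IDs. That layout is an assumption and should be checked against the actual file. The helper names and the sample IDs below are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_pairs(lines):
    """Count target-compound pairs per label suffix (e.g. 'act', 'inact').

    Assumed line layout (not confirmed by the thread):
        <target_id>_<act|inact> <TAB> comp1,comp2,...
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, compounds = line.partition("\t")
        suffix = label.rsplit("_", 1)[-1]  # 'act' or 'inact' under the assumption
        counts[suffix] += len([c for c in compounds.split(",") if c])
    return counts


def count_pairs_in_file(path):
    """Apply count_pairs to a file on disk (hypothetical helper)."""
    with Path(path).open() as fh:
        return count_pairs(fh)


# Made-up sample lines, only to show the counting logic.
sample = [
    "CHEMBL0001_act\tCHEMBL10,CHEMBL11,CHEMBL12",
    "CHEMBL0001_inact\tCHEMBL20,CHEMBL21",
    "CHEMBL0002_act\tCHEMBL10",
]
counts = count_pairs(sample)
total = sum(counts.values())
print(dict(counts), total)
```

To run this on the real data, pass the path of the downloaded file to `count_pairs_in_file` and sum the values of the result. If the total does not match the 2,292,989 pairs reported above, the assumed layout is probably wrong.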

klee5264 commented 3 years ago

Superb, thanks a lot for the positive and prompt answer!

cansyl / DEEPScreen

Dataset in the Code #9