devalab / DeepPocket

Ligand Binding Site detection using Deep Learning
MIT License

Training Classifier Dataset #27

Closed drorhunvural closed 1 year ago

drorhunvural commented 1 year ago

Hi @RishalAggarwal,

Firstly, thank you for this repo.

python train.py -m model.py --train_types scPDB_train0.types --test_types scPDB_test0.types -i 200000 --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -r val0 -o /model_saves/val9 --base_lr 0.001 --solver Adam

As seen above, the file scPDB_train0.types is required to train the classifier.

Sample content of the scPDB_train0.types file is as follows:

0 -6.417309121621622 37.99337461018711 86.51209004677753 10mh_1/protein_0.gninatypes
0 -48.73792600326857 40.15845814418013 90.75518894134738 10mh_1/protein_0.gninatypes
0 -22.384561944279785 38.16762551867219 62.667952578541794 10mh_1/protein_0.gninatypes
0 4.418982018111255 43.43278783958602 81.18465174644241 10mh_1/protein_0.gninatypes
...
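For readers following along, the sample lines above appear to hold a 0/1 label, three coordinates, and a path to a .gninatypes file. A minimal parsing sketch under that assumption is shown here; the names `PocketSample` and `parse_types_line` are hypothetical and are not part of the DeepPocket repository.

```python
from typing import NamedTuple


class PocketSample(NamedTuple):
    """One line of a .types file, assuming the 5-column layout shown above."""
    label: int      # 1 = pocket, 0 = non-pocket
    x: float        # candidate pocket center coordinates
    y: float
    z: float
    receptor: str   # relative path to the receptor .gninatypes file


def parse_types_line(line: str) -> PocketSample:
    """Split one whitespace-separated line into a typed record."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    label, x, y, z, receptor = fields
    return PocketSample(int(label), float(x), float(y), float(z), receptor)


sample = parse_types_line(
    "0 -6.417309121621622 37.99337461018711 86.51209004677753 "
    "10mh_1/protein_0.gninatypes"
)
print(sample.label, sample.receptor)
```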

My first question is: how did you assign the labels (0 or 1) that mark whether each coordinate is a pocket? Is this a public dataset? It isn't mentioned in the paper either. How did you create this train file?

My second question is: if you labeled this dataset yourself, how can I do the same pocket / non-pocket (1 or 0) labeling by coordinates for my own protein files?

Note: None of the COACH420, HOLO4k, or scPDB datasets contain coordinates for non-druggable regions. How did you label the entries in your scPDB_train0.types file as 0 (non-druggable) or 1 (druggable)?

drdeeplearning commented 1 year ago

+1

I have the same question. This issue needs to be clarified before I cite.

RishalAggarwal commented 1 year ago

The coordinates in these files are pocket centers found using fpocket. If the pocket center is within 4 angstrom of any ligand atom it is labelled as 1, else 0.

drorhunvural commented 1 year ago

Sorry, but I couldn't follow your sentence "If the pocket center is within 4 angstrom of any ligand atom it is labelled as 1, else 0."

Are you using extra ligand files (sdf) to label the pocket coordinates as 1 or 0 after running fpocket? If so, how do you decide which of the coordinates returned by fpocket get the label 0? Are you using another tool for this?

RishalAggarwal commented 1 year ago

Yes, the ligands are present in separate sdf files in these datasets (for COACH420 and HOLO4k, I have provided extracted ligand files). To check whether a pocket center is within 4 angstroms of a ligand atom, you can write a simple Python script.
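As a rough illustration of the rule described in this thread (label 1 if a pocket center lies within 4 angstroms of any ligand atom, else 0), a minimal sketch follows. It is not the repository's script: the function name `label_pocket_centers` is hypothetical, the cutoff is treated as inclusive by assumption, and coordinates are passed in as plain tuples. Reading pocket centers from fpocket output and atom positions from the sdf files is left out.

```python
import math


def label_pocket_centers(centers, ligand_atoms, cutoff=4.0):
    """Return one 0/1 label per pocket center.

    centers      -- iterable of (x, y, z) pocket centers, e.g. from fpocket
    ligand_atoms -- iterable of (x, y, z) ligand atom positions
    cutoff       -- distance threshold in angstroms (inclusive; an assumption)
    """
    ligand_atoms = list(ligand_atoms)
    labels = []
    for center in centers:
        # A center is positive if at least one ligand atom is close enough.
        near = any(math.dist(center, atom) <= cutoff for atom in ligand_atoms)
        labels.append(1 if near else 0)
    return labels


centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(1.0, 1.0, 1.0), (3.0, 0.0, 0.0)]
labels = label_pocket_centers(centers, ligand)
print(labels)
```

With no ligand atoms supplied, every center is labeled 0, which matches the "else 0" branch of the rule.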

RishalAggarwal commented 1 year ago

See "make_types.py" for an example Python script showing how to do this.

drorhunvural commented 1 year ago

Thank you for the reply.

95% of the dataset is non-pocket (0) and 5% is pocket (1). Do you do anything to handle this imbalance and prevent overfitting, or do you train on the dataset as is? I couldn't find where this is addressed in the paper.

RishalAggarwal commented 1 year ago

Positive samples are oversampled during training.
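The thread does not say how the oversampling is implemented. One common approach, sketched here purely as an assumption, is to resample the positive examples with replacement until both classes are the same size. The function name `oversample_positives` and the `(label, item)` tuple layout are hypothetical.

```python
import random


def oversample_positives(samples, seed=0):
    """Balance classes by repeating positives drawn at random.

    samples -- list of (label, item) tuples with label 0 or 1
    """
    rng = random.Random(seed)
    positives = [s for s in samples if s[0] == 1]
    negatives = [s for s in samples if s[0] == 0]
    # Nothing to do if there are no positives or they are not the minority.
    if not positives or len(positives) >= len(negatives):
        return list(samples)
    # Draw extra positives with replacement to match the negative count.
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    balanced = negatives + positives + extra
    rng.shuffle(balanced)
    return balanced


# Toy data with the roughly 5% / 95% split mentioned above.
samples = [(1, "pocket_a")] + [(0, f"pocket_{i}") for i in range(19)]
balanced = oversample_positives(samples)
print(len(balanced), sum(label for label, _ in balanced))
```

Sampling each training batch with equal class probability is another way to get a similar effect without materializing a larger list.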

RishalAggarwal commented 1 year ago

Closing this; feel free to reopen with further issues.