devalab / DeepPocket

Ligand Binding Site detection using Deep Learning
MIT License

How to avoid data leakage? #11

Closed. hi7049 closed this 2 years ago

hi7049 commented 2 years ago

In the "Data sets and Preprocessing" section of your paper, you mention that "we removed all proteins from the training set that had either sequence identity greater than 50% or ligand similarity greater than 0.9 and sequence identity greater than 30%".

  1. How do you define sequence identity and ligand similarity?
  2. Could you provide the scripts to calculate sequence identity and ligand similarity?
  3. You mention two sequence identity thresholds, 50% and 30%. Do you mean protein sequence identity greater than 50% and ligand sequence identity greater than 30%?

RishalAggarwal commented 2 years ago

Hey, you can check out the code for creating sequence-based splits here: https://github.com/gnina/scripts#generating-clustered-cross-validation-splits-of-data

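For ligand similarity we use Tanimoto similarity, not sequence identity. Both percentages refer to protein sequence identity: a training protein is removed if its sequence identity is greater than 50%, and the lower 30% threshold applies only when the ligand Tanimoto similarity is also greater than 0.9.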
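
To make the rule concrete, here is a minimal sketch of the removal criterion as stated in the paper quote. It is not the code from the linked scripts. The function names `tanimoto` and `leaks_into_test` are hypothetical, the fingerprints are assumed to be given as sets of "on" bit indices (the fingerprint type is not stated in this thread), and the sequence identity is assumed to be a precomputed fraction between 0 and 1 from whatever alignment tool you use.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' fingerprint bits.

    Assumption: each ligand fingerprint is represented by the indices of
    its set bits. Two empty fingerprints are treated as having similarity 0.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def leaks_into_test(seq_identity, ligand_similarity,
                    strict_identity=0.5, relaxed_identity=0.3,
                    ligand_cutoff=0.9):
    """Return True if a training protein should be removed.

    Hypothetical helper encoding the rule quoted from the paper:
    remove if protein sequence identity > 50%, or if ligand Tanimoto
    similarity > 0.9 and protein sequence identity > 30%.
    """
    if seq_identity > strict_identity:
        return True
    return ligand_similarity > ligand_cutoff and seq_identity > relaxed_identity


# Example: 40% sequence identity alone is below the 50% threshold, so the
# outcome depends on how similar the two ligands are.
train_bits = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
test_bits = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
remove = leaks_into_test(0.40, tanimoto(train_bits, test_bits))
```

With identical ligands the 30% threshold applies, so `remove` is `True` here; with dissimilar ligands the same protein pair would be kept.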