How to avoid data leakage?

devalab / DeepPocket

Ligand Binding Site detection using Deep Learning

MIT License

89 stars 26 forks source link

In the "Data sets and Preprocessing" section of your paper, you mention that " we removed all proteins from the training set that had either sequence identity greater than 50% or ligand similarity greater than 0.9 and sequence identity greater than 30%".

How do you define sequence identity and ligand similarity?
Could you provide the scripts to calculate sequence identity and ligand similarity?
You mention twice sequence identities which are greater than 50% and 30%. Do you mean the protein sequence identity greater than 50% and ligand sequence identity greater than 30%?

devalab / DeepPocket

How to avoid data leakage? #11