regarding random split - Githubissues

BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction

GNU General Public License v3.0


Hi,

Thanks for putting this dataset out. Looking at https://github.com/amorehead/DIPS-Plus/blob/main/project/datasets/builder/partition_dataset_filenames.py#L68, it seems to me that the train/val/test splits are made randomly over individual files rather than with a "per folder" option. For example, the pairs 3a74.pdb1_0.dill and 3a74.pdb2_0.dill could end up in different splits, yet I assume they are very similar. At a glance, that seems to be the case for their sequences of residue names, though not for their 3D coordinates.

Could this be an issue for ML pipelines, given that near-duplicate pairs may be shared between the training and evaluation sets?
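To make the alternative concrete, here is a minimal sketch of a grouped split that keeps related files together. It is not taken from the repository: `pdb_id` and `group_split` are hypothetical helpers, the split fractions are arbitrary, and it assumes the part of the filename before the first dot (e.g. `3a74`) identifies files that belong together.

```python
import os
import random
from collections import defaultdict


def pdb_id(filename):
    # Assumption: the PDB ID is the part of the base filename before the
    # first dot, e.g. "3a74.pdb1_0.dill" -> "3a74".
    return os.path.basename(filename).split(".", 1)[0]


def group_split(filenames, val_frac=0.1, test_frac=0.1, seed=42):
    """Split filenames into train/val/test so that all files sharing a
    PDB ID land in the same split (hypothetical helper, not the repo's)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[pdb_id(name)].append(name)

    # Shuffle the group keys, not the individual files.
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)

    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test_ids = ids[:n_test]
    val_ids = ids[n_test:n_test + n_val]
    train_ids = ids[n_test + n_val:]

    def expand(id_list):
        return [name for i in id_list for name in groups[i]]

    return expand(train_ids), expand(val_ids), expand(test_ids)
```

With this, 3a74.pdb1_0.dill and 3a74.pdb2_0.dill always fall into the same split. Note that it only groups by the filename prefix, so it would not catch similar complexes stored under different PDB IDs.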

Thanks

regarding random split #2