BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction
https://zenodo.org/record/5134732
GNU General Public License v3.0

regarding random split #2

Closed: octavian-ganea closed this issue 3 years ago

octavian-ganea commented 3 years ago

Hi,

Thanks for putting this dataset out. Looking at https://github.com/amorehead/DIPS-Plus/blob/main/project/datasets/builder/partition_dataset_filenames.py#L68, it seems to me that the train/val/test splits are made by shuffling individual filenames at random rather than with a "per folder" option. For example, the pairs 3a74.pdb1_0.dill and 3a74.pdb2_0.dill could end up in different splits, even though I assume they are very similar (at a glance, this seems to be the case for their sequences of residue names, but not for their 3D coordinates).
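To make the concern concrete, here is a minimal sketch of how one might check a set of splits for this kind of overlap. It assumes the PDB code is the part of the filename before the first dot (as in 3a74.pdb1_0.dill); the helper names `pdb_id` and `leaked_ids` and the example filenames other than the two 3a74 entries are hypothetical and not taken from the repository.

```python
import os
from collections import defaultdict


def pdb_id(filename):
    """Return the PDB code, assumed to be the text before the first dot."""
    return os.path.basename(filename).split(".", 1)[0]


def leaked_ids(splits):
    """Map each PDB code that occurs in more than one split to those split names."""
    seen = defaultdict(set)
    for split_name, filenames in splits.items():
        for name in filenames:
            seen[pdb_id(name)].add(split_name)
    return {code: sorted(names) for code, names in seen.items() if len(names) > 1}


# Hypothetical splits; only the two 3a74 filenames come from the example above.
splits = {
    "train": ["3a74.pdb1_0.dill", "1abc.pdb1_0.dill"],
    "val": ["3a74.pdb2_0.dill"],
    "test": ["2xyz.pdb1_0.dill"],
}
print(leaked_ids(splits))  # {'3a74': ['train', 'val']}
```

An empty result would mean that no PDB code is shared between splits under this naming assumption.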

Can this be an issue for ML pipelines?

Thanks

amorehead commented 3 years ago

Hi, @octavian-ganea.

Thank you for pointing this out. I believe you are right about my original implementation of this filename shuffling. I neglected to update this file after implementing the "per folder" option in another (currently private) repository. I have just pushed a change that makes the "per folder" dataset filename split the default strategy in this repository.
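For readers following along, the general idea of a "per folder" split can be sketched roughly as below. This is not the code from partition_dataset_filenames.py; it is an illustrative grouped split that assumes each file path starts with its folder name, and the function name `group_split`, the 80/10/10 fractions, the seed, and the example paths are all hypothetical.

```python
import os
import random
from collections import defaultdict


def group_split(filenames, key, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split filenames into train/val/test so that each group stays in one split."""
    groups = defaultdict(list)
    for name in filenames:
        groups[key(name)].append(name)

    # Shuffle the group keys (e.g. folders), not the individual files.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    parts = (
        keys[:n_train],
        keys[n_train:n_train + n_val],
        keys[n_train + n_val:],  # the test split takes whatever remains
    )
    return tuple([f for k in part for f in groups[k]] for part in parts)


# Hypothetical paths of the form "<folder>/<file>"; os.path.dirname is the group key.
files = [
    "a7/3a74.pdb1_0.dill",
    "a7/3a74.pdb2_0.dill",
    "b2/1b2c.pdb1_0.dill",
    "c3/2c3d.pdb1_0.dill",
]
train, val, test = group_split(files, key=os.path.dirname)
```

Because whole folders are assigned to a split, files such as 3a74.pdb1_0.dill and 3a74.pdb2_0.dill, which share a folder in this example, always land in the same split. The split sizes are only approximate when folders contain different numbers of files.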

If you have any other questions or concerns, please let me know and feel free to reopen this issue. Thanks again!