Closed: octavian-ganea closed this issue 3 years ago
Hi, @octavian-ganea.
Thank you for pointing this out. I believe you are right about my original implementation of the filename shuffling. I neglected to update this file after implementing the "per folder" option in another (currently private) repository. I have just pushed a change that makes the "per folder" dataset filename split the default strategy in this repository.
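For anyone reading along, here is a minimal sketch of what a "per folder" split could look like. It is not the code in partition_dataset_filenames.py; the function name `split_by_folder`, the fractions, the seed, and the assumption that each filename carries its folder as a path prefix (e.g. `a7/3a74.pdb1_0.dill`) are all hypothetical and only illustrate the idea of shuffling folders instead of individual files.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_folder(filenames, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign whole folders (not single files) to train/val/test.

    Assumption: every filename looks like "<folder>/<file>", so files that
    share a folder are treated as one indivisible group.
    """
    # Group filenames by their parent folder.
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePosixPath(name).parent.as_posix()].append(name)

    # Shuffle the folders, not the files, with a fixed seed for repeatability.
    folders = sorted(groups)
    random.Random(seed).shuffle(folders)

    n_val = int(len(folders) * val_frac)
    n_test = int(len(folders) * test_frac)
    assignment = {
        "val": folders[:n_val],
        "test": folders[n_val:n_val + n_test],
        "train": folders[n_val + n_test:],
    }

    # Expand each split's folders back into filenames.
    return {
        split: sorted(f for folder in members for f in groups[folder])
        for split, members in assignment.items()
    }
```

With a split like this, 3a74.pdb1_0.dill and 3a74.pdb2_0.dill always land in the same partition as long as they sit in the same folder.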
If you have any other questions or concerns, please let me know and feel free to reopen this issue. Thanks again!
Hi,
Thanks for putting this dataset out. Looking at https://github.com/amorehead/DIPS-Plus/blob/main/project/datasets/builder/partition_dataset_filenames.py#L68, it seems to me that the train/val/test splits are made randomly over individual files rather than "per folder". For example, the pairs 3a74.pdb1_0.dill and 3a74.pdb2_0.dill could end up in different splits, even though I assume they are very similar (at a glance, this seems to be the case for their sequences of residue names, though not for their 3D coordinates).
Could this be an issue for ML pipelines?
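To make the concern concrete, here is a rough sketch of how one might check a set of splits for this kind of overlap. The helper names `pdb_id` and `find_leaks` are made up for illustration, and the sketch assumes the PDB code is the part of the filename before the first dot, as in 3a74.pdb1_0.dill.

```python
from collections import defaultdict


def pdb_id(filename):
    """Return the assumed PDB code, e.g. "a7/3a74.pdb1_0.dill" -> "3a74"."""
    basename = filename.rsplit("/", 1)[-1]
    return basename.split(".", 1)[0]


def find_leaks(splits):
    """Map each PDB code found in more than one split to those split names.

    `splits` is a dict such as {"train": [...], "val": [...], "test": [...]}.
    """
    seen = defaultdict(set)
    for split, names in splits.items():
        for name in names:
            seen[pdb_id(name)].add(split)
    return {pid: sorted(where) for pid, where in seen.items() if len(where) > 1}
```

An empty result would mean no PDB code is shared between splits under this naming assumption; it says nothing about similarity between different PDB entries.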
Thanks