Adding Binding residues (Bind) splits

joaquimgomez commented 2 years ago

Once applied, this PR will add Binding residues (Bind) splits.

Proposed splits:

Working at protein-level (train on sequences with X ligand type(s), test on the rest; ~18% of the proteins have more than one type of ligand):
- one_vs_many: train on proteins with only 1 type of ligand, test on proteins with 2 and 3 types of ligands
- two_vs_many: train on proteins with 1 or 2 types of ligand, test on proteins with 3 types of ligand
- from_publication: train on proteins with 1, 2 or 3 types of ligand from original training set, test on original test sets TestSet300 and TestSetNew46 (mixed).
Working at residue-level (train on sequences whose residues are each assigned to only 1 type of ligand, test on sequences with residues assigned to multiple classes; ~4% of the residues have more than one type of ligand):
- one_vs_sm: train on proteins with residues having only one type of ligand, test on proteins with residues having Small+Metal ligands
- one_vs_mn: train as one_vs_sm but with balanced classes, test on proteins with residues having Metal+Nuclear ligands
- one_vs_sn: train as one_vs_sm but with balanced classes, test on proteins with residues having Small+Nuclear ligands

The splits one_vs_many, two_vs_many and from_publication aim to analyze how training and testing are affected by proteins that have more than one type of ligand, i.e., a protein may have only residues that bind metal ligands, or it may have some residues that bind a metal ligand and other residues that bind a small ligand.
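
As a rough illustration of the protein-level rule (not the code used to generate the files in this PR), here is a minimal sketch. It assumes a hypothetical mapping from protein ID to the set of ligand types found anywhere in that protein; the function name, the IDs and the toy data are made up for the example.

```python
from typing import Dict, List, Set, Tuple


def split_by_ligand_type_count(
    ligand_types: Dict[str, Set[str]], max_train_types: int
) -> Tuple[List[str], List[str]]:
    """Train on proteins with at most `max_train_types` ligand types, test on the rest."""
    train: List[str] = []
    test: List[str] = []
    for protein_id, types in sorted(ligand_types.items()):
        if len(types) <= max_train_types:
            train.append(protein_id)
        else:
            test.append(protein_id)
    return train, test


# Hypothetical toy data: protein ID -> ligand types bound by any of its residues.
proteins = {
    "P1": {"metal"},
    "P2": {"metal", "small"},
    "P3": {"metal", "small", "nuclear"},
    "P4": {"small"},
}

one_vs_many = split_by_ligand_type_count(proteins, max_train_types=1)
two_vs_many = split_by_ligand_type_count(proteins, max_train_types=2)
```

With this toy data, one_vs_many trains on P1 and P4, while two_vs_many additionally moves P2 into the training set.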

On the other hand, the splits one_vs_sm, one_vs_mn and one_vs_sn aim to analyze the impact of training on proteins whose residues have only one type of ligand and testing on proteins that have multi-ligand residues.
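
The residue-level rule could be sketched as follows. This is only an illustration under assumptions: each protein is represented as a list of per-residue ligand-type sets, "only one type of ligand" is read as "no single residue has more than one type", and the class balancing mentioned for one_vs_mn and one_vs_sn is left out. All names and data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# One entry per residue: the set of ligand types that residue binds (empty if none).
ResidueLabels = List[FrozenSet[str]]


def residue_level_split(
    proteins: Dict[str, ResidueLabels], test_combo: FrozenSet[str]
) -> Tuple[List[str], List[str]]:
    """Train on proteins without multi-ligand residues, test on those with `test_combo`."""
    train: List[str] = []
    test: List[str] = []
    for protein_id, residues in sorted(proteins.items()):
        if all(len(types) <= 1 for types in residues):
            train.append(protein_id)
        elif any(types == test_combo for types in residues):
            test.append(protein_id)
        # Proteins with other multi-ligand combinations are left out of this split.
    return train, test


# Hypothetical toy data.
toy = {
    "P1": [frozenset(), frozenset({"metal"}), frozenset({"small"})],
    "P2": [frozenset({"small", "metal"}), frozenset()],
    "P3": [frozenset({"metal", "nuclear"})],
}

one_vs_sm = residue_level_split(toy, frozenset({"small", "metal"}))
one_vs_mn = residue_level_split(toy, frozenset({"metal", "nuclear"}))
```

The training side is identical across the three splits in this sketch; only the residue combination selected for testing changes.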

To do:

[x] Upload to GitHub final splits files
[x] Add README
[x] Add splits to FLIP/splits/README.md
[ ] Add original data to the server (?)

Open issues:

[ ] Impossible to create statistics for the splits because the target is not a single value.
[ ] Do residue indexes start at 0? The splits were implemented assuming they do.
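
To make the indexing question concrete, here is a small sketch of how a list of binding-residue positions turns into a per-residue label string under either convention. The function name and the label characters ("M" for bound, "-" for unbound) are placeholders for illustration, not the encoding used in the split files.

```python
from typing import Iterable


def labels_from_indices(
    seq_len: int,
    binding_indices: Iterable[int],
    bound: str = "M",
    unbound: str = "-",
    one_based: bool = False,
) -> str:
    """Build a per-residue label string from a list of binding positions."""
    offset = 1 if one_based else 0
    labels = [unbound] * seq_len
    for idx in binding_indices:
        pos = idx - offset
        if not 0 <= pos < seq_len:
            raise IndexError(f"residue index {idx} is outside a sequence of length {seq_len}")
        labels[pos] = bound
    return "".join(labels)


# The same two residues, written in each convention, give the same label string.
zero_based = labels_from_indices(5, [0, 3])
one_based = labels_from_indices(5, [1, 4], one_based=True)
```

If the source annotations turn out to be 1-based, reading them as 0-based would shift every label one position to the right, so the range check above is a cheap way to catch at least the last-residue case.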

sacdallago commented 2 years ago

> three_vs_many: train on proteins with 1 or 2 types of ligand from original training set, test on original test sets TestSet300 and TestSetNew46 mixed.

Why is it called three_vs_many?

joaquimgomez commented 2 years ago

> three_vs_many: train on proteins with 1 or 2 types of ligand from original training set, test on original test sets TestSet300 and TestSetNew46 mixed.
>
> Why is it called three_vs_many?

There was a typo. "with 1 or 2 types of ligand" ==> "with 1, 2 or 3 types of ligand". If you prefer another name, I am open to suggestions. Maybe all or many_vs_many?

sacdallago commented 2 years ago

Some observations:

- It would be beneficial to add a separate section with a legend in the README for the types (instead of just having it in the target).
- three_vs_many and one_vs_many encode the target differently (a stringified array vs. a string).
- If I understand correctly, the three_vs_many split is the same as in the publication? Then just call it "from_publication" or something :) makes it easier to know: OK, this is the one that I can compare to whatever table is in the manuscript.
- This is a residue to class problem, as such, CSV is probably not the best way to encode the data. You can use the standard we developed in bio-trainer: https://github.com/sacdallago/biotrainer/blob/main/data_standardization.md#residue---class . This translates to: you can have a single "master" sequence file (simply call it sequences.fasta), and then you can have as many label files as you have splits
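
A rough sketch of the "one sequence file, one label file per split" idea follows. The header attributes (`SET=...`, `VALIDATION=...`) are written from memory of the linked standardization document and should be treated as an assumption to verify against it; the function name, protein IDs and label characters are made up for the example.

```python
from typing import Dict, Set


def render_label_file(labels: Dict[str, str], test_ids: Set[str]) -> str:
    """Render a FASTA-like label file: one header plus one label string per sequence."""
    lines = []
    for seq_id, label_string in sorted(labels.items()):
        split = "test" if seq_id in test_ids else "train"
        # Assumed header layout; check the linked document before relying on it.
        lines.append(f">{seq_id} SET={split} VALIDATION=False")
        lines.append(label_string)
    return "\n".join(lines) + "\n"


# Hypothetical per-residue labels, one character per residue of each sequence.
labels = {"P1": "M--M-", "P2": "-S---"}

# Each split gets its own label file; sequences.fasta stays shared.
one_vs_many_labels = render_label_file(labels, test_ids={"P2"})
```

The point of the layout is that the sequences are stored once, and each split only changes which IDs are marked as train or test in its own label file.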

joaquimgomez commented 2 years ago

> Some observations:
>
> It would be beneficial to add a separate section with a legend in the README for the types (instead of just having it in the target).

-> Legend section added to the README.

> three_vs_many and one_vs_many encode the target differently (a stringified array vs. a string).

-> Solved.

> If I understand correctly, the three_vs_many split is the same as in the publication? Then just call it "from_publication" or something :) makes it easier to know: OK, this is the one that I can compare to whatever table is in the manuscript.

-> Name changed to "from_publication" everywhere.

> This is a residue to class problem, as such, CSV is probably not the best way to encode the data. You can use the standard we developed in bio-trainer: https://github.com/sacdallago/biotrainer/blob/main/data_standardization.md#residue---class . This translates to: you can have a single "master" sequence file (simply call it sequences.fasta), and then you can have as many label files as you have splits

-> Split files changed to follow the bio-trainer standardization.

joaquimgomez commented 2 years ago

I double-checked all the files concerning the PR. I think they are ready for a merge if there are no more comments.

sacdallago commented 2 years ago

Great!

J-SNACKKB / FLIP

Adding Binding residues (Bind) splits #13