Closed joaquimgomez closed 2 years ago
three_vs_many: train on proteins with 1 or 2 types of ligand from original training set, test on original test sets TestSet300 and TestSetNew46 mixed.
Why is it called three_vs_many?
There was a typo. "with 1 or 2 types of ligand" ==> "with 1, 2 or 3 types of ligand". If you prefer another name, I am open to suggestions. Maybe `all` or `many_vs_many`?
Some observations:
- It would be beneficial to add a separate section with a legend in the README for the types (instead of just having it in the target).
-> Legend section added to the README.
- `three_vs_many` and `one_vs_many` encode the target differently (a stringified array vs. a string).
-> Solved.
- If I understand correctly, the `three_vs_many` split is the same as in the publication? Then just call it "from_publication" or something :) That makes it easier to know: OK, this is the one that I can compare to whatever table is in the manuscript.
-> Name changed to "from_publication" everywhere.
- This is a residue-to-class problem; as such, CSV is probably not the best way to encode the data. You can use the standard we developed in bio-trainer: https://github.com/sacdallago/biotrainer/blob/main/data_standardization.md#residue---class . This translates to: you can have a single "master" sequence file (simply call it sequences.fasta), and then you can have as many label files as you have splits.
-> Split files changed to follow the bio-trainer standardization.
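For readers unfamiliar with the layout suggested above, here is a minimal sketch of one shared sequence file plus one label file per split. It is an illustration under assumptions, not the code used in this PR: the helper names, the toy protein IDs, and the per-residue label characters are hypothetical, and the `SET=` / `VALIDATION=` header attributes reflect my reading of the linked bio-trainer document, which should be checked for the authoritative format.

```python
from typing import Dict, Tuple


def format_sequences_fasta(sequences: Dict[str, str]) -> str:
    """Render the single "master" sequence file shared by all splits."""
    return "".join(f">{pid}\n{seq}\n" for pid, seq in sequences.items())


def format_labels_fasta(
    sequences: Dict[str, str],
    labels: Dict[str, str],
    split: Dict[str, Tuple[str, bool]],
) -> str:
    """Render the label file for one split.

    `split` maps a protein id to (set name, is_validation).
    The header attributes are an assumption based on the linked standard.
    """
    lines = []
    for pid, (set_name, is_validation) in split.items():
        residue_labels = labels[pid]
        # Residue-to-class data needs exactly one label per residue.
        if len(residue_labels) != len(sequences[pid]):
            raise ValueError(f"{pid}: expected one label per residue")
        lines.append(f">{pid} SET={set_name} VALIDATION={is_validation}")
        lines.append(residue_labels)
    return "\n".join(lines) + "\n"


# Toy data, not taken from the dataset; the label alphabet is hypothetical.
sequences = {"P1": "MKTAY", "P2": "GSHM"}
labels = {"P1": "00M00", "P2": "0S00"}
from_publication = {"P1": ("train", False), "P2": ("test", False)}

sequences_text = format_sequences_fasta(sequences)
labels_text = format_labels_fasta(sequences, labels, from_publication)
```

Each additional split would reuse `sequences_text` unchanged and only need its own call to `format_labels_fasta` with a different `split` mapping, which is the main advantage over one CSV per split.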
I double-checked all the files concerning the PR. I think they are ready for a merge if there are no more comments.
Great!
Once applied, this PR will add Binding residues (Bind) splits.
Proposed splits:
Working at protein level (train on sequences with X ligand type(s), test on the rest; ~18% of the proteins have more than one type of ligand):
- `one_vs_many`: train on proteins with only 1 type of ligand, test on proteins with 2 or 3 types of ligand.
- `two_vs_many`: train on proteins with 1 or 2 types of ligand, test on proteins with 3 types of ligand.
- `from_publication`: train on proteins with 1, 2 or 3 types of ligand from the original training set, test on the original test sets TestSet300 and TestSetNew46 (mixed).
Working at residue level (train on sequences with residues assigned to only 1 type of ligand, test on sequences with residues assigned to multiple classes; ~4% of the residues have more than one type of ligand):
- `one_vs_sm`: train on proteins with residues having only one type of ligand, test on proteins with residues having Small+Metal ligands.
- `one_vs_mn`: train as `one_vs_sm` but with balanced classes, test on proteins with residues having Metal+Nuclear ligands.
- `one_vs_sn`: train as `one_vs_sm` but with balanced classes, test on proteins with residues having Small+Nuclear ligands.
The splits `one_vs_many`, `two_vs_many` and `from_publication` aim to analyze the impact of training and testing while taking into account that a protein can have more than one type of ligand. For example, some proteins have only metal-binding residues, while others have some residues that bind a metal ligand and other residues that bind a small ligand.
On the other hand, the splits `one_vs_sm`, `one_vs_mn` and `one_vs_sn` aim to analyze the impact of training on proteins whose residues each have only one type of ligand and testing on proteins that have multi-ligand residues.
To do:
Open issues: