J-SNACKKB / FLIP

A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design
Academic Free License v3.0

Adding Binding residues (Bind) splits #13

Closed joaquimgomez closed 2 years ago

joaquimgomez commented 2 years ago

Once applied, this PR will add Binding residues (Bind) splits.

Proposed splits:

The splits one_vs_many, two_vs_many and from_publication aim to analyze the impact of training/testing given that a protein can bind more than one type of ligand, i.e., some proteins have only metal-binding residues, while others have metal-binding residues alongside residues that bind small ligands.

On the other hand, the splits one_vs_sm, one_vs_mn and one_vs_sn aim to analyze the impact of training on proteins whose residues each bind only one type of ligand and testing on proteins that have multi-ligand residues.
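
The idea behind the single-ligand vs. multi-ligand splits can be sketched as follows. This is only an illustration, not the code used in this PR: the `Protein` representation (one set of ligand types per residue), the type names `"metal"` and `"small"`, and the helper names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical in-memory representation: one set of ligand types per residue.
# An empty set marks a non-binding residue.
Protein = List[Set[str]]


def ligand_types(protein: Protein) -> Set[str]:
    """Return every ligand type bound anywhere in the protein."""
    return set().union(*protein)


def has_multi_ligand_residue(protein: Protein) -> bool:
    """True if at least one residue binds more than one ligand type."""
    return any(len(residue) > 1 for residue in protein)


def split_single_vs_multi(proteins: Dict[str, Protein]) -> Dict[str, List[str]]:
    """Train on proteins without multi-ligand residues, test on the rest."""
    split: Dict[str, List[str]] = {"train": [], "test": []}
    for protein_id, protein in proteins.items():
        key = "test" if has_multi_ligand_residue(protein) else "train"
        split[key].append(protein_id)
    return split


# Toy data, invented for the example.
proteins = {
    "P1": [set(), {"metal"}, {"metal"}],               # one ligand type
    "P2": [{"metal"}, {"small"}, set()],               # two types, separate residues
    "P3": [{"metal", "small"}, set(), {"small"}],      # one multi-ligand residue
}

split = split_single_vs_multi(proteins)
print(split)  # {'train': ['P1', 'P2'], 'test': ['P3']}
```

Counting `len(ligand_types(protein))` instead would give the per-protein number of ligand types, which is the quantity the one_vs_many and two_vs_many names appear to refer to.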

To do:

Open issues:

sacdallago commented 2 years ago

> three_vs_many: train on proteins with 1 or 2 types of ligand from original training set, test on original test sets TestSet300 and TestSetNew46 mixed.

Why is it called three_vs_many?

joaquimgomez commented 2 years ago

> three_vs_many: train on proteins with 1 or 2 types of ligand from original training set, test on original test sets TestSet300 and TestSetNew46 mixed.
>
> Why is it called three_vs_many?

There was a typo. "with 1 or 2 types of ligand" ==> "with 1, 2 or 3 types of ligand". If you prefer another name, I am open to suggestions. Maybe all or many_vs_many?

sacdallago commented 2 years ago

Some observations:

  1. It would be beneficial to add a separate section with a legend in the README for the types (instead of just having it in the target).
  2. three_vs_many and one_vs_many encode the target differently (a stringified array vs. a string).
  3. If I understand correctly, the three_vs_many split is the same as in the publication? Then just call it "from_publication" or something :) makes it easier to know: OK, this is the one that I can compare to whatever table is in the manuscript.
  4. This is a residue-to-class problem, so CSV is probably not the best way to encode the data. You can use the standard we developed in bio-trainer: https://github.com/sacdallago/biotrainer/blob/main/data_standardization.md#residue---class . This translates to: you can have a single "master" sequence file (simply call it sequences.fasta), and then as many label files as you have splits.

joaquimgomez commented 2 years ago

> Some observations:
>
> 1. It would be beneficial to add a separate section with a legend in the README for the types (instead of just having it in the target).

-> Legend section added to the README.

> 2. three_vs_many and one_vs_many encode the target differently (a stringified array vs. a string).

-> Solved.

> 3. If I understand correctly, the three_vs_many split is the same as in the publication? Then just call it "from_publication" or something :) makes it easier to know: OK, this is the one that I can compare to whatever table is in the manuscript.

-> Name changed to "from_publication" everywhere.

> 4. This is a residue-to-class problem, so CSV is probably not the best way to encode the data. You can use the standard we developed in bio-trainer: https://github.com/sacdallago/biotrainer/blob/main/data_standardization.md#residue---class . This translates to: you can have a single "master" sequence file (simply call it sequences.fasta), and then as many label files as you have splits.

-> Split files changed to follow the bio-trainer standardization.
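
For readers unfamiliar with the layout, here is a minimal sketch of how one shared sequences.fasta can be paired with a per-split label file. It assumes, as I understand the linked standard, that each label record reuses the sequence ID, carries `SET=` and `VALIDATION=` attributes in its header, and holds one label character per residue; please check the linked document for the authoritative format. The function names, the toy sequences and the label characters are made up for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, str], str]  # (header attributes, body string)


def parse_fasta(text: str) -> Dict[str, Record]:
    """Map each record ID to its KEY=VALUE header attributes and its body."""
    records: Dict[str, Record] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0]
            attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            records[current_id] = (attrs, "")
        elif current_id is not None:
            attrs, body = records[current_id]
            records[current_id] = (attrs, body + line)
    return records


def load_split(sequences_text: str, labels_text: str) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group (id, sequence, labels) triples by the SET attribute of the label file."""
    sequences = parse_fasta(sequences_text)
    labels = parse_fasta(labels_text)
    split: Dict[str, List[Tuple[str, str, str]]] = {}
    for seq_id, (attrs, label_string) in labels.items():
        sequence = sequences[seq_id][1]
        # Residue-to-class data needs exactly one label per residue.
        if len(sequence) != len(label_string):
            raise ValueError(f"{seq_id}: {len(sequence)} residues but {len(label_string)} labels")
        split.setdefault(attrs.get("SET", "unknown"), []).append((seq_id, sequence, label_string))
    return split


# Toy file contents; IDs, sequences and label characters are invented.
SEQUENCES = """>P1
MKTAYIA
>P2
GSHMLE
"""

LABELS = """>P1 SET=train VALIDATION=False
--M-S--
>P2 SET=test VALIDATION=False
-M----
"""

split = load_split(SEQUENCES, LABELS)
print(sorted(split))  # ['test', 'train']
```

Because the sequences live in one file, adding a new split only requires a new label file with different `SET` assignments.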

joaquimgomez commented 2 years ago

I double-checked all the files concerning the PR. I think they are ready for a merge if there are no more comments.

sacdallago commented 2 years ago

Great!