OpenBioML / protein-lm-scaling

Create evaluation dataset for the contact prediction task #43

Open pascalnotin opened 9 months ago

pascalnotin commented 9 months ago

This issue covers creating the data that will support the evaluation of the contact prediction task (see issue #19).

We need to select ~100 protein sequences that cover a balanced spectrum of:

  1. Depth of alignment / number of homologs (covering a wide range of depths, as in Fig. 1B of the ESM-2 paper)
  2. Taxa (prokaryotes, humans, other eukaryotes, viruses)
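One possible way to get a balanced selection over both criteria is stratified sampling: bin the alignment depth on a log scale, group candidates by (taxon, depth bin), and draw from the groups in turn. The sketch below is only an illustration. The record fields (`id`, `depth`, `taxon`), the helper names `depth_bin` and `select_balanced`, and the number and range of bins are all assumptions, not something fixed in this issue.

```python
import math
import random
from collections import defaultdict


def depth_bin(depth, n_bins=4, max_log10=5.0):
    """Map an alignment depth to a log-scale bin index in [0, n_bins - 1].

    n_bins and max_log10 are illustrative defaults (assumption).
    """
    log_depth = math.log10(max(depth, 1))
    idx = int(log_depth / max_log10 * n_bins)
    return min(idx, n_bins - 1)


def select_balanced(candidates, n_total=100, seed=0):
    """Pick up to n_total candidates, cycling over (taxon, depth bin) strata.

    Each candidate is a dict with hypothetical keys "id", "depth", "taxon".
    """
    rng = random.Random(seed)

    # Group candidates into strata.
    strata = defaultdict(list)
    for c in candidates:
        strata[(c["taxon"], depth_bin(c["depth"]))].append(c)

    # Shuffle within each stratum so the pick is random but reproducible.
    for members in strata.values():
        rng.shuffle(members)

    # Round-robin over strata until the target size is reached
    # or every stratum is exhausted.
    selected = []
    keys = sorted(strata)
    while len(selected) < n_total and any(strata[k] for k in keys):
        for k in keys:
            if strata[k] and len(selected) < n_total:
                selected.append(strata[k].pop())
    return selected
```

With four taxa and four depth bins this gives 16 strata, so a target of ~100 sequences would mean roughly six per stratum where enough candidates exist. Sparse strata (for example, deep alignments for viral proteins) would simply contribute fewer sequences.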

We then need to extract the contact maps for each protein, which we will use as ground truth in the evaluation.
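As a rough sketch of the extraction step, a contact map can be derived from per-residue 3D coordinates by thresholding pairwise distances. The issue does not fix a contact definition, so everything here is an assumption: one coordinate per residue (commonly Cβ, with Cα for glycine), an 8 Å cutoff, and a minimum sequence separation of 6 between residues. The function name `contact_map` is hypothetical, and parsing coordinates from structure files is left out.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return a symmetric boolean matrix of residue-residue contacts.

    coords: list of (x, y, z) tuples in angstroms, one per residue.
    threshold: distance cutoff in angstroms (assumed convention).
    min_separation: minimum |i - j| for a pair to be considered (assumed).
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = True
                contacts[j][i] = True
    return contacts


# Toy example: seven residues on a straight line, then one that folds
# back next to the first residue.
toy = [(3.8 * i, 0.0, 0.0) for i in range(7)] + [(0.0, 5.0, 0.0)]
toy_map = contact_map(toy)
```

Pairs closer in sequence than `min_separation` are left as `False` here; whether to mask them out or keep them is a design choice the evaluation would need to settle.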