OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks

Clarification on Scoring + MSA Transformer Request #33

Closed · brycejoh16 closed this issue 3 weeks ago

brycejoh16 commented 4 months ago

Hi ProteinGym team,

Thank you for providing both a supervised and an unsupervised benchmark to the community. This resource makes it 100x easier to benchmark and compare models. The community was in dire need of such a tool.

However, I have a few questions:

  1. Can you please describe how these scores were calculated in this scoring file for single mutant supervised splits:
    https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv

Specifically the columns labeled: Spearman_fitness, Std_dev_Spearman_fitness, num_obs_Spearman_fitness, standardized_Spearman_fitness

My guess is that Spearman_fitness is the mean across the 5 splits (4 in two cases), computed on the test set only for each DMS_id, with the values left unmanipulated. Std_dev_Spearman_fitness would then be the standard deviation across those 5 splits, and num_obs_Spearman_fitness the number of observed datapoints entering that Spearman correlation, which I assume should always equal the number of datapoints in the test fold, correct? I'm asking because it looks like you also compute a Spearman on the validation set during training, so I wanted to make sure the reported value isn't that one. Is standardized_Spearman_fitness the case where you standardize the training set (values between 0-1) for each fold before training, which affects the final Spearman correlation? And the Spearman values reported on proteingym.org and in the ProteinNPT paper are the non-standardized ones, not the standardized values, correct? (A quick snippet showing how I'm reading these columns follows my questions below.)

Can you also please comment on whether you are calculating the Spearman differently for ProteinNPT? I guess I'm unclear on why there would ever be NaNs in the experimental values.

  2. Would it be possible to download the 1TB of MSA Transformer embeddings, even just for the single mutants? Or perhaps you saved the mean-pooled embeddings, since they would be much smaller in dimension and probably only take up a few GBs? I know that may be a big ask, but I have the space to download them and it would save a lot of time and effort :)
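
For reference, here is roughly how I am reading the scoring file and the columns in question (a minimal sketch assuming pandas; the URL and column names are the ones listed above):

```python
import pandas as pd

# Supervised single-mutant scoring file referenced in question 1
url = "https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv"
scores = pd.read_csv(url)

# The four columns I am asking about
cols = [
    "Spearman_fitness",
    "Std_dev_Spearman_fitness",
    "num_obs_Spearman_fitness",
    "standardized_Spearman_fitness",
]
print(scores[cols].describe())
```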

Take care, Bryce

pascalnotin commented 3 months ago

Hi @brycejoh16,

Thank you for the kind words!

  1. The scores in this file were computed with the training script available in the ProteinNPT repo. To summarize:

    • Spearman_fitness is the Spearman computed across all test folds combined (although conclusions should be similar if we instead looked at the average Spearman across the 5 folds); a sketch of how these computations relate follows this list
    • you are correct about Std_dev_Spearman_fitness (std deviation across the 5 splits) and num_obs_Spearman_fitness
    • although we considered it in the early stages of development, we do not use the validation set to decide when to stop training; instead we always train for a fixed number of steps (10k) --> the use_validation_set parameter is not used / set to False in our training script.
    • standardized_Spearman_fitness is computed by first standard normalizing test scores from each fold separately, then combining them together and computing the spearman on the combined set (this is legacy code from early stages of development / something that can be ignored)
    • NaN values may show up when computing the Spearman in the multi-property setting, since not all property values are assayed for every mutant
  2. Unfortunately, we do not currently have plans to release these embeddings given the size involved (we store full sequence embeddings / not mean-pooled as we leverage the token-level granularity in the axial attention of ProteinNPT).
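
To make the above concrete, here is a minimal sketch (not the actual training script; numpy/scipy assumed, with toy per-fold data) of how the combined, per-fold, standardized, and NaN-filtered Spearman computations relate:

```python
import numpy as np
from scipy.stats import spearmanr

# Toy stand-in for 5 cross-validation folds: (test labels, model scores) per fold
rng = np.random.default_rng(0)
folds = []
for _ in range(5):
    y_true = rng.normal(size=200)
    y_pred = y_true + rng.normal(scale=0.5, size=200)
    folds.append((y_true, y_pred))

# Spearman_fitness: Spearman computed on all test folds combined
y_true_all = np.concatenate([t for t, _ in folds])
y_pred_all = np.concatenate([p for _, p in folds])
spearman_combined, _ = spearmanr(y_true_all, y_pred_all)

# Std_dev_Spearman_fitness: std deviation of the per-fold Spearmans
# num_obs_Spearman_fitness: total number of test points entering the metric
per_fold = [spearmanr(t, p)[0] for t, p in folds]
std_dev_spearman = np.std(per_fold)
num_obs = len(y_true_all)

# standardized_Spearman_fitness: standard-normalize the test scores of each fold
# separately, then combine and compute the Spearman on the combined set
# (interpreting "test scores" as both labels and predictions; an assumption)
y_true_std = np.concatenate([(t - t.mean()) / t.std() for t, _ in folds])
y_pred_std = np.concatenate([(p - p.mean()) / p.std() for _, p in folds])
spearman_standardized, _ = spearmanr(y_true_std, y_pred_std)

# Multi-property case: mask out mutants with missing (NaN) labels before
# computing the Spearman, otherwise the NaNs propagate into the statistic
mask = ~np.isnan(y_true_all)
spearman_masked, _ = spearmanr(y_true_all[mask], y_pred_all[mask])

print(spearman_combined, std_dev_spearman, num_obs, spearman_standardized, spearman_masked)
```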

Looking forward to seeing what you are building!

Kind regards, Pascal

pascalnotin commented 3 weeks ago

Hi @brycejoh16 -- closing this issue as I believe it is fully addressed by the above, but feel free to re-open if needed. Best, Pascal