Hi @brycejoh16,
Thank you for the kind words!
The scores in this file were computed with the training script available in the ProteinNPT repo. To summarize:
- The `use_validation_set` parameter is not used / set to False in our training script.
- `nan` values may show up when computing Spearman in the multi-property settings, as you may not have all property values assayed for all mutants.

Unfortunately, we do not currently have plans to release these embeddings given the size involved (we store full sequence embeddings rather than mean-pooled ones, as we leverage the token-level granularity in the axial attention of ProteinNPT).
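To illustrate the `nan` point above, here is a minimal sketch of a Spearman computation that skips mutants whose property value was not assayed. It is not the ProteinNPT training script: the helper names (`average_ranks`, `pearson`, `masked_spearman`) and the numbers are hypothetical, and returning the count of retained pairs is only an illustration of why an observation count can be smaller than the fold size. In practice `scipy.stats.spearmanr` would be the usual choice; the standard-library version below just keeps the example self-contained.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns nan if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return math.nan
    return cov / math.sqrt(vx * vy)


def masked_spearman(pred, target):
    """Spearman over the pairs where neither value is nan.

    Returns (rho, number_of_pairs_used). Hypothetical helper for illustration.
    """
    pairs = [(p, t) for p, t in zip(pred, target)
             if not (math.isnan(p) or math.isnan(t))]
    if len(pairs) < 2:
        return math.nan, len(pairs)
    p, t = zip(*pairs)
    # Spearman is the Pearson correlation of the ranks.
    return pearson(average_ranks(p), average_ranks(t)), len(pairs)


# Made-up example: the second mutant has no measurement for this property.
predictions = [0.1, 0.4, 0.35, 0.8, 0.6]
fitness = [0.2, math.nan, 0.3, 0.9, 0.5]
rho, num_obs = masked_spearman(predictions, fitness)
print(rho, num_obs)
```

If the mask is not applied and the `nan` entries are passed straight through, the correlation itself comes out as `nan`, which is one way such values can end up in a results table.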
Looking forward to seeing what you are building!
Kind regards, Pascal
Hi @brycejoh16 -- closing this issue as I believe it is fully addressed by the above, but feel free to re-open if needed. Best, Pascal
Hi ProteinGym team,
Thank you for providing both a supervised and an unsupervised benchmark to the community. This resource makes it 100x easier to benchmark and compare models. The community was in dire need of such a tool.
However, I have a few questions:
https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv
Specifically the columns labeled: Spearman_fitness, Std_dev_Spearman_fitness, num_obs_Spearman_fitness, standardized_Spearman_fitness
My guess is that Spearman_fitness is the mean across the 5 (in two cases 4) splits, computed on the test set only, for a DMS_id that is not manipulated. Std_dev_Spearman_fitness is the standard deviation across those 5 splits. num_obs_Spearman_fitness is the number of observed datapoints for that Spearman correlation, which I assume should always be equal to the number of datapoints in the test fold, correct? I'm confused by this because it looks like you are using Spearman during training to check your validation set, so I just wanted to make sure it wasn't that value either. standardized_Spearman_fitness is when you standardize (values between 0 and 1) the training set for each fold before training, which affects the final Spearman correlation, correct? And the Spearman values reported on proteingym.org and in the ProteinNPT paper are not the standardized values but the non-manipulated ones.
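The aggregation guessed at above can be written out as a short sketch. The per-fold numbers are made up, and the use of the sample standard deviation (n - 1 denominator) is an assumption; which convention the released file uses is not stated in this thread.

```python
import statistics

# Hypothetical per-fold test-set Spearman values for a single DMS_id
# (made-up numbers, one entry per cross-validation split).
fold_spearman = [0.50, 0.54, 0.46, 0.52, 0.48]

# Guessed meaning of Spearman_fitness: mean over the folds.
mean_spearman = statistics.mean(fold_spearman)

# Guessed meaning of Std_dev_Spearman_fitness: spread over the folds.
# statistics.stdev is the sample standard deviation (assumption);
# statistics.pstdev would give the population version instead.
std_spearman = statistics.stdev(fold_spearman)

print(round(mean_spearman, 4), round(std_spearman, 4))
```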
Can you also please comment on whether you are calculating Spearman differently for ProteinNPT? I guess I'm unclear why there would ever be nans in the experimental values.
Take care, Bryce