OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
MIT License
219 stars 20 forks

Predictions for Each Mutant in the Random Cross-Validation Scheme #17

Closed by benjaminalbert 6 months ago

benjaminalbert commented 9 months ago

Thank you very much for providing this amazing resource!

I would appreciate your help:

  1. Would you please direct me to the ProteinNPT predictions for each mutant in the random CV scheme?
  2. Were uncertainty quantifications calculated across the CV schemes? If so, would you please also provide those predictions per mutant?

By the way, the link provided in the README for downloading all the baseline scores on the DMS substitutions is dead, though I'm not sure if this zip would contain the data I'm looking for.

Thank you in advance, Benji

pascalnotin commented 9 months ago

Hi Benji,

Thank you! To answer your questions:

  1. The supervised scoring files at the mutant level are not yet available for download (for now, only zero-shot predictions are available at the mutant level), but we will package them up and release them soon. In the meantime, you may refer to the detailed performance files for supervised baselines here and to the updated ProteinNPT codebase with instructions to reproduce all results here.
  2. Are you asking for the uncertainty quantification estimates used to create Fig.10 of the ProteinNPT paper?
  3. Thank you for flagging the error with the download link. This was an outdated name for one of the scoring files, which we have just updated in the README. All downloads for files listed in the table should now work properly, but please let us know if you run into any issues!

Best, Pascal

benjaminalbert commented 9 months ago

Hi Pascal, thanks for the response. Regarding question 2, I was wondering whether UQ estimates were available at the mutant level along with the predicted scores at the mutant level, not for the protein redesign experiments, but for the 5-fold random CV scheme. For example, when you release mutant-level scores for question 1, will you also provide mutant-level predicted standard deviations from the hybrid UQ approach?

pascalnotin commented 9 months ago

Hi Benji,

We will not be releasing UQ estimates within the detailed scoring files for question 1 as this is not something we have computed for all baselines & DMS assays in ProteinGym. But we should still have the data we used to create Figure 10 from the ProteinNPT paper, which includes UQ estimates for ProteinNPT for the 3 CV schemes, for the different uncertainty schemes and for ~100 assays. Is that something that would be helpful to you?

Best, Pascal

benjaminalbert commented 9 months ago

Yes, that data would be very helpful. I look forward to it, and once again, I appreciate your help.

benjaminalbert commented 6 months ago

Hi @pascalnotin, we were wondering if you could also please provide the data used to generate Figure 2 (performance on multiple mutants), and let us know whether UQ values were calculated for it?

pascalnotin commented 6 months ago

Hi @benjaminalbert -- we did not compute UQ values on multiples, but the data used to generate figure 2 can be found here: https://docs.google.com/spreadsheets/d/1jygsC0CDlxYUY2-YveJ-58yEhMKcwn4JY_D8Toc-5IA/edit?usp=sharing

benjaminalbert commented 6 months ago

Thank you very much, Pascal! Lastly, if you have MSEs and metrics calculated per fold (so that others can compare with statistical tests), we would greatly appreciate it.
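For readers who want to run such a comparison, here is a minimal sketch of a paired t-statistic computed from per-fold metrics. The function name `paired_t_statistic` and the per-fold values are hypothetical, for illustration only; they are not taken from the shared sheet. Keep in mind that CV folds share training data, so this test tends to be anti-conservative and should be read as a rough indication rather than a definitive result.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-fold scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need exactly one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two folds")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample std (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all per-fold differences are identical")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold Spearman values for two models (made-up numbers).
model_a = [0.62, 0.65, 0.60, 0.66, 0.63]
model_b = [0.58, 0.61, 0.59, 0.60, 0.60]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The resulting statistic can be compared against a t-distribution with `dof` degrees of freedom; pairing by fold is what makes per-fold (rather than aggregate) metrics necessary.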

pascalnotin commented 6 months ago

Hi @benjaminalbert -- I just added the per-fold metrics to the same Google Sheet.

benjaminalbert commented 6 months ago

Great, thank you very much!

pascalnotin commented 6 months ago

Hi @benjaminalbert -- quick note on the latest results I shared: by default we standardize target values (zero mean, unit variance) for modeling, which changes the scale of the MSE values we report (it affects all models/baselines we compare against in the same way, though). Spearman performance, being scale-independent, is not affected.
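The effect described in this note can be checked with a small sketch. It assumes targets and predictions are both rescaled with the same mean and (population) standard deviation of the targets; the helper names and the data are hypothetical and not taken from the ProteinNPT code. Under that assumption, MSE shrinks by a factor of the variance, while Spearman correlation, which depends only on ranks, is unchanged.

```python
import math
import statistics


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up assay scores and predictions, in raw units.
y_true = [1.0, 3.0, 2.0, 5.0, 4.0]
y_pred = [1.5, 2.5, 2.0, 4.0, 4.5]

mu = statistics.mean(y_true)
sigma = statistics.pstdev(y_true)  # assumption: population std

z_true = [(y - mu) / sigma for y in y_true]
z_pred = [(y - mu) / sigma for y in y_pred]

mse_raw, mse_std = mse(y_true, y_pred), mse(z_true, z_pred)
rho_raw, rho_std = spearman(y_true, y_pred), spearman(z_true, z_pred)

print(f"MSE raw = {mse_raw:.4f}, standardized = {mse_std:.4f}")
print(f"Spearman raw = {rho_raw:.4f}, standardized = {rho_std:.4f}")
```

To compare against MSE values computed on raw assay scores, multiply the standardized MSE by `sigma ** 2`.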

benjaminalbert commented 6 months ago

Hi @pascalnotin, thanks for the note. I saw in the ProteinNPT repo that the targets are standardized by the mean and std of the 3 training fold targets for each iteration of 5-fold CV.
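As a rough sketch of what that per-iteration standardization might look like (not the actual ProteinNPT implementation): the function `standardize_by_train`, the fold assignment, and the choice of which folds are held out in each iteration are all assumptions made for illustration. The key point is that the mean and std come from the training folds only and are then applied to every target.

```python
import statistics


def standardize_by_train(targets, fold_ids, train_folds):
    """Standardize all targets using mean/std computed on the training folds only."""
    train_values = [t for t, f in zip(targets, fold_ids) if f in train_folds]
    mu = statistics.mean(train_values)
    sigma = statistics.pstdev(train_values)  # assumption: population std
    if sigma == 0:
        raise ValueError("training targets have zero variance")
    return [(t - mu) / sigma for t in targets], mu, sigma


# Made-up targets with a hypothetical 5-fold assignment.
targets = [float(i) for i in range(10)]
fold_ids = [i % 5 for i in range(10)]
n_folds = 5

for test_fold in range(n_folds):
    # Assumption: one test fold and one further held-out fold per iteration,
    # which leaves 3 training folds.
    held_out = {test_fold, (test_fold + 1) % n_folds}
    train_folds = set(range(n_folds)) - held_out
    scaled, mu, sigma = standardize_by_train(targets, fold_ids, train_folds)
    print(f"test fold {test_fold}: train mean = {mu:.3f}, train std = {sigma:.3f}")
```

Because `mu` and `sigma` differ between iterations, an MSE reported on standardized targets is on a slightly different scale in each fold; multiplying by that fold's `sigma ** 2` maps it back to raw units.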