Some questions about the benchmark

Meron-TANG commented 2 months ago

Hi there,

In the ProteinGym substitutions benchmark, I see that three data-splitting methods are evaluated separately. Is this substitutions benchmark trained on the single mutation scanning dataset (~690k)? Or is it trained on the combined single- and multi-point mutation data (~2.7M)? If the data is combined, I can't see how, because the multi-mutation CV data file cv_folds_multiples_substitutions does not seem to contain the fold_random_5, fold_modulo_5 and fold_contiguous_5 fold columns that the single-mutation CV dataset has.

In addition, when building the benchmark, is a separate model trained for each task's data? If so, how should one choose a suitable model for predictions on proteins outside the dataset?

Thanks

brycejoh16 commented 1 month ago

I'm not on the ProteinGym team; however, the paper mentions that fold_random_5, fold_modulo_5 and fold_contiguous_5 are only for singles, i.e. the substitutions_singles folder. These correspond to the Random, Contiguous, and Modulo evaluations in the paper.

Regarding doubles, I'm pretty sure that cv_folds_multiples_substitutions only contains the full DMS datasets that have multiple mutants, but if those datasets also have single mutants, the single mutants are included in substitutions_singles as well. Multiple mutants were evaluated in the Random split only, as Contiguous and Modulo are position-based and would likely be substantially more complicated in this setting.
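To illustrate why the position-based schemes get awkward for multiple mutants, here is a minimal sketch. It is not the ProteinGym implementation: the helper names, the `A24G` / `A24G:K51R` mutant string format, and the exact fold formulas are assumptions made for illustration only. The idea is that a single mutant maps to exactly one position and hence one fold, while a multiple mutant can touch positions that fall into different folds.

```python
N_FOLDS = 5  # matches the "_5" suffix of the fold columns

def parse_position(mutant: str) -> int:
    """Extract the 1-based position from a single substitution such as 'A24G' (assumed format)."""
    return int(mutant[1:-1])

def modulo_fold(position: int, n_folds: int = N_FOLDS) -> int:
    """Hypothetical modulo scheme: the fold is the position modulo the number of folds."""
    return position % n_folds

def contiguous_fold(position: int, seq_len: int, n_folds: int = N_FOLDS) -> int:
    """Hypothetical contiguous scheme: cut the sequence into n_folds equal-length segments."""
    return min((position - 1) * n_folds // seq_len, n_folds - 1)

def folds_touched(multi_mutant: str, seq_len: int) -> dict:
    """Return the set of folds hit by each position of a ':'-separated multiple mutant."""
    positions = [parse_position(m) for m in multi_mutant.split(":")]
    return {
        "modulo": {modulo_fold(p) for p in positions},
        "contiguous": {contiguous_fold(p, seq_len) for p in positions},
    }

# A double mutant on a hypothetical 100-residue protein lands in two folds
# under both schemes, so it has no single obvious fold assignment.
touched = folds_touched("A24G:K51R", seq_len=100)
```

With a random split, each variant is assigned to a fold as a whole, so this ambiguity does not arise.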

Hope that helps!

However, someone on the ProteinGym team may correct me :)

Meron-TANG commented 1 month ago

Thank you @brycejoh16 for your reply; it really helped me understand the single-mutant substitutions benchmark dataset. However, for the multiple-substitutions dataset, is there a benchmark score like the single-mutant one presented on the ProteinGym website?

brycejoh16 commented 1 month ago

I personally haven't seen a benchmark score for this data type (multiple mutants); maybe the ProteinGym team is keeping that internal for now?

pascalnotin commented 2 days ago

Hi @Meron-TANG,

Thank you for the question! To clarify the construction of the various CV schemes:

CV folds - substitutions - singles (https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip): these files provide the 3 CV schemes (random, modulo, contiguous) for all 217 assays in the ProteinGym substitution benchmark. As @brycejoh16 already pointed out, these only include single amino acid substitutions (i.e., if an assay contains multiple mutants, these mutants are removed), given the practical challenges of assigning multiple mutants to folds in the modulo and contiguous schemes.
CV folds - substitutions - multiples (https://marks.hms.harvard.edu/proteingym/cv_folds_multiples_substitutions.zip): these files provide the random CV scheme for all assays in the ProteinGym substitution benchmark with at least one multiple mutant. Note that we keep all mutants in these files (singles and multiples), with the mutational depth being provided in the mutation_depth field. While there is no performance benchmark on these files yet (the ProteinGym paper focused on comparing performance across the 3 CV schemes on single mutants across all 217 assays), we had one analysis on a subset of these assays in the ProteinNPT paper (see section 4.4 and Figure 2). That analysis was run on ProteinGym v0.1, though, which did not have as many assays with multiple mutants as we have now.
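
For readers who want to work with the multiples files, the following is a rough sketch of iterating over a random CV scheme and checking the mix of mutational depths. Only the `mutation_depth` field is taken from the description above; the toy rows, the `mutant` and `DMS_score` columns, and the fold column name `fold` are placeholders, so check the header of the actual file and pass the real fold column name.

```python
import csv
import io
from collections import Counter

# Made-up toy data standing in for one assay file; not real ProteinGym values.
TOY_CSV = """mutant,DMS_score,mutation_depth,fold
A2G,0.10,1,0
K5R,0.40,1,1
A2G:K5R,0.55,2,2
L7V,-0.20,1,3
A2G:L7V,0.05,2,4
K5R:L7V,0.30,2,0
"""

def load_rows(text: str) -> list:
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def cv_splits(rows: list, fold_col: str):
    """Yield (held_out_fold, train_rows, test_rows) for each fold id found in fold_col."""
    fold_ids = sorted({int(r[fold_col]) for r in rows})
    for held_out in fold_ids:
        train = [r for r in rows if int(r[fold_col]) != held_out]
        test = [r for r in rows if int(r[fold_col]) == held_out]
        yield held_out, train, test

def depth_counts(rows: list) -> Counter:
    """Count variants per mutational depth (1 = single, 2 = double, ...)."""
    return Counter(int(r["mutation_depth"]) for r in rows)

rows = load_rows(TOY_CSV)
splits = list(cv_splits(rows, fold_col="fold"))  # "fold" is a placeholder name
for held_out, train, test in splits:
    print(held_out, len(train), len(test), dict(depth_counts(test)))
```

Reporting metrics separately per `mutation_depth` on the held-out fold is one way to see how a model handles singles versus multiples within the same assay.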

Thanks a lot to @brycejoh16 for his (much quicker) response to the above!

Kind regards, Pascal

OATML-Markslab / ProteinGym

Some questions about the benchmark #32