Open agitter opened 4 months ago
Hi Anthony - thank you very much for flagging all of these, we will fix them all in the next update!
Hi @pascalnotin ,
I ended up manually mapping the DMS_id's in the scoring file: https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv
to the DMS_id's that are used to represent the sequences in the cross-validation splits: https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip
I made the mapping by manually inspecting which DMS_id in one file appeared to correspond to which DMS_id in the other. Please check that each DMS_id in the scoring file corresponds to the correct DMS_id in the splits zip file.
Despite both files containing 217 unique DMS_id's, two DMS_id's in the scoring file had no obvious counterpart among the DMS_id's in the splits zip file.
Again, thanks for providing and maintaining ProteinGym. It is a great reference and a good way to discover new DMS datasets and explore models.
Here is the file of the 88 DMS_id's that are in the scoring file but not in the cross-validation splits zip file, along with my best guess at the mapping for each one. If you use it, please verify that the mappings are correct!
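A minimal sketch of the kind of set comparison described above, assuming the DMS_id values from each file have already been loaded into Python lists. The function name and the toy IDs below are hypothetical and are not taken from the ProteinGym code or from the actual file contents.

```python
def compare_dms_ids(scoring_ids, split_ids):
    """Return (IDs only in the scoring file, IDs only in the splits)."""
    scoring = set(scoring_ids)
    splits = set(split_ids)
    # Sort so the output is stable and easy to review by eye.
    return sorted(scoring - splits), sorted(splits - scoring)


# Toy data for illustration only (not real file contents).
scoring_ids = ["PROT_A_Author_2020", "PROT_B_Tsuboyama_2023_XXXX"]
split_ids = ["PROT_A_Author_2020", "PROT_B_Rocklin_2023_XXXX"]

only_scoring, only_splits = compare_dms_ids(scoring_ids, split_ids)
print("Only in scoring file:", only_scoring)
print("Only in splits:", only_splits)
```

The two unmatched lists can then be paired up by hand, which is the step that still needs verification.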
Hi, I am adding a minor question in this thread:
In reference_files/DMS_substitutions.csv, KCNJ2_MOUSE is listed as Human/Homo Sapiens. Is this correct?
Thanks for the excellent resource and making all the data so easily accessible. While combing through the csv files, we noticed a few possible inconsistencies I wanted to ask about.
DMS ID
In reference_files/DMS_substitutions.csv, datasets from the mega-scale stability experiment are named after Tsuboyama, e.g. ARGR_ECOLI_Tsuboyama_2023_1AOY. That is also the convention in benchmarks/DMS_zero_shot/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv. However, in benchmarks/DMS_supervised/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv they are named after Rocklin, e.g. ARGR_ECOLI_Rocklin_2023_1AOY.
DOI
The same mega-scale study appears to have multiple journal DOIs listed in the jo column of reference_files/DMS_substitutions.csv. The first, 10.1038/s41586-023-06328-6, is correct, but the following ones incorrectly increment the final digit, e.g. 10.1038/s41586-023-06328-7, 10.1038/s41586-023-06328-8.
Selection type
In https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip the targets column has the value fitness or fitness_unsupervised_prediction for all rows. Some of these assays have other selection types in reference_files/DMS_substitutions.csv.
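The DOI inconsistency described above could be checked mechanically. The sketch below groups rows by the Author_Year token embedded in each DMS ID and reports any study that has more than one distinct DOI. It assumes the ID column is called DMS_id and that the DOI lives in the jo column; the function name and the placeholder rows are hypothetical, and the regular expression only handles IDs shaped like ARGR_ECOLI_Tsuboyama_2023_1AOY.

```python
import re
from collections import defaultdict


def studies_with_multiple_dois(rows, id_col="DMS_id", doi_col="jo"):
    """Map each Author_Year study token to its DOIs when there is more than one."""
    dois = defaultdict(set)
    for row in rows:
        # Pull out e.g. "Tsuboyama_2023" from "ARGR_ECOLI_Tsuboyama_2023_1AOY".
        match = re.search(r"_([A-Za-z]+_\d{4})(?:_|$)", row[id_col])
        if match:
            dois[match.group(1)].add(row[doi_col])
    return {study: sorted(d) for study, d in dois.items() if len(d) > 1}


# Placeholder rows for illustration; csv.DictReader yields rows of this shape.
rows = [
    {"DMS_id": "ARGR_ECOLI_Tsuboyama_2023_1AOY", "jo": "10.1038/s41586-023-06328-6"},
    {"DMS_id": "EXAMPLE_PROT_Tsuboyama_2023_0000", "jo": "10.1038/s41586-023-06328-7"},
    {"DMS_id": "OTHER_PROT_Author_2020", "jo": "10.0000/placeholder"},
]

flagged = studies_with_multiple_dois(rows)
print(flagged)
```

Any study that shows up in the result would be worth a manual look, since a single publication should normally carry a single DOI.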