OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
MIT License

Possible inconsistencies with DMS ID, DOI, and selection type #26

Open agitter opened 4 months ago

agitter commented 4 months ago

Thanks for the excellent resource and for making all the data so easily accessible. While combing through the CSV files, we noticed a few possible inconsistencies that I wanted to ask about.

DMS ID

In reference_files/DMS_substitutions.csv, datasets from the mega-scale stability experiment are named after Tsuboyama, e.g. ARGR_ECOLI_Tsuboyama_2023_1AOY. That is also the convention in benchmarks/DMS_zero_shot/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv. However, in benchmarks/DMS_supervised/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv the same datasets are named after Rocklin, e.g. ARGR_ECOLI_Rocklin_2023_1AOY.
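For anyone who wants to reproduce this check, here is a minimal sketch of how the ID sets of two files can be compared. It assumes both CSVs expose the ID in a column named DMS_id (an assumption, adjust as needed), and it uses tiny in-memory stand-ins rather than the real files; EXAMPLE_PROT_Author_2020 is a made-up ID.

```python
import csv
import io


def read_ids(handle, column="DMS_id"):
    """Return the set of values found in `column` of an open CSV handle."""
    return {row[column] for row in csv.DictReader(handle)}


def compare_ids(ids_a, ids_b):
    """Return (only_in_a, only_in_b) as sorted lists."""
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a)


# In-memory stand-ins for the reference file and the supervised benchmark file.
# The column name and the EXAMPLE_PROT row are assumptions for illustration.
reference_csv = io.StringIO(
    "DMS_id\n"
    "ARGR_ECOLI_Tsuboyama_2023_1AOY\n"
    "EXAMPLE_PROT_Author_2020\n"
)
supervised_csv = io.StringIO(
    "DMS_id\n"
    "ARGR_ECOLI_Rocklin_2023_1AOY\n"
    "EXAMPLE_PROT_Author_2020\n"
)

only_in_reference, only_in_supervised = compare_ids(
    read_ids(reference_csv), read_ids(supervised_csv)
)
print("only in reference: ", only_in_reference)
print("only in supervised:", only_in_supervised)
```

With the real files, replace the StringIO objects with open(path, newline="") handles.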

DOI

The same mega-scale study appears to have multiple journal DOIs listed in the jo column of reference_files/DMS_substitutions.csv. The first, 10.1038/s41586-023-06328-6, is correct, but the later entries incorrectly increment the final digit, e.g. 10.1038/s41586-023-06328-7, 10.1038/s41586-023-06328-8.
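One way to spot this kind of drift is to group DOIs by study and flag any study with more than one distinct DOI. The sketch below is only illustrative: the column names DMS_id and doi are assumptions, the study key is guessed from the Author_Year part of the ID, and every row other than the two DOIs quoted above is made up.

```python
import csv
import io
import re
from collections import defaultdict


def study_key(dms_id):
    """Guess a study key from the first '_Author_YYYY' pattern in the ID."""
    match = re.search(r"_([A-Za-z]+)_(\d{4})", dms_id)
    return f"{match.group(1)}_{match.group(2)}" if match else dms_id


def dois_per_study(handle, id_column="DMS_id", doi_column="doi"):
    """Map each study key to the set of DOIs listed for it."""
    groups = defaultdict(set)
    for row in csv.DictReader(handle):
        groups[study_key(row[id_column])].add(row[doi_column].strip())
    return groups


def conflicting(groups):
    """Keep only the studies that have more than one distinct DOI."""
    return {key: sorted(dois) for key, dois in groups.items() if len(dois) > 1}


# In-memory stand-in; column names and the EXAMPLE_/OTHER_ rows are hypothetical.
reference_csv = io.StringIO(
    "DMS_id,doi\n"
    "ARGR_ECOLI_Tsuboyama_2023_1AOY,10.1038/s41586-023-06328-6\n"
    "EXAMPLE_PROT_Tsuboyama_2023_XXXX,10.1038/s41586-023-06328-7\n"
    "OTHER_PROT_Author_2020,10.0000/example\n"
)

conflicts = conflicting(dois_per_study(reference_csv))
for key, dois in conflicts.items():
    print(key, dois)
```

A flagged study is not necessarily an error, since one first author can publish several papers in a year, so the output is a list of candidates to inspect by hand.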

Selection type

In https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip the targets column has the value fitness or fitness_unsupervised_prediction for all rows. Some of these assays have other selection types in reference_files/DMS_substitutions.csv.

pascalnotin commented 4 months ago

Hi Anthony - thank you very much for flagging all of these, we will fix them all in the next update!

brycejoh16 commented 3 months ago

Hi @pascalnotin ,

I ended up manually building a mapping between the DMS_id's in the scoring file, https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv, and the DMS_id's used to represent the sequences in the cross-validation splits, https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip

I made the mapping by manually inspecting which DMS_id in one file appeared to match which DMS_id in the other. Please check that each DMS_id in the scoring file corresponds to the correct DMS_id in the splits zip file.

Although both files contain 217 unique DMS_id's, two DMS_id's in the scoring file had no obvious counterpart among the DMS_id's in the splits zip file.

Thanks again for providing and maintaining ProteinGym. It is a great reference and a great way to discover new DMS datasets and explore models.

Here is the file of the 88 DMS_id's that are in the scoring file but not in the cross-validation splits zip file, along with my best guess at the mapping for each one. If you use it, please verify that the mappings are correct!

missed_dms_ids.csv
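For anyone who wants to cross-check a manual mapping like this, a fuzzy string match can propose candidates automatically. This is only a sketch and not how the attached file was produced: it uses difflib from the Python standard library, the IDs below are small stand-ins (EXAMPLE_PROT_Author_2020 is made up), and the 0.6 cutoff is an arbitrary assumption.

```python
import difflib


def suggest_mapping(unmatched_ids, candidate_ids, cutoff=0.6):
    """For each unmatched ID, propose the most similar candidate ID, or None."""
    suggestions = {}
    for dms_id in unmatched_ids:
        close = difflib.get_close_matches(dms_id, candidate_ids, n=1, cutoff=cutoff)
        suggestions[dms_id] = close[0] if close else None
    return suggestions


# Stand-in ID lists; with real data these would come from the scoring file
# and from the file names or contents of the splits zip.
scoring_only = ["ARGR_ECOLI_Rocklin_2023_1AOY"]
split_ids = ["ARGR_ECOLI_Tsuboyama_2023_1AOY", "EXAMPLE_PROT_Author_2020"]

suggestions = suggest_mapping(scoring_only, split_ids)
for source, target in suggestions.items():
    print(source, "->", target)
```

IDs that come back as None, or whose suggestion looks wrong, still need manual inspection, so this complements rather than replaces the hand-made mapping.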

BarKetPlace commented 2 months ago

Hi, I am adding a minor question to this thread: in reference_files/DMS_substitutions.csv, KCNJ2_MOUSE is listed as Human/Homo sapiens. Is this correct?