OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
https://proteingym.org/
MIT License

How are the DMS scores calculated and normalized? #52

Closed Joey-Xue closed 1 week ago

Joey-Xue commented 1 week ago

Thanks for providing the benchmark. I was recently trying to benchmark some models on the dataset and found that the DMS scores vary widely in scale, which matters for regression. For most of the assays, the DMS scores range from around -5 to 5, but for some assays, such as Q6WV13_9MAXI and D7PM05_CLYGR, they range from 589 to 40000, which is clearly not on the same scale. For B2L11_HUMAN, the scores range from 2640756.73 to 100215199.65. It seems that the score is not a normalized metric. I checked the GitHub repository and the ProteinGym paper but didn't find how the score is defined. Could you provide a clearer definition of the score? Thanks in advance for any help.

pascalnotin commented 1 week ago

Hi @Joey-Xue -- thanks for your question! We do not rescale the experimental scores on purpose, as different groups may have different views on the best approach to normalize them depending on their objectives. So the scale of the DMS score is identical to the scale of the measured phenotype in the original paper it was obtained from (see the "raw_DMS_phenotype_name" column in the reference file). What we do, however, is correct the sign of the scores as needed, such that a higher DMS_score always corresponds to "higher fitness". In prior work, when developing (semi-)supervised models on the DMS data (e.g., in ProteinNPT), we found it helpful to standard normalize the scores (zero mean, unit variance) before training. Best, Pascal
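For readers who want to try the standard normalization mentioned above, here is a minimal sketch that z-scores each assay separately. It is an illustration under stated assumptions, not code from the ProteinGym repository: the assay names and score values are hypothetical, and `standardize` is a helper made up for this example.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Return z-scores (zero mean, unit variance) for one assay's scores."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        # All scores identical: nothing to scale, so return zeros.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Hypothetical assays whose raw scores live on very different scales.
assays = {
    "assay_small_scale": [-5.0, 0.0, 5.0],
    "assay_large_scale": [589.0, 20294.5, 40000.0],
}

# Normalize each assay independently so they become comparable.
normalized = {name: standardize(vals) for name, vals in assays.items()}

for name, vals in normalized.items():
    print(name, [round(v, 3) for v in vals])
```

Because the statistics are computed per assay, both hypothetical assays end up on the same unitless scale, while the ordering of variants within each assay is unchanged.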

Joey-Xue commented 1 week ago

Hello Pascal, thanks for your quick and comprehensive reply! The reference file provides exactly the information I need! For the benchmark on the ProteinGym website, I wonder how the zero-shot regression was performed, given that the DMS scores are on different scales? One more question: the download link for the full ProteinGym substitution dataset is not accessible right now: https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_substitutions.zip Since I want to download the newest version of the dataset, would you mind checking when the download will be available again? Thanks again for your kind support.

pascalnotin commented 1 week ago

Hi @Joey-Xue,

For the first question, we standard normalize the targets in the regression setting, as done here.
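As a rough illustration of standard-normalizing regression targets, the sketch below computes the mean and standard deviation on a training split and reuses them for held-out data. This is one common convention and is an assumption here; it is not necessarily what the linked ProteinGym code does. The helper names `fit_scaler`, `transform`, and `inverse_transform` and all values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_targets):
    """Compute mean and standard deviation from training targets only."""
    mu = mean(train_targets)
    sigma = pstdev(train_targets)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant targets
    return mu, sigma


def transform(targets, mu, sigma):
    """Map raw targets to standardized targets."""
    return [(t - mu) / sigma for t in targets]


def inverse_transform(z_targets, mu, sigma):
    """Map standardized values back to the raw assay scale."""
    return [z * sigma + mu for z in z_targets]


# Hypothetical DMS scores for a single assay, split into train and test.
train_scores = [1.0, 2.0, 3.0, 4.0]
test_scores = [2.5, 5.0]

mu, sigma = fit_scaler(train_scores)
z_train = transform(train_scores, mu, sigma)
z_test = transform(test_scores, mu, sigma)    # reuses the training statistics
recovered = inverse_transform(z_test, mu, sigma)
```

Fitting the statistics on the training split only keeps information about held-out scores out of the model, and `inverse_transform` lets predictions be reported on the original assay scale.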

For the second question, we recently added versioning to the benchmarks and updated the download links as described in the revised README here.

Joey-Xue commented 1 week ago

Hi Pascal, thanks for your kind reply! That was pretty helpful.