OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
https://proteingym.org/
MIT License

Reference files for DMS data #56

Open ArXiv-sketch opened 4 days ago

ArXiv-sketch commented 4 days ago

Hi, thank you very much for this paper and codebase. Sorry if my question is very basic, but I am curious about the DMS_filename entries referenced in DMS_substitutions.csv. I have tried to locate the exact filenames listed in that CSV, but I cannot find most of the DMS files that would contain the mutated sequences for each referenced protein. For example, I am looking for CAS9_STRP1_Spencer_2017_positive.csv, which I would expect to contain thousands of mutated sequences along with the target sequence and possibly the LFC after positive selection. Where can the files named in the CSV be downloaded?

I am also asking because I want to run the notebook code, but it needs a DMS_reference_file like those .csv files. I know each reference paper provides its data in the supplementary material, but the files there are not named in the same way. It also seems that you apply specific selection criteria, for example positive selection for Cas9, which may not be the only functional criterion reported in the reference papers.
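To make the mismatch concrete, here is a minimal sketch of the check described above: it reads the DMS_filename column from the reference CSV and reports which listed files are absent from a local folder. The function name `find_missing_dms_files` and the folder layout are assumptions for illustration only, not part of the ProteinGym codebase.

```python
import csv
from pathlib import Path


def find_missing_dms_files(reference_csv, dms_dir, column="DMS_filename"):
    """Return filenames listed in `column` of the reference CSV that are not in dms_dir.

    Hypothetical helper: it only assumes the reference CSV has a header row
    containing a column named `column` (DMS_filename, as in the issue above).
    """
    dms_dir = Path(dms_dir)
    with open(reference_csv, newline="") as handle:
        # Skip rows where the filename cell is empty or the column is absent.
        listed = [row[column] for row in csv.DictReader(handle) if row.get(column)]
    # Keep the order of the reference file so the report is easy to compare.
    return [name for name in listed if not (dms_dir / name).is_file()]
```

Calling `find_missing_dms_files("DMS_substitutions.csv", "DMS_files")`, where `DMS_files` is a hypothetical local download folder, would return the entries that still need to be located, such as CAS9_STRP1_Spencer_2017_positive.csv if it is not present.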

ArXiv-sketch commented 4 days ago

I could only find some .csv files from this link, but it seems to be outdated, as the ProteinGym paper includes many more DMS substitution files.