OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
MIT License

Partial AlphaFold structures are non-full-length #29

Closed Wang-Lin-boop closed 4 months ago

Wang-Lin-boop commented 5 months ago

Hi, great work! This provides a very useful benchmark for representation learning of proteins.

However, while testing my models on this dataset, I noticed that some of the provided structures appear to be incomplete. Of course, this is likely to be confusing only for structure-based models.

E.g., the AlphaFold structure for A0A140D2T1_ZIKV_Sourisseau_2019 appears to cover only part of the protein: its residues are numbered 1-504, but the DMS data contains mutations at positions such as N729, so some mutations do not correspond to any residue in the structure. This looks like a residue-numbering error in the structure file: the first amino acid in the structure is actually the 291st amino acid of the sequence. I don't know whether other proteins have a similar problem.
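For anyone hitting the same mismatch, here is a minimal sketch of the renumbering I have in mind. It assumes a constant offset of 290 (structure residue 1 = sequence residue 291) and that the structure holds exactly 504 residues, as described above. The function name `to_structure_index` and the example mutation strings (e.g. `N729A`, where the mutant amino acid is made up) are only illustrative and are not part of ProteinGym. It also assumes single substitutions written as wild type, position, mutant.

```python
import re

# Assumption: structure residue 1 corresponds to sequence residue 291,
# so sequence position = structure position + 290.
OFFSET = 290
# Assumption: the provided structure contains residues numbered 1-504.
STRUCTURE_LENGTH = 504

# Single substitution written as <wild type><position><mutant>, e.g. "N729A".
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def to_structure_index(mutant, offset=OFFSET, length=STRUCTURE_LENGTH):
    """Map a DMS substitution to its residue number in the structure file.

    Returns None when the mutated position lies outside the structure.
    (Hypothetical helper for illustration only.)
    """
    match = MUTATION_RE.match(mutant)
    if match is None:
        raise ValueError(f"Unrecognized mutation string: {mutant!r}")
    position = int(match.group(2))
    index = position - offset
    if 1 <= index <= length:
        return index
    return None


if __name__ == "__main__":
    for mutant in ["N729A", "A291C", "M1A"]:
        print(mutant, "->", to_structure_index(mutant))
```

Under these assumptions, position 729 in the DMS data would map to residue 439 in the structure, and any mutation before sequence position 291 would have no structural counterpart.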

If you can provide a complete (or correctly numbered) structural dataset, it may help all users to test their models on the same structural data to promote fairness in benchmarking. Thanks!

pascalnotin commented 4 months ago

Dear @Wang-Lin-boop,

Thank you for the kind words! Please refer to the following issue, which covers the same problem: https://github.com/OATML-Markslab/ProteinGym/issues/18#issuecomment-1879907634

Best, Pascal