Hi, great work! This provides a very useful benchmark for representation learning of proteins.
However, while testing this dataset with my models, I noticed that some of the provided structures may be incomplete or misnumbered. Of course, this only affects structure-based models.
E.g., the AlphaFold structure corresponding to A0A140D2T1_ZIKV_Sourisseau_2019 appears to be missing a portion of the protein (residues 1-504 are given, but the DMS data contains N729), so some mutations do not correspond to any residue in the structure. The cause appears to be the residue numbering in the structure file: the first amino acid in the structure is in fact the 291st amino acid of the sequence. I don't know if there are other proteins with a similar problem.
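In case it helps to check other entries, here is a minimal sketch of how such a numbering offset could be detected. It assumes the structure is a standard fixed-column PDB file with a single chain and that the structure's residues form one contiguous fragment of the full target sequence. The function names (`read_ca_residues`, `numbering_offset`, `mutation_in_structure`) are my own and not part of the benchmark code.

```python
# Map three-letter residue names to one-letter codes (standard 20 amino acids).
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_ca_residues(pdb_text):
    """Return [(residue_number, one_letter_code), ...] from CA ATOM records."""
    residues = []
    for line in pdb_text.splitlines():
        # Fixed PDB columns: atom name 13-16, residue name 18-20, number 23-26.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        resnum = int(line[22:26])
        if residues and residues[-1][0] == resnum:
            continue  # skip alternate locations of the same residue
        residues.append((resnum, THREE_TO_ONE.get(line[17:20].strip(), "X")))
    return residues


def numbering_offset(full_sequence, residues):
    """Offset to add to structure numbering to get sequence numbering.

    Returns None if the structure's sequence is not found in full_sequence.
    """
    fragment = "".join(aa for _, aa in residues)
    start = full_sequence.find(fragment)  # 0-based index of the first match
    if start < 0:
        return None
    return (start + 1) - residues[0][0]


def mutation_in_structure(mutation, residues, offset):
    """Check that a mutation such as 'N729A' (sequence numbering) hits a
    residue present in the structure with the expected wild-type amino acid."""
    wild_type, position = mutation[0], int(mutation[1:-1])
    by_sequence_position = {num + offset: aa for num, aa in residues}
    return by_sequence_position.get(position) == wild_type
```

If the structure above is numbered from 1 and its first residue is the 291st of the sequence, I would expect `numbering_offset` to return 290 for it, and a non-zero or `None` result on any other entry would flag it for a closer look.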
If you can provide a complete (or correctly numbered) structural dataset, it may help all users to test their models on the same structural data to promote fairness in benchmarking. Thanks!