Some AlphaFold structures are not full-length

Hi, great work! This provides a very useful benchmark for representation learning of proteins.

However, while testing this dataset with my models, I noticed that some of the provided structures appear to be incomplete. Of course, this is likely to cause confusion only for structure-based models.

For example, the AlphaFold structure for A0A140D2T1_ZIKV_Sourisseau_2019 appears to cover only part of the protein: its residues are numbered 1-504, but the DMS data contains mutations at positions such as N729, so some mutations do not correspond to any residue in the structure. The mismatch seems to come from the residue numbering in the structure file: the first residue in the structure is actually the 291st amino acid of the sequence. I don't know whether other proteins have a similar problem.
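To make the check concrete, here is a minimal sketch of how one might detect such an offset and map DMS positions onto structure numbering. The function names, the toy sequences, and the substitution `N729A` are hypothetical and only for illustration; it assumes the structure's sequence is an exact contiguous substring of the full-length sequence, which may not hold for every entry.

```python
def find_offset(full_seq, struct_seq):
    """Return the 0-based index where struct_seq starts in full_seq, or None.

    Assumes the structure covers one exact, contiguous stretch of the sequence.
    """
    idx = full_seq.find(struct_seq)
    return idx if idx >= 0 else None


def mutation_to_structure_pos(mutant, offset, struct_len):
    """Map a mutation such as 'N729A' (1-based, full-sequence numbering)
    to a 1-based position in a structure numbered from 1.

    Returns None if the mutated position lies outside the structure.
    """
    pos = int(mutant[1:-1])      # strip wild-type and mutant letters
    struct_pos = pos - offset    # shift into structure numbering
    return struct_pos if 1 <= struct_pos <= struct_len else None


# Toy example (hypothetical sequences): the structure starts at the 4th residue.
full_seq = "MKTAYIAKQR"
struct_seq = "AYIAK"
offset = find_offset(full_seq, struct_seq)
print(offset)                                                     # 3
print(mutation_to_structure_pos("A4G", offset, len(struct_seq)))  # 1
print(mutation_to_structure_pos("K2A", offset, len(struct_seq)))  # None

# Numbers from this issue: structure residue 1 is sequence residue 291,
# so the offset is 290 and the structure has 504 residues.
# "N729A" is a hypothetical substitution at the reported position N729.
print(mutation_to_structure_pos("N729A", 290, 504))               # 439
```

Under these assumptions, the structure would cover sequence positions 291-794, so any DMS mutation outside that range has no matching residue.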

If you could provide a complete (or correctly numbered) set of structures, all users could test their models on the same structural data, which would make benchmarking fairer. Thanks!
