BytedProtein / ByProt

ESM data leakage #3

Closed · Ieremie closed this 8 months ago

Ieremie commented 1 year ago

If protein sequence embeddings from ESM are used, I believe there might be some data leakage in the inverse folding prediction. Namely, the sequences of protein structures in the CATH dataset could have appeared in ESM's sequence pretraining dataset.

In that case, the model would effectively be recovering the sequence of a CATH structure that ESM had already seen during pretraining.

In "Masked Inverse Folding with Sequence Transfer for Protein Representation Learning", the authors pretrain the language model on a filtered dataset from which sequences that appear in the CATH dataset have been removed.

zhengzx-nlp commented 11 months ago

Hey Ieremie (@Ieremie),

Thank you so much for your question! Something must have gone wrong with notifications, because I never got notified of your issue, so sorry for the late reply.

We did consider this concern about data leakage, and one of our reviewers raised the same question. Here are our thoughts:

  1. A protein sequence language model such as ESM cannot, on its own, design sequences for given structures, regardless of whether those proteins are part of the pLM's training data. A workaround for using pLMs in structure-based sequence design is to "revise" sequences predicted by an existing inverse folding model (e.g., ProteinMPNN), which is what we did in our preliminary proof-of-concept study. The gains there were moderate (~49% -> ~50% sequence recovery), even though nearly half (i.e., 49%) of the input residues were already identical to the native ones. Our approach, by contrast, endows a pLM itself with the capability of structure-based sequence design. This means that even for the structure of a native protein whose sequence appears in the sequence pretraining data, we argue it is entirely expected, and not evidence of leakage, for the model to predict sequences correlated with that native sequence.
  2. To further address the concern of memorization and verify that our model generalizes, we conducted experiments on a set of recently released proteins from the Protein Data Bank (PDB). Specifically, we curated a dataset of 80 proteins whose sequences were released in 2022, i.e., after the 2019 release of ESM-1b. On this set, our approach achieves 52.5% sequence recovery with an scTM score of 0.86, versus 45.6% and 0.79 for ProteinMPNN, verifying our model's ability to design sequences for new protein structures that were not present in its training data (see the table below, followed by a small sketch of the sequence-recovery metric).
| Model | Seq. recovery | scTM score |
| --- | --- | --- |
| ProteinMPNN-CMLM | 45.6% | 0.79 |
| LM-DESIGN | 52.5% | 0.86 |
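In case it helps, "sequence recovery" here simply means the fraction of positions at which the designed sequence matches the native one, averaged over the test set. A minimal sketch (illustrative only, not our actual evaluation code):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions at which the designed sequence matches the native one."""
    assert len(designed) == len(native), "sequences must have the same length"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def mean_recovery(pairs):
    """Average per-protein recovery over (designed, native) sequence pairs."""
    return sum(sequence_recovery(d, n) for d, n in pairs) / len(pairs)
```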

Hopefully this addresses your concern. Feel free to reach out if you have any further questions or suggestions!