BytedProtein / ByProt

ESM data leakage #3

Closed · Ieremie closed this 8 months ago

Ieremie commented 1 year ago

If protein sequence embeddings from ESM are used, I believe there might be some data leakage in the inverse folding prediction. Namely, the sequences of protein structures in the CATH dataset could have appeared in ESM's sequence pretraining dataset.

In that case, the model would effectively be recovering the sequence of a CATH structure that ESM had already seen during pretraining.

In "Masked Inverse Folding with Sequence Transfer for Protein Representation Learning", the authors pretrain the language model on a filtered dataset from which sequences that appear in the CATH dataset have been removed.

zhengzx-nlp commented 11 months ago

Hey Ieremie (@Ieremie),

Thank you so much for your question! Something must have gone wrong with notifications, because I never got notified of your issue, so sorry for the late reply.

We did consider this concern about data leakage, and one of our reviewers raised the same question. Here are our thoughts:

  1. A protein sequence language model such as ESM cannot, on its own, design sequences for given structures, regardless of whether those proteins are part of the pLM's training data. A workaround for using pLMs in structure-based sequence design is to "revise" sequences predicted by an existing inverse folding model (e.g., ProteinMPNN), which is what we did in our preliminary proof-of-concept study. The gains there were moderate (~49% -> ~50% sequence recovery), even though nearly half (i.e., 49%) of the input residues were already identical to the native ones. Our approach, by contrast, endows a pLM itself with the capability of structure-based sequence design. This means that even for the structure of a native protein whose sequence appears in the sequence pretraining data, we argue it is entirely expected, and not evidence of leakage, for the model to predict sequences correlated with that native sequence.
  2. To further address the concern of memorization and verify that our model generalizes, we conducted experiments on a set of recently released proteins from the Protein Data Bank (PDB). Specifically, we curated a dataset of 80 proteins whose sequences were released in 2022, i.e., after the 2019 release of ESM-1b. On this set, our approach achieves 52.5% sequence recovery with an scTM score of 0.86, versus 45.6% and 0.79 for ProteinMPNN, verifying our model's ability to design sequences for new protein structures that were not present in its training data (see the table below, followed by a small sketch of the sequence-recovery metric).
| Model | Seq. recovery | scTM score |
| --- | --- | --- |
| ProteinMPNN-CMLM | 45.6% | 0.79 |
| LM-DESIGN | 52.5% | 0.86 |
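In case it helps, "sequence recovery" here simply means the fraction of positions at which the designed sequence matches the native one, averaged over the test set. A minimal sketch (illustrative only, not our actual evaluation code):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions at which the designed sequence matches the native one."""
    assert len(designed) == len(native), "sequences must have the same length"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def mean_recovery(pairs):
    """Average per-protein recovery over (designed, native) sequence pairs."""
    return sum(sequence_recovery(d, n) for d, n in pairs) / len(pairs)
```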

Hopefully this addresses your concern. Feel free to reach out if you have any further questions or suggestions!