Closed: Ieremie closed this issue 8 months ago

**Original issue (@Ieremie):**

If protein sequence embeddings from ESM are used, I believe there might be some data leakage in the inverse folding task. Namely, protein structures in the CATH dataset could have appeared in ESM's sequence pretraining data, so the model would effectively be recovering the sequence of a CATH structure that ESM had already seen during pretraining. A related precedent is "Masked Inverse Folding with Sequence Transfer for Protein Representation Learning", where the language model is pretrained on a filtered dataset from which sequences that appear in CATH were removed.
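For illustration, here is a minimal sketch of the kind of filtering that paper describes: dropping every pretraining sequence that matches a CATH sequence. The file names are hypothetical, and a real pipeline would cluster at a sequence-identity threshold (e.g. with MMseqs2) rather than rely on exact matches:

```python
# Sketch: remove CATH sequences from a pretraining FASTA (exact-match proxy).
# Paths are hypothetical; real pipelines deduplicate by sequence identity.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

# Sequences that must not leak into pretraining.
cath_seqs = {seq for _, seq in read_fasta("cath_s40.fasta")}

kept = dropped = 0
with open("pretrain.filtered.fasta", "w") as out:
    for header, seq in read_fasta("pretrain.fasta"):
        if seq in cath_seqs:  # overlaps with the evaluation set
            dropped += 1
            continue
        out.write(f">{header}\n{seq}\n")
        kept += 1

print(f"kept {kept}, dropped {dropped} overlapping sequences")
```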
**Reply:**

Hey Ieremie (@Ieremie),
Thank you so much for your question! Something must have gone wrong with notifications, because I didn't get notified of your issue, so sorry for the late reply.
We did think about this data-leakage concern, and one of our reviewers raised the same question, so here are our thoughts. The comparison below pits ProteinMPNN-CMLM, the base model without the pretrained ESM, against LM-DESIGN, which uses it:
| Model | seq. recovery | scTMScore |
| --- | --- | --- |
| ProteinMPNN-CMLM | 45.6% | 0.79 |
| LM-DESIGN | 52.5% | 0.86 |
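For reference, sequence recovery above is the fraction of residue positions at which the designed sequence matches the native one, and scTMScore is typically the TM-score between the predicted structure of the designed sequence and the native structure. A minimal sketch of the recovery computation (function and variable names are illustrative):

```python
# Sketch: per-target sequence recovery = fraction of positions where the
# designed sequence matches the native sequence (names are illustrative).

def sequence_recovery(native: str, designed: str) -> float:
    assert len(native) == len(designed), "sequences must be aligned/equal length"
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)

print(sequence_recovery("MKTAYIAK", "MKSAYIGK"))  # 0.75
```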
I hope this addresses your concern. Feel free to reach out if you have any further questions or suggestions!