facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

ESM2 Language Head can not correctly decode Embeddings #313

Closed Leo-T-Zang closed 1 year ago

Leo-T-Zang commented 1 year ago

Hi,

I am currently using the ESM2 Language Head to decode embeddings. I use protein sequences from UniRef50 (some zinc-finger proteins). These proteins are presumably in the training dataset of ESM2, so I believe it should work. However, the decoded results differ somewhat from the original protein sequences.

Is it normal that the language head cannot decode correctly? What am I supposed to do in this case?
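For reference, a minimal sketch of this kind of argmax "decoding" through the LM head, assuming the fair-esm API and the esm2_t33_650M_UR50D checkpoint (the exact model and sequences used here are not stated):

```python
# Sketch only: model choice and example sequence are assumptions, not from this issue.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
_, _, tokens = batch_converter([("query", seq)])

with torch.no_grad():
    logits = model(tokens)["logits"]          # (1, L + 2, vocab): BOS + residues + EOS

pred_ids = logits.argmax(dim=-1)[0, 1:len(seq) + 1]   # keep only the residue positions
decoded = "".join(alphabet.get_tok(i) for i in pred_ids.tolist())

mismatches = sum(a != b for a, b in zip(seq, decoded))
print(decoded)
print(f"{mismatches}/{len(seq)} positions differ from the input")
```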

tomsercu commented 1 year ago

Even though MLM is the pretraining objective that allows the model to learn meaningful representations, the model does not become perfect at the pretext task (that would correspond to a perplexity of 1, i.e. all probability mass on the correct token). So we do not expect the model to be able to perfectly "decode" embeddings. If a sequence is well understood by the model (most, but not all, of the training data), then we expect every amino acid in the input sequence to also have high output probability, which corresponds to a low perplexity for that sequence.
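A short sketch of one way to measure this (again assuming the fair-esm API, the esm2_t33_650M_UR50D checkpoint, and an arbitrary example sequence): score each input residue under the unmasked LM-head distribution and turn the mean negative log-likelihood into a perplexity.

```python
# Sketch: per-sequence perplexity from the unmasked LM-head probabilities.
# Model choice and example sequence are assumptions, not from this issue.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
_, _, tokens = batch_converter([("query", seq)])

with torch.no_grad():
    log_probs = torch.log_softmax(model(tokens)["logits"], dim=-1)  # (1, L + 2, vocab)

positions = torch.arange(1, len(seq) + 1)          # skip the BOS and EOS tokens
true_ids = tokens[0, positions]
token_ll = log_probs[0, positions, true_ids]       # log P(true residue at each position)

perplexity = torch.exp(-token_ll.mean()).item()
print(f"sequence perplexity: {perplexity:.3f}  (1.0 would mean a perfect 'decode')")
```

Masking each position before scoring it (a pseudo-perplexity) is stricter but needs one forward pass per residue; the unmasked score above corresponds to the "output probability of the input tokens" described in the comment.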