Closed: Leo-T-Zang closed this issue 2 years ago
Even though MLM is the pretraining objective that allows the model to learn meaningful representations, the model does not become perfect at the pretext task (that would correspond to a perplexity of 1, i.e. all probability mass on the correct token). So we do not expect the model to be able to perfectly "decode" embeddings. If a sequence is well understood by the model (most, but not all, of the training data), then we expect every amino acid in the input sequence to also receive high output probability, which corresponds to a low perplexity for that sequence.
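A minimal sketch of how to check this, assuming the fair-esm package; the esm2_t33_650M_UR50D checkpoint and the placeholder sequence are illustrative choices, not something from this thread. It "decodes" by taking the argmax over the LM-head logits at each position and also reports the per-sequence perplexity mentioned above:

```python
import torch
import esm

# Load an ESM-2 checkpoint; any esm2_* checkpoint exposes the same interface.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Placeholder sequence; substitute one of your own proteins here.
data = [("query", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    logits = model(tokens)["logits"]          # (batch, seq_len, vocab)

# Strip the BOS/EOS positions so only real residues are scored (single sequence, no padding).
residue_logits = logits[0, 1:-1]
residue_tokens = tokens[0, 1:-1]

# Per-sequence perplexity: exp of the mean cross-entropy of the true residues.
ce = torch.nn.functional.cross_entropy(residue_logits, residue_tokens)
print("perplexity:", torch.exp(ce).item())

# "Decode" by taking the argmax token at every position and compare with the input.
pred_idx = residue_logits.argmax(dim=-1)
decoded = "".join(alphabet.get_tok(i) for i in pred_idx.tolist())
print("decoded :", decoded)
print("original:", data[0][1])
print("recovered fraction:", (pred_idx == residue_tokens).float().mean().item())
```

For a well-modeled sequence the recovered fraction should be high and the perplexity low, but neither is expected to be perfect, which is exactly the mismatch described in the question below.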
Hi,
I am currently using the ESM2 language model head to decode embeddings. I am using protein sequences from UniRef50 (some zinc-finger proteins). These proteins are presumably in the training set of ESM2, so I believe this should work. However, the decoded results differ somewhat from the original protein sequences.
Is it normal that the language head cannot decode the sequence exactly? What am I supposed to do in this case?