How to deal with Amino Acids that are not in the vocabulary

Thanks for ESM!

I am trying to generate language model embeddings with ESM2. However, some of my protein sequences contain amino acids that are not in the model's vocabulary.

Currently, I am replacing them with '-'. After reading this response in another issue (https://github.com/facebookresearch/esm/issues/300#issuecomment-1262447466), I am concerned that this approach might be a bad idea.
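
For concreteness, the replacement described above looks roughly like the sketch below. It is a minimal illustration, not the exact preprocessing code: `STANDARD_AAS` and `replace_unknown` are hypothetical names, and the allowed set is assumed to be just the 20 standard one-letter codes, which may not match the actual ESM2 vocabulary.

```python
# Assumption: only the 20 standard one-letter amino-acid codes are allowed.
# The real ESM2 vocabulary may contain additional tokens; check the model's alphabet.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")


def replace_unknown(seq: str, placeholder: str = "-") -> str:
    """Replace every character outside STANDARD_AAS with `placeholder`."""
    return "".join(aa if aa in STANDARD_AAS else placeholder for aa in seq.upper())


if __name__ == "__main__":
    # '?' and 'J' are not in the assumed allowed set, so both are replaced.
    print(replace_unknown("MKT?LLJ"))
```

The placeholder is a parameter so that a different substitute character can be tried without changing the rest of the pipeline.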

What would be the best way to handle uncommon amino acids that are not in the vocabulary, such as selenomethionine (MSE)?

Thanks!
