You could replace such residues with a mask token, or map them to the closest natural amino acid. FYI, there were some occurrences of ambiguous amino acids (X, B, Z) in the UniRef training data.
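A minimal sketch of both options, assuming the input is a list of three-letter residue names (for example, as read from a PDB file). The `PARENT_RESIDUE` table and the `residues_to_sequence` helper are hypothetical names used for illustration, not part of ESM, and the table only covers a few common modified residues. Anything that has no known parent amino acid falls back to the `<mask>` string, on the assumption that your tokenizer treats `<mask>` inside a sequence string as the mask token.

```python
# Standard three-letter to one-letter codes for the 20 natural amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Hypothetical, deliberately incomplete table: modified residue -> closest
# natural amino acid. Extend it for the residues in your own data.
PARENT_RESIDUE = {
    "MSE": "M",  # selenomethionine -> methionine
    "SEP": "S",  # phosphoserine -> serine
    "TPO": "T",  # phosphothreonine -> threonine
    "PTR": "Y",  # phosphotyrosine -> tyrosine
}

# Assumption: the tokenizer recognises this string as its mask token.
MASK_TOKEN = "<mask>"


def residues_to_sequence(residue_names):
    """Build a sequence string, mapping or masking non-standard residues."""
    tokens = []
    for name in residue_names:
        name = name.upper()
        if name in THREE_TO_ONE:
            tokens.append(THREE_TO_ONE[name])    # natural amino acid
        elif name in PARENT_RESIDUE:
            tokens.append(PARENT_RESIDUE[name])  # map to closest natural one
        else:
            tokens.append(MASK_TOKEN)            # unknown: mask it
    return "".join(tokens)


print(residues_to_sequence(["MET", "MSE", "GLY", "XYZ", "SEP"]))
```

Mapping first and masking only as a fallback keeps as much sequence information as possible. Whether mapping or masking gives better embeddings for a given residue is something you would need to check on your own data.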
Thanks for ESM!
I am trying to generate language model embeddings with ESM2. However, some of the protein sequences that I have contain AAs that are not in the vocabulary of the language model.
Currently, I am replacing them with '-'. After reading this response in an issue (https://github.com/facebookresearch/esm/issues/300#issuecomment-1262447466), I think my approach might be a bad idea.
What would be the best way to deal with uncommon amino acids that are not in the vocabulary, such as selenomethionine ('MSE')?
Thanks!