facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

How to deal with Amino Acids that are not in the vocabulary #301

Closed HannesStark closed 1 year ago

HannesStark commented 2 years ago

Thanks for ESM!

I am trying to generate language model embeddings with ESM2. However, some of the protein sequences that I have contain AAs that are not in the vocabulary of the language model.

Currently, I am replacing them with '-'. After reading this response in another issue (https://github.com/facebookresearch/esm/issues/300#issuecomment-1262447466), I now think my approach might be a bad idea.

What would be the best approach to deal with uncommon amino acids that are not in the vocabulary, such as 'MSE' (selenomethionine)?

Thanks!

tomsercu commented 1 year ago

You could replace them with a mask token, or map them to the closest natural amino acid. FYI, there were some occurrences of ambiguous amino acids (X, B, Z) in the UniRef training data.
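A minimal sketch of both suggestions, assuming the input is a list of PDB three-letter residue names. The helper `residues_to_esm_string` and the small `MODIFIED_TO_PARENT` table are hypothetical names introduced for illustration and are not part of the `esm` package; the table lists only a few well-known modified residues and would need extending for real data. Anything that cannot be mapped falls back to the `<mask>` token.

```python
# Standard residues: PDB three-letter name -> one-letter code.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Hypothetical, deliberately incomplete table: modified residue -> closest
# natural amino acid (its parent residue).
MODIFIED_TO_PARENT = {
    "MSE": "MET",  # selenomethionine
    "SEP": "SER",  # phosphoserine
    "TPO": "THR",  # phosphothreonine
    "PTR": "TYR",  # phosphotyrosine
}

MASK_TOKEN = "<mask>"  # assumed to be the mask token of the ESM alphabet


def residues_to_esm_string(residues, fallback=MASK_TOKEN):
    """Build a sequence string from three-letter residue names.

    Modified residues are first mapped to their parent amino acid;
    residues that are still unknown are replaced by `fallback`.
    """
    tokens = []
    for name in residues:
        name = name.strip().upper()
        name = MODIFIED_TO_PARENT.get(name, name)
        tokens.append(THREE_TO_ONE.get(name, fallback))
    return "".join(tokens)


example = residues_to_esm_string(["MET", "MSE", "ALA", "XYZ"])
print(example)
```

The ESM README shows `<mask>` written inline in a sequence string, so the output should be usable with the alphabet's batch converter, but it is worth checking this against your installed version. Passing `fallback="X"` instead would use the ambiguous-residue token mentioned above.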