facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License
2.97k stars 586 forks

How to input a cropped protein for ESM-2? #651

Open GriffithLin opened 5 months ago

GriffithLin commented 5 months ago

Hi! I have a problem when using ESM-2 to embed long protein sequences. A long protein sequence needs to be cropped to a length of less than 1024, and BOS and EOS tokens are used to signal the beginning and end of a real protein. My question is: how do I input a cropped sequence that contains only a BOS token, only an EOS token, or neither? Thanks in advance.

amgcasueshavoc commented 5 months ago

You do not always need BOS and EOS tokens, even when the model has no transformer decoder. However, if you are fine-tuning ESM-2 for a specific downstream task in which you intend to use BOS and EOS tokens, then you should include them as special tokens.
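One possible convention for cropped inputs, sketched below, is to attach BOS only to the crop that contains the true start of the protein and EOS only to the crop that contains the true end, so interior crops carry neither. This is an illustrative assumption, not something confirmed in this thread. The helper `crop_with_boundary_tokens` and its `max_residues` default of 1022 (chosen so that residues plus two special tokens fit in 1024 positions) are hypothetical. The sketch works on token strings only, using the ESM-2 special-token names `<cls>` and `<eos>`; converting tokens to indices and batching is left to the model's alphabet.

```python
from typing import List

BOS = "<cls>"  # ESM-2's beginning-of-sequence token name
EOS = "<eos>"  # ESM-2's end-of-sequence token name


def crop_with_boundary_tokens(sequence: str, max_residues: int = 1022) -> List[List[str]]:
    """Split `sequence` into consecutive, non-overlapping crops.

    Each crop holds at most `max_residues` residues. BOS is added only to
    the crop containing the true start of the protein, and EOS only to the
    crop containing the true end. (Hypothetical helper; the default of
    1022 is an assumption, not a value taken from the thread.)
    """
    if max_residues < 1:
        raise ValueError("max_residues must be positive")
    if not sequence:
        raise ValueError("sequence must be non-empty")

    crops = []
    for start in range(0, len(sequence), max_residues):
        end = min(start + max_residues, len(sequence))
        tokens = list(sequence[start:end])  # one token per residue
        if start == 0:
            tokens.insert(0, BOS)  # this crop contains the real N-terminus
        if end == len(sequence):
            tokens.append(EOS)  # this crop contains the real C-terminus
        crops.append(tokens)
    return crops


if __name__ == "__main__":
    # Toy example with a tiny window so the three cases are visible.
    for crop in crop_with_boundary_tokens("ABCDEFGHIJ", max_residues=4):
        print(crop)
```

With this layout, a sequence short enough to fit in one crop gets both tokens, the first crop of a long sequence gets only BOS, the last gets only EOS, and any crop in between gets none. Whether the resulting embeddings are better than always adding both tokens would need to be checked on the downstream task.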