facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

use of bos and eos tokens when training #299

Closed by zhenyuhe00 2 years ago

zhenyuhe00 commented 2 years ago

Hi, congrats on your great work! I wonder: did you prepend BOS and append EOS tokens to every cropped sequence, or only to complete protein sequences?

Thanks in advance!

tomsercu commented 2 years ago

On all sequences, even cropped ones.
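
This matches the released esm package, whose batch converter adds BOS/EOS around every input string regardless of its length. A minimal sketch of checking that, assuming the public esm API and network access to download a checkpoint (this shows inference-time tokenization only, so it does not by itself settle what was done to crops during training):

```python
import esm

# Load a small released ESM-2 checkpoint and its alphabet
# (the 8M model keeps the download light; larger checkpoints behave the same).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()

# The alphabet records whether BOS/EOS are added around every input.
print(alphabet.prepend_bos, alphabet.append_eos)  # True True

# Tokenize a short fragment, as if it were a cropped sequence.
_, _, tokens = batch_converter([("fragment", "MKTAYIAKQR")])

# The converter places <cls> (used as BOS) first and <eos> last,
# whether the string is a full protein or a crop.
assert tokens[0, 0] == alphabet.cls_idx
assert tokens[0, -1] == alphabet.eos_idx
```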

zhenyuhe00 commented 2 years ago

Thanks!

felbecker commented 1 year ago

@tomsercu The ESM-2 paper says the opposite:

We used BOS and EOS tokens to signal the beginning and end of a real protein, to allow the model to separate a full-sized protein from a cropped one
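
One plausible reading of that wording is that BOS/EOS track the true protein ends, so an interior crop would carry neither token. A minimal sketch of that reading (a hypothetical helper for illustration, not the actual ESM-2 training code):

```python
def tokenize_crop(residue_tokens, start, end, bos_idx, eos_idx):
    # Hypothetical illustration of the paper's wording: BOS/EOS mark the
    # real protein ends, so a crop keeps them only where it touches those
    # ends. This is NOT the actual ESM-2 training pipeline.
    crop = list(residue_tokens[start:end])
    if start == 0:                     # crop includes the true N-terminus
        crop.insert(0, bos_idx)
    if end == len(residue_tokens):     # crop includes the true C-terminus
        crop.append(eos_idx)
    return crop
```

Under the answer earlier in this thread, by contrast, both tokens would be added unconditionally.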

y-hwang commented 1 year ago

@tomsercu I have the same question as @felbecker: were partial/truncated proteins given BOS/EOS tokens during training? Thank you!