facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

"-" token in ESM1b alphabet #300

Closed: adrienchaton closed this issue 2 years ago

adrienchaton commented 2 years ago

Hi ESM team,

Looking at alphabet.all_toks for ESM1b, I see a "-" token and I am not sure why it is there.

Is it there for compatibility with the MSA Transformer? If so, I guess it should not be used with ESM1b.

Or did you also use gapped sequences during ESM1b pre-training? In that case, it could make sense to use this token with ESM1b.

My question arises from embedding an MSA sequence by sequence with ESM1b. No error is thrown when passing a gapped sequence, but as far as I understand, if ESM1b was never shown gapped sequences during training, the result would be corrupted. In that case we should rather drop the gaps, embed the ungapped sequence, and e.g. shift the residue embeddings back to their alignment positions.
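For context, something like the following shows what I mean (the toy sequence is made up; I am using the standard esm.pretrained.esm1b_t33_650M_UR50S loader):

```python
import esm

# Load ESM-1b and its alphabet (downloads the weights on first use).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
print("-" in alphabet.all_toks)   # True: the gap token is in the vocabulary
print(alphabet.get_idx("-"))      # and it maps to a valid index

# A gapped sequence from an MSA is tokenized without any error or warning.
batch_converter = alphabet.get_batch_converter()
labels, strs, tokens = batch_converter([("seq1", "MKT-LL-AYV")])
print(tokens)                     # the "-" positions become ordinary token ids
```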

I hope that makes sense. Thanks for your hints!

tomsercu commented 2 years ago

We didn't train ESM-1b with gap tokens, so passing them as input to ESM-1b will be completely out of distribution. Note the vocab has more tokens than are actually used, so that later projects like the MSA Transformer can use the same vocab.

we should rather drop gaps, embed and e.g. shift residue embeddings to their alignment position

agree 100%.
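For anyone landing here, a rough sketch of that approach could look like the code below (toy alignment; it assumes the usual fair-esm API and the ESM-1b convention that token 0 is the BOS token, so residue i sits at token position i + 1):

```python
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

aligned = "MKT-LL--AYVA"                      # one row of an MSA (toy example)
keep = [i for i, c in enumerate(aligned) if c != "-"]
ungapped = "".join(aligned[i] for i in keep)

# Embed the ungapped sequence only.
_, _, tokens = batch_converter([("seq", ungapped)])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33][0]          # (len(ungapped) + 2, dim), incl. BOS/EOS

# Scatter the per-residue embeddings back to their alignment columns;
# gap columns are left as zeros here.
aligned_reps = torch.zeros(len(aligned), reps.shape[-1])
for j, col in enumerate(keep):
    aligned_reps[col] = reps[j + 1]           # +1 skips the BOS token
```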

adrienchaton commented 2 years ago

Thanks @tomsercu, you confirm what I thought. I don't know if this has misled other users ... maybe it would be worth a warning message when an input contains a token the model was not trained on, e.g. ESM1b would warn when forwarding "-" tokens, whereas the MSA Transformer wouldn't.
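In the meantime, a user-side guard along those lines is easy to add (the helper name and the set of flagged tokens below are just illustrative; only "-" is discussed in this thread):

```python
import warnings

# Tokens that are in the ESM-1b vocabulary but were not seen during its training.
UNSEEN_TOKENS = {"-"}

def check_sequence(seq: str) -> str:
    """Warn before feeding ESM-1b a token it was not trained on."""
    found = UNSEEN_TOKENS.intersection(seq)
    if found:
        warnings.warn(f"Sequence contains tokens ESM-1b was not trained on: {sorted(found)}")
    return seq

check_sequence("MKT-LLAYV")   # emits a UserWarning because of the gap character
```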