facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

What is the best way to gather the representation of the protein? #315

Closed · smiles724 closed this issue 1 year ago

smiles724 commented 1 year ago

Hi, thanks for providing such useful tools! I wonder, though, what the best way is to gather the representation of a protein.

To be specific, the official documentation gives an example of generating per-sequence representations by averaging the token-level representations. But as you know, transformer-based models in NLP typically use the first token, [CLS], as the per-sequence representation.

Have you tried these two different methods? Or does the choice depend empirically on the task?
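For concreteness, the two options would look something like this (a rough sketch based on the README example; the checkpoint, layer index, and sequence here are just illustrative):

```python
import torch
import esm

# Load a pretrained ESM-2 model, as in the README example.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# One illustrative sequence; the batch converter prepends a bos token
# and appends an eos token around the residue tokens.
data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33])
token_representations = results["representations"][33]  # [batch, tokens, dim]

# Option 1 (README): average over residue positions, skipping bos/eos.
seq_len = len(strs[0])
mean_repr = token_representations[0, 1 : seq_len + 1].mean(0)

# Option 2 (NLP-style): take the first (bos) token, analogous to [CLS].
bos_repr = token_representations[0, 0]
```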

tomsercu commented 1 year ago

In the front-page README we mention:

> `mean` includes the embeddings averaged over the full sequence, per layer.
> `bos` includes the embeddings from the beginning-of-sequence token. (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision.)

In general we don't expect the bos token (equivalent to CLS) to have meaningful representations, as it hasn't been supervised.
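For reference, the quoted passage describes the output of the repository's bulk extraction script, scripts/extract.py. Saving only the mean per-sequence embeddings would look roughly like this (the FASTA path and output directory are placeholders):

```bash
python scripts/extract.py esm2_t33_650M_UR50D proteins.fasta output_reprs/ \
    --repr_layers 33 --include mean
```

Each resulting .pt file should then contain a `mean_representations` dict keyed by layer index, per the README.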

smiles724 commented 1 year ago

Thanks for your reply.