facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

How come the networks include <cls>, <eos>, <unk> and other similar tokens? #18

Closed. tueboesen closed this issue 3 years ago.

tueboesen commented 3 years ago

This is not an issue, so I apologize for putting it here, but I didn't really know where else to ask.

I have been testing the various pretrained networks you have trained in this repository, and they seem very interesting; I might use them in a paper I'm working on, so I would like to understand them in detail. One thing I do not understand about the networks is why they include so many special tokens. I get that you need the masking token, and similarly the padding token for handling batches of proteins of various sizes. The cls and eos tokens are placed just before and after a protein, but they seem unnecessary for proteins unless I'm missing something. The unk token should signal that an amino acid is unknown, if I understand correctly, but isn't X generally the catch-all in protein notation for unknown amino acids? So what is the use case here? And similarly for the last few tokens, for which I have no good guess.
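
(For readers who want to see the full token list: it can be inspected from the Alphabet object that ships with the pretrained models. A minimal sketch; the checkpoint name is only an example, and the attribute names are those exposed by the esm package at the time of writing.)

```python
import esm

# Load any pretrained checkpoint together with its Alphabet (vocabulary) object.
# esm1b_t33_650M_UR50S is only an example; other checkpoints expose the same attributes.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()

# Full token list, including the special tokens discussed in this issue.
print(alphabet.all_toks)

# Indices of the special tokens, plus the flags that control whether the batch
# converter prepends <cls> and appends <eos>.
print(alphabet.cls_idx, alphabet.eos_idx, alphabet.unk_idx,
      alphabet.padding_idx, alphabet.mask_idx)
print(alphabet.prepend_bos, alphabet.append_eos)
```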

tomsercu commented 3 years ago

For now, GitHub issues are a good place to ask questions 👍 You're right, there are a number of tokens in the vocab that have no good reason to be there. We use fairseq to train the models and largely stick to its conventions when it comes to the vocab. The unusual tokens are completely unseen in the training data, so they shouldn't be used, but their dummy presence shouldn't hurt either.

joshim5 commented 3 years ago

@tueboesen To clarify further, it's important to follow the conventions if you use these models for downstream tasks. For example, cls and eos need to be prepended and appended to the sequences, respectively, to get the best performance. Thanks for your interest, and let us know if you have any more questions!
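
(The repository's own batch converter already follows this convention, so the tokens normally don't have to be added by hand. A minimal sketch based on the README quickstart; the checkpoint name is only an example.)

```python
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRG")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# The converter prepends <cls> and appends <eos> for you.
assert batch_tokens[0, 0] == alphabet.cls_idx
assert batch_tokens[0, -1] == alphabet.eos_idx  # last position, since a single-sequence batch has no padding
```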

jiosephlee commented 7 months ago

@joshim5 @tomsercu Just to jump in, I have a few quick follow-up questions. "The unusual tokens are completely unseen in training data": does this apply to the cls/eos tokens as well? I'd be surprised if the CLS token improved performance for downstream applications without the model having seen it during training. Also, is there a need to manually prepend/append the cls/eos tokens? It seems like the Hugging Face version of the tokenizer adds these tokens automatically.
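
(On the last point, the tokenizer's behaviour is easy to check directly. A short sketch; the checkpoint name is only an example of an ESM-2 model on the Hub, and the printed tokens are what I would expect rather than a guarantee.)

```python
from transformers import AutoTokenizer

# Example checkpoint; other ESM checkpoints on the Hub should expose the same tokenizer behaviour.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

enc = tokenizer("MKTAYIAKQR")  # add_special_tokens=True is the default
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Expected: the residues wrapped in <cls> ... <eos>, i.e. no manual prepending/appending needed.
```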

"to get the best performance" does this also depend on the fact that the CLS token is used for classifiers? For some context, for other models like BERT or ViTs, I'm seeing arguments for average pooling of the token embeddings rather than the CLS token. I'm curious if there's a recommendation for ESM.

gorj-tessella commented 5 months ago

I also have this question. I noticed that in the Hugging Face code, EsmForSequenceClassification uses EsmClassificationHead, which uses only the encoding at token position 0, which should be <cls> (the code notes "take <s> token (equiv. to [CLS])"). This is obviously different from the "mean_representations" value typically generated by extract.py, which is the average over the sequence tokens, excluding the <cls> and <eos> tokens (roughly as sketched below).
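
(For reference, this is roughly how the "mean_representations" convention looks in the native esm API, next to the position-0 pooling described above. A sketch based on the README quickstart; the checkpoint name is only an example.)

```python
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRG")]
_, _, batch_tokens = batch_converter(data)
seq_len = int((batch_tokens[0] != alphabet.padding_idx).sum())  # includes <cls> and <eos>

with torch.no_grad():
    reps = model(batch_tokens, repr_layers=[33])["representations"][33]

# extract.py-style "mean_representations": average over residue positions only,
# i.e. drop the <cls> at position 0 and the <eos> at position seq_len - 1.
mean_repr = reps[0, 1 : seq_len - 1].mean(0)

# <cls>-style pooling, as EsmClassificationHead does: the embedding at position 0.
cls_repr = reps[0, 0]
```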

Is there some justification for using the <cls> token embedding vs. the mean sequence token embedding?