I notice that there are several special tokens in the alphabet, which are neither amino acids nor gaps. What do they mean?

These correspond to:
- `<cls>` (classification token): in ESM-1 and the MSA Transformer, this is the beginning-of-sentence token.
- `<pad>` (padding token): enables sequences of variable length in the same batch; the model ignores pad tokens.
- `<unk>` (unknown token): if an input contains a token that isn't in the trained dictionary, the tokenizer replaces it with `<unk>` so that inference still works.
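
If it helps to see these concretely, here is a minimal sketch using the fair-esm package (the `Alphabet` attributes shown, e.g. `cls_idx`/`padding_idx`/`unk_idx`, come from that package; I'm assuming the ESM-1b vocabulary, and the printed index values are illustrative):

```python
# Minimal sketch, assuming fair-esm (pip install fair-esm).
# Building the Alphabet directly avoids downloading model weights.
import esm

alphabet = esm.Alphabet.from_architecture("ESM-1b")
print(alphabet.cls_idx, alphabet.padding_idx, alphabet.unk_idx)

# The batch converter prepends <cls> to every sequence and right-pads
# shorter sequences with <pad>, so one batch can mix lengths:
batch_converter = alphabet.get_batch_converter()
labels, strs, tokens = batch_converter([("seq1", "MKTAYIAK"), ("seq2", "MKT")])
print(tokens[1])  # the "MKT" row ends in alphabet.padding_idx
```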
You can ignore `.` and `<null_1>`. Additional tokens are often included in embedding dictionaries in order to pad their size to a desired length for computational reasons.
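
For completeness, the full vocabulary, including `.`, `-`, and the `<null_1>` filler, can be listed from the same `alphabet` object as in the sketch above:

```python
# all_toks shows the full vocabulary, including the gap characters and the
# <null_1> token that rounds the vocabulary size up (to a multiple of 8
# in fair-esm) for computational reasons.
print(alphabet.all_toks)
# ['<cls>', '<pad>', '<eos>', '<unk>', 'L', 'A', ..., '.', '-', '<null_1>', '<mask>']
```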