Closed: a-marx closed this issue 8 months ago
I suggest you read the source code of the CANINE tokenizer at:
The short answer is that this is a convention generally adopted from BERT. See the link below:
https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer
In particular, see the following segment of the docs:
sep_token (str, optional, defaults to "[SEP]") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
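For illustration, here is a minimal runnable sketch of that behavior; the transformers library and the bert-base-uncased checkpoint are my choices for the example, not something named above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sequence: [SEP] is appended as the final special token,
# playing the role an [EOS] token would play in other vocabularies.
single = tokenizer("hello world")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
# ['[CLS]', 'hello', 'world', '[SEP]']

# Sequence pair (e.g. a text and a question): [SEP] also separates the two segments.
pair = tokenizer("first segment", "second segment")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# ['[CLS]', ..., '[SEP]', ..., '[SEP]']
```

In other words, BERT-style vocabularies have no separate [EOS]; [SEP] doubles as the end-of-sequence marker, which is why reusing it for the eos token is a reasonable convention.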
Let me know if you need more help.
Thank you for providing this great library. I am relatively new to NLP, so I may have misunderstood something. Could you please explain why [SEP] is assigned instead of [EOS] in the following line?
https://github.com/dariush-bahrami/character-tokenizer/blob/94a5d5b7a37369b69c2d3c8afe2bc368a94a43a3/charactertokenizer/core.py#L36
It seems like the same token is used for <sep> and <eos> (a rough sketch of what I mean follows below). Thank you in advance!
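For reference, this is a hypothetical paraphrase of what the linked line appears to do, written by me for illustration rather than copied from the repository:

```python
# Hypothetical paraphrase of the linked assignment (not verbatim from core.py):
sep_token = "[SEP]"
eos_token = "[SEP]"  # the same literal reused as the end-of-sequence token
```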