dariush-bahrami / character-tokenizer

A character tokenizer for Hugging Face Transformers
MIT License

Confusion Regarding EOS Token #1

Closed a-marx closed 8 months ago

a-marx commented 8 months ago

Thank you for providing this great library. I am relatively new to NLP, so it may be that I have misunderstood something. Could you please explain why [SEP] is assigned instead of [EOS] in the following line?

https://github.com/dariush-bahrami/character-tokenizer/blob/94a5d5b7a37369b69c2d3c8afe2bc368a94a43a3/charactertokenizer/core.py#L36

It seems like the same token is used for <sep> and <eos>. Thank you in advance!

dariush-bahrami commented 8 months ago

I suggest you read the source code of Canine Tokenizer at:

https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/models/canine/tokenization_canine.py#L82

The short answer is that this is a convention generally adopted from BERT. See the link below:

https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer

Especially the following segment of the docs:

sep_token (str, optional, defaults to "[SEP]") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
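In other words, under the BERT-style convention a single sequence is wrapped as `[CLS] tokens [SEP]`, so `[SEP]` already serves as the end-of-sequence marker and no separate `[EOS]` token is required. A minimal, self-contained sketch of that convention (not the actual library code; the token strings and function name mirror the BERT docs quoted above):

```python
# Sketch of the BERT-style convention quoted above: [SEP] both separates
# sequence pairs and closes a sequence built with special tokens.
CLS, SEP = "[CLS]", "[SEP]"

def build_inputs_with_special_tokens(tokens_a, tokens_b=None):
    """Return [CLS] A [SEP] for one sequence, or [CLS] A [SEP] B [SEP] for a pair."""
    if tokens_b is None:
        return [CLS] + tokens_a + [SEP]
    return [CLS] + tokens_a + [SEP] + tokens_b + [SEP]

# A single character sequence ends with [SEP], playing the role of [EOS]:
print(build_inputs_with_special_tokens(list("hi")))
# -> ['[CLS]', 'h', 'i', '[SEP]']
```

Because the last token of every built sequence is `[SEP]`, pointing `eos_token` at the same `[SEP]` string keeps the tokenizer consistent with this convention.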

Let me know if you need more help.