dariush-bahrami / character-tokenizer

A character tokenizer for Hugging Face Transformers
MIT License

Confusion Regarding EOS Token #1

Closed a-marx closed 8 months ago

a-marx commented 8 months ago

Thank you for providing this great library. I am relatively new to NLP, so it may be that I have misunderstood something. Could you please explain why [SEP] is assigned instead of [EOS] in the following line?

https://github.com/dariush-bahrami/character-tokenizer/blob/94a5d5b7a37369b69c2d3c8afe2bc368a94a43a3/charactertokenizer/core.py#L36

It seems like the same token is used for <sep> and <eos>. Thank you in advance!

dariush-bahrami commented 8 months ago

I suggest you read the source code of Canine Tokenizer at:

https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/models/canine/tokenization_canine.py#L82

The short answer is that this is a convention generally adopted from BERT. See the link below:

https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer

Especially the following segment of the docs:

sep_token (str, optional, defaults to "[SEP]") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
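In other words, under the BERT-style convention a single sequence is wrapped as `[CLS] tokens [SEP]`, so `[SEP]` already serves as the end-of-sequence marker and no separate `[EOS]` token is required. A minimal, self-contained sketch of that convention (not the actual library code; the token strings and function name mirror the BERT docs quoted above):

```python
# Sketch of the BERT-style convention quoted above: [SEP] both separates
# sequence pairs and closes a sequence built with special tokens.
CLS, SEP = "[CLS]", "[SEP]"

def build_inputs_with_special_tokens(tokens_a, tokens_b=None):
    """Return [CLS] A [SEP] for one sequence, or [CLS] A [SEP] B [SEP] for a pair."""
    if tokens_b is None:
        return [CLS] + tokens_a + [SEP]
    return [CLS] + tokens_a + [SEP] + tokens_b + [SEP]

# A single character sequence ends with [SEP], playing the role of [EOS]:
print(build_inputs_with_special_tokens(list("hi")))
# -> ['[CLS]', 'h', 'i', '[SEP]']
```

Because the last token of every built sequence is `[SEP]`, pointing `eos_token` at the same `[SEP]` string keeps the tokenizer consistent with this convention.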

Let me know if you need more help.