THUDM / icetk

A unified tokenization tool for Images, Chinese and English.

Retrieve the value of the end-of-text-token #1

Closed (teetone closed this issue 2 years ago)

teetone commented 2 years ago

For Hugging Face tokenizers, I can get the end-of-text token via the eos_token attribute:

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
t.eos_token    # Output is '<|endoftext|>'

I was wondering if there is something similar for the ICE tokenizer.

Sleepychord commented 2 years ago

Hi, '<s>' and '</s>' are icetk[20001] and icetk[20002], but you can also add your own special tokens.

teetone commented 2 years ago

@Sleepychord Thank you for the reply. Could I check if the following is expected:

>>> from icetk import icetk as tokenizer
>>> tokenizer.encode('</s>')
[20098, 20106, 20033]
>>> tokenizer.encode('<s>')
[20046, 20106, 20033]
>>> tokenizer.decode([20001])
''
>>> tokenizer.decode([20002])
''

I would expect tokenizer.encode('</s>') to yield [20002].

teetone commented 2 years ago

Hi @Sleepychord, can I follow up on this?

Sleepychord commented 2 years ago

Oh, they're control tokens and will not be printed. They are designed to be added manually rather than produced by tokenization, in case the text contains the literal strings "<s>" and "</s>". If you want to add an in-text separation symbol, please use icetk.add_special_tokens(token_list) the first time you import icetk. Their ids are appended to the dictionary.
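
A minimal sketch of the manual approach described above: tokenize the text as usual, then attach the control-token ids yourself. The ids 20001 and 20002 are the ones quoted earlier in this thread; the helper name wrap_with_control_tokens and its parameters are hypothetical and not part of icetk.

```python
from typing import List, Sequence

# Ids quoted earlier in this thread for '<s>' and '</s>'.
BOS_ID = 20001
EOS_ID = 20002


def wrap_with_control_tokens(
    ids: Sequence[int], add_bos: bool = True, add_eos: bool = True
) -> List[int]:
    """Return a new list with the control-token ids attached.

    Hypothetical helper for illustration; it does not call icetk itself.
    """
    wrapped = list(ids)  # copy so the caller's sequence is left untouched
    if add_bos:
        wrapped.insert(0, BOS_ID)
    if add_eos:
        wrapped.append(EOS_ID)
    return wrapped
```

With icetk this would presumably be called as wrap_with_control_tokens(tokenizer.encode(text)), using the `from icetk import icetk as tokenizer` import shown earlier in the thread. Because the ids are attached after tokenization, a literal "</s>" in the input text is still encoded as ordinary tokens.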

teetone commented 2 years ago

@Sleepychord That makes sense. Thank you!