teetone closed this issue 2 years ago.
Hi, '<s>' and '</s>' are icetk[20001] and icetk[20002], but you can also add your own special tokens.
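(A minimal sketch of what this implies in practice, assuming the ids 20001 and 20002 quoted in this thread; the control tokens are attached to the id sequence by hand rather than produced by encode():)

from icetk import icetk as tokenizer

# Ids taken from this thread; assumed, not looked up from an official constant.
BOS_ID, EOS_ID = 20001, 20002

# Encode the raw text, then prepend/append the control-token ids manually,
# since encoding the literal strings '<s>'/'</s>' would split them into pieces.
ids = [BOS_ID] + tokenizer.encode('some text') + [EOS_ID]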
@Sleepychord Thank you for the reply. Could I check if the following is expected:
>>> from icetk import icetk as tokenizer
>>> tokenizer.encode('</s>')
[20098, 20106, 20033]
>>> tokenizer.encode('<s>')
[20046, 20106, 20033]
>>> tokenizer.decode([20001])
''
>>> tokenizer.decode([20002])
''
I would expect tokenizer.encode('</s>') to yield [20002].
Hi @Sleepychord, can I follow up on this?
Oh, they're control tokens and will not be printed. They are designed to be added manually rather than produced by tokenization, in case the literal strings "<s>" and "</s>" appear in real text. If you want to add an in-text separation symbol, please use icetk.add_special_tokens(token_list) the first time you import icetk. Their ids are appended to the dictionary.
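(A minimal sketch of that suggestion; only add_special_tokens itself comes from this thread, and the token names below are placeholders for illustration:)

from icetk import icetk

# Register custom in-text separators once, right after importing icetk.
# '<sep>' and '<eop>' are invented names; their ids are appended to the
# end of the existing dictionary.
icetk.add_special_tokens(['<sep>', '<eop>'])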
@Sleepychord That makes sense. Thank you!
I can get the end-of-text token for Hugging Face tokenizers via eos_token. I was wondering if there is something similar for the ICE tokenizer.
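(For reference, a rough sketch of the comparison; icetk is not described in this thread as exposing an eos_token attribute, so the id 20002 is written out directly as an assumption based on the replies above:)

from transformers import AutoTokenizer
from icetk import icetk

# Hugging Face tokenizers expose the end-of-text token as attributes.
hf_tok = AutoTokenizer.from_pretrained('gpt2')
print(hf_tok.eos_token, hf_tok.eos_token_id)

# For icetk, per the replies above, '</s>' is the control token 20002
# (taken from this thread, not an official constant).
ICE_EOS_ID = 20002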