THUDM / icetk

A unified tokenization tool for Images, Chinese and English.

What's the meaning of token 20005? #7

Closed (xu-song closed this issue 1 week ago)

xu-song commented 1 year ago
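from icetk import icetk

tokens = icetk.encode('你好世界!这里是 icetk。')
for token in tokens:
    # The ids are offset by 20000 here, so subtract it to index the text vocabulary.
    print(token, icetk.text_tokenizer.proto.pieces[token - 20000].piece)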
20005 ▁
94874 你好
84097 世界
20035 !
94947 这里是
22881 ▁ice
35955 tk
83823 。

What is "▁" used for?
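
A note on the question, as a sketch rather than a confirmed answer: "▁" (U+2581, not an ASCII underscore) is the marker SentencePiece-style tokenizers use for whitespace, and the `proto.pieces` access above suggests the text tokenizer is of that kind. Under that assumption, token 20005 would be a standalone whitespace marker, most likely the dummy prefix placed before the first word, while "▁ice" carries the space before "icetk". The snippet below does not call icetk at all. It only rebuilds the input string from the pieces printed above; `pieces_to_text` is a hypothetical helper, not part of the library.

```python
WS_MARKER = "\u2581"  # "▁" (LOWER ONE EIGHTH BLOCK), not an ASCII underscore

# Piece strings copied from the output above.
pieces = [WS_MARKER, "你好", "世界", "!", "这里是", WS_MARKER + "ice", "tk", "。"]


def pieces_to_text(pieces):
    """Join pieces and turn whitespace markers back into spaces (hypothetical helper)."""
    text = "".join(pieces).replace(WS_MARKER, " ")
    # Assumption: the leading marker is a dummy prefix, so drop one leading space.
    return text[1:] if text.startswith(" ") else text


print(pieces_to_text(pieces))
```

If the assumption holds, joining the pieces and mapping the marker back to a space round-trips to the original input, which is why the marker shows up both on its own and glued to "ice".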