Evertt opened 11 months ago
Never mind, apparently I had to call `const encoded = tik.encode(input, "all")` instead of `const encoded = tik.encode(input)`.

I'm still a little confused about why I need to do that, though, and whether `"all"` is really the best option. Should I use `"all"`, or should I use `["<|im_start|>", "<|im_end|>", "<|im_sep|>"]`?
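(For anyone landing here later, the behaviour I was confused by can be sketched roughly like this. This is a toy illustration, not the real tiktoken internals: `checkSpecial` and the token list are hypothetical stand-ins for whatever the library does when it encounters a special token that wasn't explicitly allowed.)

```javascript
// Toy model of how an allowedSpecial argument gates special tokens.
// NOT the real tiktoken implementation; the token set is illustrative only.
const SPECIAL_TOKENS = ["<|im_start|>", "<|im_end|>", "<|im_sep|>", "<|endoftext|>"];

// Hypothetical helper: throws if `text` contains a special token
// that the caller did not explicitly allow.
function checkSpecial(text, allowedSpecial = []) {
  const allowed = allowedSpecial === "all"
    ? new Set(SPECIAL_TOKENS)
    : new Set(allowedSpecial);
  for (const tok of SPECIAL_TOKENS) {
    if (text.includes(tok) && !allowed.has(tok)) {
      throw new Error(`The text contains a disallowed special token: ${tok}`);
    }
  }
}

checkSpecial("<|im_start|>user", "all");            // passes
checkSpecial("<|im_start|>user", ["<|im_start|>"]); // passes
// checkSpecial("<|im_start|>user");                // would throw
```

So omitting the second argument means "no special tokens allowed", which is why the plain `tik.encode(input)` call errored on my input.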
Hi @Evertt!
The reason `all` is optional here is to mimic the behaviour of the official openai/tiktoken library, which assumes we're encoding user input directly. `all` will also include other special tokens, which may not be desired depending on the input being received. Namely:
```
"<|endoftext|>": 100257,
"<|fim_prefix|>": 100258,
"<|fim_middle|>": 100259,
"<|fim_suffix|>": 100260,
"<|endofprompt|>": 100276
```
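To make the trade-off concrete, here's a small sketch (illustration only, not the real WASM encoder) of how an explicit allow-list differs from `"all"`, using the special-token IDs quoted above:

```javascript
// Illustrative only: models the allowed-special check, not real BPE encoding.
// IDs are the special-token IDs listed above.
const SPECIAL_IDS = {
  "<|endoftext|>": 100257,
  "<|fim_prefix|>": 100258,
  "<|fim_middle|>": 100259,
  "<|fim_suffix|>": 100260,
  "<|endofprompt|>": 100276,
};

// Would this special token be accepted, given the allowedSpecial argument?
function isAllowed(token, allowedSpecial) {
  return allowedSpecial === "all" || allowedSpecial.includes(token);
}

const chatOnly = ["<|im_start|>", "<|im_end|>", "<|im_sep|>"];

isAllowed("<|im_start|>", chatOnly);  // true  -- chat markers pass
isAllowed("<|endoftext|>", chatOnly); // false -- fim/endoftext tokens rejected
isAllowed("<|endoftext|>", "all");    // true  -- "all" waves everything through
```

In other words, the explicit list is the stricter option: it permits only the markers you actually expect in your input, whereas `"all"` also accepts the fim/endoftext family shown above.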
Using:
And then trying to encode this string:
I get the following error:
You'd think that might imply that the wasm already contains those special tokens. However, when I try to encode that string without adding those special tokens manually, the output is this:
Which doesn't line up with what I get from the tiktokenizer demo on Vercel: