Closed stephenasuncionDEV closed 1 year ago
Hello @stephenasuncionDEV!
The easiest thing would be to spread your special tokens.
const encoding = new Tiktoken(
  model.bpe_ranks,
  {
    ...model.special_tokens,
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  },
  model.pat_str,
);
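To make the spread step concrete, here is a minimal sketch of what the merged object looks like. `baseSpecialTokens` is a hypothetical stand-in for `model.special_tokens` (the real map comes from the encoder file and contains more entries); only the three ChatML IDs are taken from the snippet above.

```typescript
// Hypothetical stand-in for model.special_tokens; the real map is loaded
// from the encoder JSON and is larger than this.
const baseSpecialTokens: Record<string, number> = {
  "<|endoftext|>": 100257,
};

// The spread copies the existing entries into a new object, then the
// ChatML tokens are added. The base map itself is left unchanged.
const chatSpecialTokens: Record<string, number> = {
  ...baseSpecialTokens,
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
};

console.log(Object.keys(chatSpecialTokens).length); // 4
```

Because the result is a new object, the same base map can be reused for other encoders without carrying the ChatML tokens along.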
I seem to be getting "The text containing a special token that is not allowed: <|im_start|>" when extending the special token object.
export const getTokenCount = async (
  messages: ProviderMessage[],
  model: ProviderModel,
) => {
  await init((imports) => WebAssembly.instantiate(wasm, imports));
  const encoding = new Tiktoken(
    cl100k_base.bpe_ranks,
    {
      ...cl100k_base.special_tokens,
      "<|im_start|>": 100264,
      "<|im_end|>": 100265,
      "<|im_sep|>": 100266,
    },
    cl100k_base.pat_str,
  );
  const isGpt3 = model === "gpt-3.5-turbo";
  const msgSep = isGpt3 ? "\n" : "";
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>";
  const serialized = [
    messages
      .map(({ role, content }) => {
        return `<|im_start|>${role}${roleSep}${content}<|im_end|>`;
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep);
  const tokens = encoding.encode(serialized);
  encoding.free();
  return tokens.length;
};
By default, if a special token appears in the text passed to encode, the encoder aborts.
Use the following code:
const tokens = encoding.encode(serialized, "all");
or
const tokens = encoding.encode(serialized, ["<|im_start|>", "<|im_sep|>", "<|im_end|>"]);
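For anyone wondering why the second argument matters, the check can be pictured roughly like the sketch below. This is an illustration only, not the library's actual implementation: `findDisallowedSpecial` and `SpecialSet` are hypothetical names, and the sketch assumes the default case where every special token that is not explicitly allowed counts as disallowed.

```typescript
type SpecialSet = "all" | string[];

// Returns the first special token found in `text` that has not been
// allowed, or null if the text is safe to encode.
function findDisallowedSpecial(
  text: string,
  specialTokens: string[],
  allowedSpecial: SpecialSet = [],
): string | null {
  // "all" allows every registered special token.
  if (allowedSpecial === "all") return null;
  for (const token of specialTokens) {
    const isAllowed = allowedSpecial.indexOf(token) !== -1;
    if (!isAllowed && text.indexOf(token) !== -1) return token;
  }
  return null;
}

const special = ["<|im_start|>", "<|im_end|>", "<|im_sep|>"];
const text = "<|im_start|>user\nhi<|im_end|>";

console.log(findDisallowedSpecial(text, special)); // "<|im_start|>"
console.log(findDisallowedSpecial(text, special, "all")); // null
console.log(findDisallowedSpecial(text, special, ["<|im_start|>"])); // "<|im_end|>"
```

With no second argument the first ChatML token in the text is rejected, which matches the error reported above. Listing only some of the tokens moves the failure to the next one that was not listed, so the array form has to name every special token that can appear in the serialized text.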
Solved, Thank you!
How would I extend the special tokens with the Lite package? Do I need to extend the encoder and add im_start etc., or do I just serialize the chat message and pass it to the encoder?
I'm trying to extend special tokens on the lite package for chatML like the following:
Originally posted by @dqbd in https://github.com/dqbd/tiktoken/issues/23#issuecomment-1483880292