Hello @loretoparisi, when using the Chat Completion API, the (ChatML) message needs to be serialized before sending it to the tokenizer. See https://tiktokenizer.vercel.app/ to understand how it (most likely) behaves.
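For example, a single user message is (most likely) flattened into a plain ChatML string before tokenization. A rough sketch, based on Tiktokenizer's rendering rather than any official documentation:
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant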
So does this mean that I have to pass a serialized JSON message?
var str = "Correct the spelling and grammar\n\nShe no went to the market."
const messageObj = [{
role: "user",
content: str
}];
str = JSON.stringify(messageObj)
If I look at your app, you do something like this:
const enc = get_encoding("cl100k_base", {
"<|im_start|>": 100264,
"<|im_end|>": 100265,
"<|im_sep|>": 100266,
// TODO: very hacky
// "system name=": 900000,
// "assistant name=": 900001,
// "user name=": 900002,
});
const encoded = enc.encode(str, "all"); // "all" allows every special token (e.g. <|im_start|>) to be encoded
but in this case I'm getting 23 tokens, not 21!
@loretoparisi Use the following code snippet extracted from Tiktokenizer (I will most likely include it directly in the library later).
note: updated as of 25/03/2023
function getChatGPTEncoding(
messages: { role: string; content: string; name: string }[],
model: "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
) {
const isGpt3 = model === "gpt-3.5-turbo";
const encoder = encoding_for_model(model, {
"<|im_start|>": 100264,
"<|im_end|>": 100265,
"<|im_sep|>": 100266,
});
const msgSep = isGpt3 ? "\n" : "";
const roleSep = isGpt3 ? "\n" : "<|im_sep|>";
const serialized = [
messages
.map(({ name, role, content }) => {
return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
})
.join(msgSep),
`<|im_start|>assistant${roleSep}`,
].join(msgSep);
return encoder.encode(serialized, "all");
}
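Note the design choice here: for gpt-3.5-turbo both the message separator and the role separator are plain newlines, while for gpt-4 the role separator is the dedicated <|im_sep|> special token and messages are joined without a separator. As a sketch of what the function produces for a single user message "Hello!" (assuming the mapping above):
// gpt-3.5-turbo
<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n
// gpt-4
<|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>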
Amazing!
{ token: 100264, string: '<|im_start|>' }
{ token: 882, string: 'user' }
{ token: 198, string: '\n' }
{ token: 34192, string: 'Correct' }
{ token: 279, string: ' the' }
{ token: 43529, string: ' spelling' }
{ token: 323, string: ' and' }
{ token: 32528, string: ' grammar' }
{ token: 271, string: '\n\n' }
{ token: 8100, string: 'She' }
{ token: 912, string: ' no' }
{ token: 4024, string: ' went' }
{ token: 311, string: ' to' }
{ token: 279, string: ' the' }
{ token: 3157, string: ' market' }
{ token: 13, string: '.' }
{ token: 100265, string: '<|im_end|>' }
{ token: 198, string: '\n' }
{ token: 100264, string: '<|im_start|>' }
{ token: 78191, string: 'assistant' }
chars:62 words:11 tokens:20 token/word ratio:1.818 char/token ratio:3.1
We are almost there; maybe it's my fault, but now I get 20 tokens. I did:
/**
 *
 * @param {{ role: string, content: string, name: string }[]} messages
 * @param {"gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"} model
 * @returns {{ encoder: Tiktoken, encoded: Uint32Array }}
 */
function getChatGPTEncoding(
messages = [],
model = "gpt-3.5-turbo" // one of "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
) {
const isGpt3 = model === "gpt-3.5-turbo";
const encoder = encoding_for_model(model, {
"<|im_start|>": 100264,
"<|im_end|>": 100265,
"<|im_sep|>": 100266,
});
const msgSep = isGpt3 ? "\n" : "";
const roleSep = isGpt3 ? "\n" : "<|im_sep|>";
const serialized = [
messages
.map(({ name, role, content }) => {
return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
})
.join(msgSep),
"<|im_start|>assistant",
].join(msgSep);
return { encoder: encoder, encoded: encoder.encode(serialized, "all") };
}//getChatGPTEncoding
var str = "Correct the spelling and grammar\n\nShe no went to the market."
const messages = [{
role: "user",
name: "",
content: str
}];
const { encoder, encoded } = getChatGPTEncoding(messages, "gpt-3.5-turbo");
for (let token of encoded) {
var tokenDecoded = (new TextDecoder().decode(encoder.decode([token])));
console.log({ token, string: tokenDecoded })
}
It does seem like there was a possibly undocumented change in how tokens are actually counted on OpenAI's side of things (code obtained from openai/openai-cookbook).
Will investigate further to determine the approximate behaviour, but I believe they most likely added an additional roleSep after the last line.
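Concretely, for gpt-3.5-turbo this would mean the serialized prompt now ends with a trailing "\n" after the assistant priming, which is exactly one extra token (198, "\n") compared to the dump above:
// old: ...<|im_end|>\n<|im_start|>assistant     -> 20 tokens
// new: ...<|im_end|>\n<|im_start|>assistant\n   -> 21 tokens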
@loretoparisi Updated the code sample above, now it seems to match well 😄
@dqbd thanks, did you release an update on npm? On the app I now see 22 tokens, while in Node I am still getting 20 tokens with the function getChatGPTEncoding as defined before.
The version installed from npm was @dqbd/tiktoken@^1.0.2.
@dqbd According to the cookbook above, I did:
/**
 * Returns the number of tokens used by a list of messages.
 * @param {{ role: string, content: string, name: string }[]} messages
 * @param {string} model
 * @returns {{ encoder: Tiktoken, num_tokens: number }}
 */
function numTokensFromMessages(messages, model = "gpt-3.5-turbo-0301") {
var encoding;
try {
encoding = encoding_for_model(model)
} catch (err) {
console.log("Warning: model not found. Using cl100k_base encoding.")
encoding = get_encoding("cl100k_base")
}
if (model == "gpt-3.5-turbo") {
console.log("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
return numTokensFromMessages(messages, "gpt-3.5-turbo-0301")
}
if (model == "gpt-4") {
console.log("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
return numTokensFromMessages(messages, "gpt-4-0314")
}
var tokens_per_message, tokens_per_name; // declare these, otherwise they leak as globals
if (model == "gpt-3.5-turbo-0301") {
tokens_per_message = 4 // every message follows <|start|>{role/name}\n{content}<|end|>\n
tokens_per_name = -1 // if there's a name, the role is omitted
} else if (model == "gpt-4-0314") {
tokens_per_message = 3
tokens_per_name = 1
} else {
throw new Error(`num_tokens_from_messages() is not implemented for model ${model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.`)
}
var num_tokens = 0
for (const message of messages) {
num_tokens += tokens_per_message
Object.keys(message).forEach(key => {
var value = message[key];
var encoded = encoding.encode(value);
num_tokens += encoded.length
if (key == "name") {
num_tokens += tokens_per_name
}
});
}
num_tokens += 3 // every reply is primed with <|start|>assistant<|message|>
return { encoder: encoding, num_tokens: num_tokens };
}//numTokensFromMessages
and in fact the count is 20:
var str = "Correct the spelling and grammar\n\nShe no went to the market."
const messages = [{
role: "user",
name: "",
content: str
}];
const num_tokens = numTokensFromMessages(messages, "gpt-3.5-turbo")
Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.
// num_tokens=20
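The arithmetic behind the 20, following the function above and the token dump earlier in the thread: 4 (tokens_per_message) + 1 (role "user") + 0 (empty name) - 1 (tokens_per_name, since a name key is present) + 13 (content) + 3 (reply priming) = 20.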
while the API still returns 21! 😢
@loretoparisi There are a couple of issues as far as I can see:
1. Make sure you are escaping "\n" correctly in Tiktokenizer.
2. Are you sure you are using the updated Node.js snippet? It should be:
function getChatGPTEncoding(
messages: { role: string; content: string; name: string }[],
model: "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
) {
const isGpt3 = model === "gpt-3.5-turbo";
const encoder = encoding_for_model(model, {
"<|im_start|>": 100264,
"<|im_end|>": 100265,
"<|im_sep|>": 100266,
});
const msgSep = isGpt3 ? "\n" : "";
const roleSep = isGpt3 ? "\n" : "<|im_sep|>";
const serialized = [
messages
.map(({ name, role, content }) => {
return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
})
.join(msgSep),
`<|im_start|>assistant${roleSep}`,
].join(msgSep);
return encoder.encode(serialized, "all");
}
On my side of things, I'm getting 21 as the token count, both on Node and on Tiktokenizer, which should match the API behavior.
All right, it works fine! This means that the cookbook is not up-to-date with the current tokenizer (missing the roleSep, maybe?). Closing then, thank you very much for your help!
When using the OpenAI API with model "gpt-3.5-turbo-0301", for the prompt "Correct the spelling and grammar\n\nShe no went to the market." the usage reported in the API response is 21 prompt tokens, while the module gives me 14 tokens, as if it followed the "text-davinci-003" encoding, which in fact, when used in the API, gives me that number of tokens for that prompt.
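A minimal way to see the two encodings disagree (a sketch using @dqbd/tiktoken, assuming the standard tiktoken mapping where text-davinci-003 maps to p50k_base and gpt-3.5-turbo to cl100k_base):
const { get_encoding } = require("@dqbd/tiktoken");
const prompt = "Correct the spelling and grammar\n\nShe no went to the market.";
for (const name of ["p50k_base", "cl100k_base"]) {
  const enc = get_encoding(name);
  console.log(name, enc.encode(prompt).length); // the two counts differ
  enc.free(); // release the WASM-backed encoder
}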