dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

cl100k_base issue #23

Closed · loretoparisi closed this 1 year ago

loretoparisi commented 1 year ago

When using the OpenAI API with the model "gpt-3.5-turbo-0301" and the prompt "Correct the spelling and grammar\n\nShe no went to the market.", I get this usage:

"usage": {
    "prompt_tokens": 21,
    "completion_tokens": 8,
    "total_tokens": 29
  },

while the module:

const { get_encoding, encoding_for_model }  = require("@dqbd/tiktoken");
const enc = get_encoding("cl100k_base");
const str = "Correct the spelling and grammar\n\nShe no went to the market."
const encoded = enc.encode(str);
for (let token of encoded) {
    var tokenDecoded = (new TextDecoder().decode(enc.decode([token])));
    console.log({ token, string: tokenDecoded })
}

gives me only 14 tokens:

{ token: 34192, string: 'Correct' }
{ token: 279, string: ' the' }
{ token: 43529, string: ' spelling' }
{ token: 323, string: ' and' }
{ token: 32528, string: ' grammar' }
{ token: 271, string: '\n\n' }
{ token: 8100, string: 'She' }
{ token: 912, string: ' no' }
{ token: 4024, string: ' went' }
{ token: 311, string: ' to' }
{ token: 279, string: ' the' }
{ token: 3157, string: ' market' }
{ token: 13, string: '.' }

as if it followed the "text-davinci-003" encoding, which in fact, when used in the API, gives me that number of prompt tokens for this prompt:

"usage": {
    "prompt_tokens": 14,
    "completion_tokens": 10,
    "total_tokens": 24
  }
dqbd commented 1 year ago

Hello @loretoparisi, when using the Chat Completion API, the (ChatML) message needs to be serialized before it is sent to the tokenizer. See https://tiktokenizer.vercel.app/ to understand how it (most likely) behaves.
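For a single user message, the serialized prompt would roughly look like this (a sketch based on my reading of Tiktokenizer, not a documented format):

const serialized =
  "<|im_start|>user\n" +                                             // role header
  "Correct the spelling and grammar\n\nShe no went to the market." + // message content
  "<|im_end|>\n" +                                                   // end of the user turn
  "<|im_start|>assistant\n";                                         // priming of the assistant reply

and it is this string that gets tokenized (with the <|im_start|>/<|im_end|> special tokens allowed), not the raw content.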

loretoparisi commented 1 year ago

So does this mean that I have to tokenize a serialized JSON message?

var str = "Correct the spelling and grammar\n\nShe no went to the market."

    const messageObj = [{
        role: "user",
        content: str
    }];
    str = JSON.stringify(messageObj)

If I look at your app, you do something like this:

const enc = get_encoding("cl100k_base", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
  // TODO: very hacky
  // "system name=": 900000,
  // "assistant name=": 900001,
  // "user name=": 900002,
});
const encoded = enc.encode(str, "all");

but in this case I'm getting 23 tokens, not 21!

dqbd commented 1 year ago

@loretoparisi Use the following code snippet extracted from Tiktokenizer (I will most likely include it directly in the library later).

note: updated as of 25/03/2023

function getChatGPTEncoding(
  messages: { role: string; content: string; name: string }[],
  model: "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
) {
  const isGpt3 = model === "gpt-3.5-turbo";

  const encoder = encoding_for_model(model, {
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  });

  const msgSep = isGpt3 ? "\n" : "";
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>";

  const serialized = [
    messages
      .map(({ name, role, content }) => {
        return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep);

  return encoder.encode(serialized, "all");
}
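As a usage sketch (a hypothetical call, reusing the prompt from the original post), the token count is just the length of the returned array:

const count = getChatGPTEncoding(
  [{ role: "user", name: "", content: "Correct the spelling and grammar\n\nShe no went to the market." }],
  "gpt-3.5-turbo"
).length;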
loretoparisi commented 1 year ago

Amazing!

{ token: 100264, string: '<|im_start|>' }
{ token: 882, string: 'user' }
{ token: 198, string: '\n' }
{ token: 34192, string: 'Correct' }
{ token: 279, string: ' the' }
{ token: 43529, string: ' spelling' }
{ token: 323, string: ' and' }
{ token: 32528, string: ' grammar' }
{ token: 271, string: '\n\n' }
{ token: 8100, string: 'She' }
{ token: 912, string: ' no' }
{ token: 4024, string: ' went' }
{ token: 311, string: ' to' }
{ token: 279, string: ' the' }
{ token: 3157, string: ' market' }
{ token: 13, string: '.' }
{ token: 100265, string: '<|im_end|>' }
{ token: 198, string: '\n' }
{ token: 100264, string: '<|im_start|>' }
{ token: 78191, string: 'assistant' }
chars:62 words:11 tokens:20 token/word ratio:1.818 char/token ratio:3.1

We are almost there! Maybe it's my fault, but now I get 20 tokens. I did:

    /**
     * @param {*} messages [{ role: string, content: string, name: string }]
     * @param {*} model "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
     * @returns { encoder, encoded }
     */
    function getChatGPTEncoding(
        messages = [],
        model = "gpt-3.5-turbo" // also accepts "gpt-4" and "gpt-4-32k"
    ) {
        const isGpt3 = model === "gpt-3.5-turbo";

        const encoder = encoding_for_model(model, {
            "<|im_start|>": 100264,
            "<|im_end|>": 100265,
            "<|im_sep|>": 100266,
        });

        const msgSep = isGpt3 ? "\n" : "";
        const roleSep = isGpt3 ? "\n" : "<|im_sep|>";

        const serialized = [
            messages
                .map(({ name, role, content }) => {
                    return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
                })
                .join(msgSep),
            "<|im_start|>assistant",
        ].join(msgSep);
        return { encoder: encoder, encoded: encoder.encode(serialized, "all") };
    }//getChatGPTEncoding

   var str = "Correct the spelling and grammar\n\nShe no went to the market."
    const messages = [{
        role: "user",
        name: "",
        content: str
    }];
    const { encoder, encoded } = getChatGPTEncoding(messages, "gpt-3.5-turbo");
    for (let token of encoded) {
        var tokenDecoded = (new TextDecoder().decode(encoder.decode([token])));
        console.log({ token, string: tokenDecoded })
    }
dqbd commented 1 year ago

It does seem like there was a possibly undocumented change regarding how tokens are actually counted on OpenAI's side of things. (Code obtained from openai/openai-cookbook)

[screenshot of the num_tokens_from_messages snippet from openai/openai-cookbook]

I will investigate further to determine the approximate behaviour, but I believe they most likely added an additional roleSep after the last line.
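If that is the case, the change (again, a guess at the internal format, not something documented) would amount to the reply priming going from

`<|im_start|>assistant`            // old priming: 20 tokens for the example prompt
`<|im_start|>assistant${roleSep}`  // updated priming: 21 tokens, matching the API usage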

@loretoparisi Updated the code sample above, now it seems to match well 😄

loretoparisi commented 1 year ago

@dqbd thanks, did you release an update on npm? In the app I now see 22 tokens:

[screenshot of the Tiktokenizer app showing 22 tokens]

while in Node I'm still getting 20 tokens with the getChatGPTEncoding function defined above. The version installed from npm was @dqbd/tiktoken@^1.0.2.

loretoparisi commented 1 year ago

@dqbd Following the cookbook snippet above, I did:

    /**
     * Returns the number of tokens used by a list of messages.
     * @param {*} messages
     * @param {*} model
     * @returns { encoder, num_tokens }
     */
    function numTokensFromMessages(messages, model = "gpt-3.5-turbo-0301") {
        var encoding;
        try {
            encoding = encoding_for_model(model);
        } catch (e) {
            console.log("Warning: model not found. Using cl100k_base encoding.");
            encoding = get_encoding("cl100k_base");
        }
        var tokens_per_message, tokens_per_name;
        if (model == "gpt-3.5-turbo") {
            console.log("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.");
            return numTokensFromMessages(messages, "gpt-3.5-turbo-0301");
        } else if (model == "gpt-4") {
            console.log("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.");
            return numTokensFromMessages(messages, "gpt-4-0314");
        } else if (model == "gpt-3.5-turbo-0301") {
            tokens_per_message = 4;  // every message follows <|start|>{role/name}\n{content}<|end|>\n
            tokens_per_name = -1;    // if there's a name, the role is omitted
        } else if (model == "gpt-4-0314") {
            tokens_per_message = 3;
            tokens_per_name = 1;
        } else {
            throw new Error(`num_tokens_from_messages() is not implemented for model ${model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.`);
        }
        var num_tokens = 0;
        for (const message of messages) {
            num_tokens += tokens_per_message;
            Object.keys(message).forEach(key => {
                var value = message[key];
                var encoded = encoding.encode(value);
                num_tokens += encoded.length;
                if (key == "name") {
                    num_tokens += tokens_per_name;
                }
            });
        }
        num_tokens += 3;  // every reply is primed with <|start|>assistant<|message|>
        return { encoder: encoding, num_tokens: num_tokens };
    }//numTokensFromMessages

and in fact the count is 20:

var str = "Correct the spelling and grammar\n\nShe no went to the market."
    const messages = [{
        role: "user",
        name: "",
        content: str
    }];
const num_tokens = numTokensFromMessages(messages, "gpt-3.5-turbo")
Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.
// num_tokens=20

while the API still says 21! 😢

dqbd commented 1 year ago

@loretoparisi There are various issues as far as I can see:

  1. Make sure you are escaping "\n" correctly in Tiktokenizer; it should look something like this:

    [screenshot of the prompt entered in Tiktokenizer with correctly escaped newlines]
  2. Are you sure you are using the updated Node.js snippet? It should be

function getChatGPTEncoding(
  messages: { role: string; content: string; name: string }[],
  model: "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
) {
  const isGpt3 = model === "gpt-3.5-turbo";

  const encoder = encoding_for_model(model, {
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  });

  const msgSep = isGpt3 ? "\n" : "";
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>";

  const serialized = [
    messages
      .map(({ name, role, content }) => {
        return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep);

  return encoder.encode(serialized, "all");
}

On my side of things, I'm getting 21 as the token count, both on Node and on Tiktokenizer, which should match the API behavior.
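For reference, a rough breakdown of where the 21 comes from for this prompt (assuming the serialization produced by the snippet above and reusing the decoded token list posted earlier):

// <|im_start|> + "user" + "\n"       ->  3 tokens
// message content                    -> 13 tokens (see the decoded list above)
// <|im_end|> + "\n"                   ->  2 tokens
// <|im_start|> + "assistant" + "\n"  ->  3 tokens
// total                                 21 tokens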

loretoparisi commented 1 year ago

All right, it works fine! This means the cookbook is not up to date with the current tokenizer (missing the roleSep, maybe?). Closing then, thank you very much for your help!