dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

It doesn't support the new models "o1-mini" and "o1-preview" #120

Closed · Talented-Business closed 1 month ago

Talented-Business commented 2 months ago

Hi openai devs,

how can I count tokens for o1-preview and o1-mini?

Thanks in advance!

tmlxrd commented 1 month ago

Hi, I’m using the tiktoken library to count tokens for the gpt-4o-mini model. However, I’ve noticed a discrepancy between my token counts and the counts returned by the OpenAI API. It seems that tiktoken doesn’t fully support this new model yet, and the tokenization may differ slightly. Is there a plan to officially support gpt-4o-mini in tiktoken?

Thanks in advance!

tmlxrd commented 1 month ago

> Hi openai devs,
>
> how can I count tokens for o1-preview and o1-mini?
>
> Thanks in advance!

Here’s my example code:

```ts
import { encoding_for_model, TiktokenModel } from "tiktoken";

const countTokens = (
  messages: { role: string; content: string }[],
  model: TiktokenModel
): number => {
  const enc = encoding_for_model(model); // Tokenizer for the model
  let tokenCount = 0;

  // Iterate over each message and count tokens for 'role' and 'content'
  messages.forEach((message) => {
    tokenCount += enc.encode(message.role).length;    // Count role tokens
    tokenCount += enc.encode(message.content).length; // Count content tokens
  });

  enc.free(); // Release the WASM-backed encoder
  return tokenCount;
};

// instructions and userContent come from the surrounding application code
const messages = [
  { role: "system", content: instructions },
  { role: "user", content: userContent },
];

const model: TiktokenModel = "gpt-4o-mini";
const tokenCountInput = countTokens(messages, model);
```

dqbd commented 1 month ago

Hello! Will keep monitoring https://github.com/openai/tiktoken/issues/337 to see if there are any changes w.r.t. the underlying token map.

dqbd commented 1 month ago

@tmlxrd Just counting role and content is not necessarily enough. You also need to include the tokens used to separate the messages: see dqbd/tiktokenizer
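
For illustration, a minimal sketch of that overhead accounting, using the well-known heuristic from the OpenAI cookbook for cl100k-family chat models; the constants (roughly 3 scaffolding tokens per message plus 3 tokens priming the assistant reply) are an assumption here and can differ across models:

```ts
import { encoding_for_model, TiktokenModel } from "tiktoken";

// Heuristic constants from the OpenAI cookbook for cl100k-family chat
// models -- an assumption, not something this library guarantees.
const TOKENS_PER_MESSAGE = 3; // ChatML scaffolding around each message
const REPLY_PRIMING = 3;      // every reply starts with <|start|>assistant<|message|>

const countChatTokens = (
  messages: { role: string; content: string }[],
  model: TiktokenModel
): number => {
  const enc = encoding_for_model(model);
  let tokenCount = REPLY_PRIMING;
  for (const message of messages) {
    tokenCount += TOKENS_PER_MESSAGE;
    tokenCount += enc.encode(message.role).length;
    tokenCount += enc.encode(message.content).length;
  }
  enc.free();
  return tokenCount;
};
```

With the two messages in the example above, the overhead alone is 2 × 3 + 3 = 9 tokens, which would exactly account for the 1708 vs. 1717 gap reported below.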

tmlxrd commented 1 month ago

> @tmlxrd Just counting role and content is not necessarily enough. You also need to include the tokens used to separate the messages: see dqbd/tiktokenizer

Thank you for your answer! I count this way because I get a smaller number of tokens than OpenAI returns in the API response.

For a large text I counted 1708 input tokens, while OpenAI's response reported 1717. It's a small difference, but I don't understand where it comes from, which is why I added the two role strings to the count.

UPD: Thank you for the link to tiktokenizer. It works better now, but there are still discrepancies with the counts returned by OpenAI.

NuoJohnChen commented 1 month ago

Do 'o1-mini' and 'o1-preview' still use the cl100k_base vocabulary?

tmlxrd commented 1 month ago

> Do 'o1-mini' and 'o1-preview' still use the cl100k_base vocabulary?

Hi. Unfortunately, I don't know. Please share the answer if you find that information.
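
One way to check empirically: compare the prompt-token counts the API reports against both candidate vocabularies. A minimal sketch, assuming only the stable `get_encoding` API (which accepts both vocabulary names in current releases):

```ts
import { get_encoding } from "tiktoken";

// Sketch: tokenize the same probe text with both candidate vocabularies.
// If the prompt-token counts reported by the API consistently match one
// of them, that is a strong hint about which vocabulary the model uses.
const probe = "The quick brown fox jumps over the lazy dog.";

for (const name of ["cl100k_base", "o200k_base"] as const) {
  const enc = get_encoding(name);
  console.log(`${name}: ${enc.encode(probe).length} tokens`);
  enc.free();
}
```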

dqbd commented 1 month ago

Got clarification with the latest tiktoken@0.8.0 release; updating here as well.
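
For anyone on an older version of this package: upstream openai/tiktoken 0.8.0 maps the o1 models to the o200k_base vocabulary (the same one gpt-4o uses), so a stopgap sketch that bypasses `encoding_for_model` entirely is to grab the base encoding directly (the per-message overhead caveats above still apply):

```ts
import { get_encoding } from "tiktoken";

// Stopgap sketch: count o1-preview / o1-mini tokens via the base encoding,
// bypassing encoding_for_model on versions that predate the o1 mappings.
// Assumes the upstream o1 -> o200k_base mapping from tiktoken 0.8.0.
const countO1Tokens = (text: string): number => {
  const enc = get_encoding("o200k_base");
  const count = enc.encode(text).length;
  enc.free();
  return count;
};

console.log(countO1Tokens("How many tokens is this prompt?"));
```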