Here is the function I use to count tokens:
/**
 * Algorithm adapted from "6. Counting tokens for chat API calls" of
 * https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
 * @param {object[]} messages
 * @returns {number}
 */
static getTokenCountForMessages(messages) {
    // Get the encoding tokenizer
    const tokenizer = get_encoding('cl100k_base');
    // Map each message to the number of tokens it contains
    const messageTokenCounts = messages.map((message) => {
        // Map each property of the message to the number of tokens it contains
        const propertyTokenCounts = Object.entries(message).map(([key, value]) => {
            // Count the number of tokens in the property value
            const numTokens = tokenizer.encode(value).length;
            // Subtract 1 token if the property key is 'name'
            const adjustment = (key === 'name') ? 1 : 0;
            return numTokens - adjustment;
        });
        // Sum the number of tokens in all properties and add 4 for metadata
        return propertyTokenCounts.reduce((a, b) => a + b, 4);
    });
    // Sum the number of tokens in all messages and add 2 for metadata
    return messageTokenCounts.reduce((a, b) => a + b, 2);
}
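For reference, here is a minimal usage sketch, assuming messages are shaped the way the Chat Completions API expects (role/content pairs, with an optional name key); the ChatClient class name is hypothetical:

// Hypothetical class name; assumes the method above belongs to it.
const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!', name: 'example-user' },
];
const tokenCount = ChatClient.getTokenCountForMessages(messages);
console.log(`Prompt uses ${tokenCount} tokens.`);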
After running into this issue again, I tried exiting the script and resuming the conversation after starting it again. I gave it the exact same prompt, but the issue didn't happen again, so I'm not sure what's going on here. It doesn't seem to be related to the input string.
May I ask which Node.js version you are using? And does this issue clear itself after restarting the server? @waylaidwanderer
I'm on Node 16, but the issue occurred on Node 18 as well, IIRC. Restarting the script clears the issue temporarily.
One possible culprit could be the creation of a new Tiktoken instance every time getTokenCountForMessages is invoked. If possible, try to reuse the same Tiktoken instance obtained from get_encoding, and/or call tokenizer.free() after you're done computing.
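For example, a sketch of the free-after-use pattern, assuming the @dqbd/tiktoken bindings (where encoders are WASM-backed and must be released explicitly):

// Sketch: release the WASM-backed encoder when done (assumes @dqbd/tiktoken).
import { get_encoding } from '@dqbd/tiktoken';

function countTokens(text) {
    const tokenizer = get_encoding('cl100k_base');
    try {
        return tokenizer.encode(text).length;
    } finally {
        // free() releases the underlying WASM memory; without it,
        // repeatedly calling get_encoding() can leak memory over time.
        tokenizer.free();
    }
}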
I did refactor it later on so that get_encoding is only called once, which seems to have resolved the issue (though I haven't had a chance to test it thoroughly). I didn't know about tokenizer.free(), though, which I'll keep in mind for next time.
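Roughly what that refactor looks like, as a sketch under the same @dqbd/tiktoken assumption, with the encoder cached at module scope so get_encoding runs only once:

// Sketch of the refactor: one shared encoder instead of one per call.
import { get_encoding } from '@dqbd/tiktoken';

// Created once at module load and reused by every call.
const tokenizer = get_encoding('cl100k_base');

function getTokenCountForMessages(messages) {
    return messages.reduce((total, message) => {
        const messageTokens = Object.entries(message).reduce((sum, [key, value]) => {
            const numTokens = tokenizer.encode(value).length;
            // Subtract 1 token if the property key is 'name'
            return sum + numTokens - (key === 'name' ? 1 : 0);
        }, 4); // 4 tokens of per-message metadata
        return total + messageTokens;
    }, 2); // 2 tokens of per-conversation metadata
}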
Will close the issue for now; let me know if other issues arise.
I'm getting this error sometimes, but there aren't any unusual Unicode characters in the input or anything else out of the ordinary, just regular English text.
Trying to continue using it gives me this error: