dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

TypeError: The encoded data was not valid for encoding utf-8 #8

Closed waylaidwanderer closed 1 year ago

waylaidwanderer commented 1 year ago
TypeError: The encoded data was not valid for encoding utf-8
    at TextDecoder.decode (node:internal/encoding:433:16)
    at getStringFromWasm0 (/home/joel/projects/node-chatgpt-api/node_modules/@dqbd/tiktoken/dist/node/_tiktoken.js:108:30)
    at module.exports.__wbindgen_error_new (/home/joel/projects/node-chatgpt-api/node_modules/@dqbd/tiktoken/dist/node/_tiktoken.js:414:27)
    at wasm://wasm/00fdfe62:wasm-function[29]:0x2b7f4
    at wasm://wasm/00fdfe62:wasm-function[171]:0x5403e
    at Tiktoken.encode (/home/joel/projects/node-chatgpt-api/node_modules/@dqbd/tiktoken/dist/node/_tiktoken.js:268:18)
    at file:///home/joel/projects/node-chatgpt-api/src/ChatGPTClient.js:446:45
    at Array.map (<anonymous>)
    at file:///home/joel/projects/node-chatgpt-api/src/ChatGPTClient.js:444:65
    at Array.map (<anonymous>) {
  code: 'ERR_ENCODING_INVALID_ENCODED_DATA'
}

I'm getting this error intermittently, even though the input is just regular English text with no unusual Unicode characters or anything else out of the ordinary.

Trying to continue using it gives me this error:

Error: Invalid encoding
    at module.exports.__wbindgen_error_new (/home/joel/projects/node-chatgpt-api/node_modules/@dqbd/tiktoken/dist/node/_tiktoken.js:414:17)
    at wasm://wasm/00fdfe62:wasm-function[30]:0x2bd1f
    at wasm://wasm/00fdfe62:wasm-function[235]:0x5bafb
    at module.exports.get_encoding (/home/joel/projects/node-chatgpt-api/node_modules/@dqbd/tiktoken/dist/node/_tiktoken.js:160:14)
    at ChatGPTClient.getTokenCountForMessages (file:///home/joel/projects/node-chatgpt-api/src/ChatGPTClient.js:439:27)
    at ChatGPTClient.buildPrompt (file:///home/joel/projects/node-chatgpt-api/src/ChatGPTClient.js:343:50)
    at ChatGPTClient.sendMessage (file:///home/joel/projects/node-chatgpt-api/src/ChatGPTClient.js:237:34)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.<anonymous> (file:///home/joel/projects/node-chatgpt-api/bin/server.js:113:18)
waylaidwanderer commented 1 year ago

Here is my function used to count tokens:

    /**
     * Algorithm adapted from "6. Counting tokens for chat API calls" of
     * https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
     * @param {*[]} messages
     */
    static getTokenCountForMessages(messages) {
        // Get the encoding tokenizer
        const tokenizer = get_encoding('cl100k_base');

        // Map each message to the number of tokens it contains
        const messageTokenCounts = messages.map((message) => {
            // Map each property of the message to the number of tokens it contains
            const propertyTokenCounts = Object.entries(message).map(([key, value]) => {
                // Count the number of tokens in the property value
                const numTokens = tokenizer.encode(value).length;

                // Subtract 1 token if the property key is 'name'
                const adjustment = (key === 'name') ? 1 : 0;
                return numTokens - adjustment;
            });

            // Sum the number of tokens in all properties and add 4 for metadata
            return propertyTokenCounts.reduce((a, b) => a + b, 4);
        });

        // Sum the number of tokens in all messages and add 2 for metadata
        return messageTokenCounts.reduce((a, b) => a + b, 2);
    }
waylaidwanderer commented 1 year ago

After running into this issue again, I tried exiting the script and resuming the conversation after starting the script again. I gave it the exact same prompt but the issue didn't happen again, so I'm not sure what's going on here. Doesn't seem like it's something to do with the input string.

dqbd commented 1 year ago

May I ask what Node.js version you are using? And does this issue clear itself after restarting the server? @waylaidwanderer

waylaidwanderer commented 1 year ago

I'm on Node 16, but the issue occurred on Node 18 as well, IIRC. Restarting the script clears the issue temporarily.

dqbd commented 1 year ago

One possible culprit could be the creation of a new Tiktoken instance every time getTokenCountForMessages is invoked. If possible, try to reuse the same Tiktoken instance obtained from get_encoding, and/or call tokenizer.free() after you're done computing.
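The instance-reuse suggestion can be sketched as caching the result of an expensive factory at module scope so every call shares it. `memoizeInstance` here is an illustrative helper, not part of @dqbd/tiktoken's API:

```javascript
// Cache a single instance from an expensive factory so repeated
// calls reuse it instead of constructing a new one each time.
function memoizeInstance(factory) {
    let instance = null;
    return () => {
        if (instance === null) {
            instance = factory();
        }
        return instance;
    };
}

// With @dqbd/tiktoken this might look like:
// const getTokenizer = memoizeInstance(() => get_encoding('cl100k_base'));
// ...then call getTokenizer() wherever a tokenizer is needed.
```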

waylaidwanderer commented 1 year ago

I did refactor it later on so that get_encoding is only called once, which seems to have resolved the issue (though I haven't had a chance to test it thoroughly). I didn't know about tokenizer.free() though which I'll keep in mind for next time.
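For the free() side of the fix, a common pattern for WASM-backed resources is acquire/use/release in a try/finally. This is a hedged sketch: `makeTokenizer` stands in for `() => get_encoding('cl100k_base')`, and `withTokenizer` is an illustrative helper, not part of @dqbd/tiktoken:

```javascript
// Run fn with a freshly created tokenizer, then release it,
// even if fn throws. The tokenizer only needs encode() and free().
function withTokenizer(makeTokenizer, fn) {
    const tokenizer = makeTokenizer();
    try {
        return fn(tokenizer);
    } finally {
        // Always release the WASM-side memory.
        if (typeof tokenizer.free === 'function') {
            tokenizer.free();
        }
    }
}

// Usage sketch:
// const count = withTokenizer(
//     () => get_encoding('cl100k_base'),
//     (t) => t.encode(text).length,
// );
```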

dqbd commented 1 year ago

Will close the issue for now, lmk if other issues arise