dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

What does free actually do? #72

Open disbelief opened 9 months ago

disbelief commented 9 months ago

Hi, thanks for the javascript port of tiktoken it's been extremely helpful.

I'm trying to figure out what free actually does and more importantly: when it should be used. I assume it performs some teardown and releases any memory reserved by the encoder instance, but I can't find any documentation or source code.

My use case is in an AWS Lambda environment, so I was wondering: is it fine to reuse the same encoder instance across invocations rather than instantiating a new one on every request and freeing it afterwards?

This would reduce the overhead of setting up a new encoder every request. Is that advisable or is it better to just instantiate and free every time an encoder is used?
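
For reference, the reuse pattern I have in mind looks roughly like this (just a sketch; the handler shape and the `event.text` field are placeholders for my actual code):

```js
import { encoding_for_model } from "@dqbd/tiktoken";

// Created once per Lambda container, reused across warm invocations
const enc = encoding_for_model("gpt-3.5-turbo");

export const handler = async (event) => {
  // event.text is a hypothetical input field, for illustration only
  const tokenCount = enc.encode(event.text ?? "").length;
  return { tokenCount };
};

// Note: no free() anywhere — the instance is meant to live as long as the container
```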

danny-avila commented 9 months ago

@disbelief I agree we need more guidance, as I'm not entirely sure how it works either. I've only come to an answer based on my own usage. Maybe @dqbd can shed some light here and correct me.

I recommend calling the free method even if you are re-using the encoder instance. Failing to do so, at least on my end, leads to weird behavior that has crashed my node process when re-using the same instance over an extended period. If you are not re-using it, it may or may not get garbage collected (I believe it should), so it seems safe to call free in that case as well.

We can see that free is invoked after all tests: https://github.com/dqbd/tiktoken/blob/072dd12962cabeca67c5088e3d8a8d006af19482/js/test/compatibility.test.ts#L9

And we are reminded to call it after the encoder is not used: https://github.com/dqbd/tiktoken/blob/072dd12962cabeca67c5088e3d8a8d006af19482/README.md?plain=1#L42

The tiktokenizer playground frees the encoder after every request, which is also interesting to note: https://github.com/search?q=repo%3Adqbd%2Ftiktokenizer%20free&type=code

I found a somewhat related comment, but it is a bit confusing because it implies you shouldn't be creating new instances repeatedly 'and/or' that free should be called:

> One possible culprit could be the creation of a new Tiktoken instance every time getTokenCountForMessages is invoked. If possible, try to reuse the same Tiktoken instance obtained from get_encoding and/or call tokenizer.free() after you're done with computing.

source: https://github.com/dqbd/tiktoken/issues/8#issuecomment-1459063969

I found that re-creating instances repeatedly is mainly an issue because it slows down encoding throughput.

So TL;DR, based on my usage, here's what I've done to count tokens in my app without issues and without much slowdown: Create an encoder instance and re-use it up to 25 encodings, then call free() and create a fresh instance.

This resetting behavior has the added benefit of 'capping' resource usage: a series of encoding requests with no buffer between them can be resource-intensive, and the intermittent reset forces a re-initialization pause before too many are processed at once. The bottleneck this introduces is not much of a blocker, since it stays within reason of OpenAI rate limits presuming you are counting tokens before making generation requests, but adjust as necessary for your scale or use case.
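
In code, the pattern looks roughly like this (a sketch of my approach; countTokens and the threshold constant are my own naming, not part of the library):

```js
import { encoding_for_model } from "@dqbd/tiktoken";

const MAX_USES = 25; // the threshold that worked for me; tune for your workload
let enc = encoding_for_model("gpt-3.5-turbo");
let uses = 0;

function countTokens(text) {
  if (uses >= MAX_USES) {
    enc.free(); // release the WASM-side memory
    enc = encoding_for_model("gpt-3.5-turbo"); // re-initialize a fresh encoder
    uses = 0;
  }
  uses += 1;
  return enc.encode(text).length;
}
```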

disbelief commented 9 months ago

Thanks @danny-avila, your experience here is much appreciated. I hope @dqbd can shed some more light, but in the meantime I'll add some extra instrumentation to keep an eye on things.

dqbd commented 9 months ago

Hello @disbelief and @danny-avila!

Sorry for the delay; hopefully I can shed some more light on free(). The tiktoken package (and the @dqbd/tiktoken package, for that matter) uses a WASM binary, which handles memory differently than JS. In the WASM world, the user/developer is responsible for the allocation and, more importantly, the deallocation of objects once we're done using them, whereas in the JS world the garbage collector does the heavy lifting for us, clearing objects when they are no longer used.

In the case of tiktoken, every invocation of new Tiktoken() etc. allocates memory for an encoder which, if not cleared, will take up the available memory until OOM exceptions occur or the encoder fails to encode.
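
In other words, every constructed encoder should eventually be paired with a free() call, e.g. (a minimal sketch):

```js
import { get_encoding } from "@dqbd/tiktoken";

const enc = get_encoding("cl100k_base");
try {
  const tokens = enc.encode("hello world");
  console.log(tokens.length);
} finally {
  enc.free(); // without this, the WASM-side allocation is never reclaimed
}
```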

However, the issue and the workaround of batching 25 requests do seem to point to a memory leak of sorts; I will investigate further. Thanks for flagging, @danny-avila!

b0o commented 9 months ago

@dqbd thanks for the info. To further clarify, it should be okay to use a single instance of Tiktoken for the lifetime of a server, right?

dqbd commented 9 months ago

Hey @b0o! Yep, that should be the case. Any issues arising from that would be considered a bug.

danny-avila commented 9 months ago

Hi @dqbd, thanks for your response. So are you saying free should only be called when we mean to get rid of the instance?

In this case here, are you only freeing the encoder when a new one is selected? https://github.com/dqbd/tiktokenizer/blob/cea57c454f38001a91873c944cee6a9b8e2a0610/src/pages/index.tsx#L78

Thanks again, would just like added clarity.

bhavesh-chaudhari commented 8 months ago

> Hi @dqbd, thanks for your response. So are you saying free should only be called when we mean to get rid of the instance?
>
> In this case here, are you only freeing the encoder when a new one is selected? https://github.com/dqbd/tiktokenizer/blob/cea57c454f38001a91873c944cee6a9b8e2a0610/src/pages/index.tsx#L78
>
> Thanks again, would just like added clarity.

Hi @danny-avila, thanks for your comment above. Have you figured out when and how to use free() properly? After going through your comment, I am just wondering if using encoding.encode(my_input_text) too many times will cause my server to crash.

I'm using this library in my express.js API to compute the number of tokens in some input text, assisting me in trimming the text efficiently. For every API request, I have to execute encoding.encode(my_input_text) approximately 30-40 times.

```js
import { encoding_for_model } from "@dqbd/tiktoken";

// Create the encoder
const enc = encoding_for_model("gpt-3.5-turbo");

function countTokens(text) {
    return enc.encode(text).length;
}

// This function is invoked when an API route is called
const trimDataForGPT = (summary) => {

    // This loop runs 10+ times
    for (let i = 0; i < summary.length; i++) {
        // ...
        const estimatedContentTokens = countTokens(someText);
        // ...
    }

    // This loop runs 30+ times
    for (let i = 0; i < combinedStrings.length; i++) {
        // ...
        const estimatedExplanationTokens = countTokens(someText);
        // ...
    }

    return combinedStrings.join("\n");
};
```

So as you quoted above "Create an encoder instance and re-use it up to 25 encodings", do I need to do this?

danny-avila commented 8 months ago

> Hi @danny-avila, thanks for your comment above. Have you figured out when and how to use free() properly? After going through your comment, I am just wondering if using encoding.encode(my_input_text) too many times will cause my server to crash.
>
> I'm using this library in my express.js API to compute the number of tokens in some input text, assisting me in trimming the text efficiently. For every API request, I have to execute encoding.encode(my_input_text) approximately 30-40 times.

Your server can crash just from using too many resources, and encoding is resource-intensive. Calling free every X encodes (batching) will block the thread and "slow down" the series of requests, which may help prevent spikes in your resource usage.

Here's a snippet of my test script where I figured this out:

```js
  for (let i = 0; i < iterations; i++) {
    try {
      console.log(`Iteration ${i}`);
      const client = new OpenAIClient(apiKey, clientOptions);

      client.getTokenCount(text); // uses this tiktoken library
      // const encoder = client.constructor.getTokenizer('cl100k_base');
      // console.log(`Iteration ${i}: call encode()...`);
      // encoder.encode(text, 'all');
      // encoder.free();

      const memoryUsageDuringLoop = process.memoryUsage().heapUsed;
      const percentageUsed = (memoryUsageDuringLoop / maxMemory) * 100;
      printProgressBar(percentageUsed);

      if (i === iterations - 1) {
        console.log(' done');
        // encoder.free();
      }
    } catch (e) {
      console.log(`caught error! in Iteration ${i}`);
      console.log(e);
    }
  }
```

> So as you quoted above "Create an encoder instance and re-use it up to 25 encodings", do I need to do this?

Whether you need to do this will depend on your own tests.

You can find my full test here: https://github.com/danny-avila/LibreChat/blob/b3aac97710ab9680046eb8089c5fcd4456bd2988/api/app/clients/specs/OpenAIClient.tokens.js

bhavesh-chaudhari commented 8 months ago

Thank you for the response @danny-avila. That's an interesting test you've written. I will run a similar test for my case to check whether things work as expected.