Closed: Elijas closed this 7 months ago
Hi,
The max_tokens parameter does not affect the functionality of the package; it is only used when calculating the number of tokens in the request. That counting algorithm was taken from a notebook published earlier by OpenAI, and the value you pass depends on how large you expect the output to be.
Please note that I haven't updated the package for months, so OpenAI may have changed how this is calculated, but the package can still be helpful for limiting requests, just not precisely.
So I suggest you first test whether the package behaves as you expect :)
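If it helps, the estimate is roughly this shape (a minimal sketch in the spirit of that notebook, not the package's exact code; the per-message overhead constants here are assumptions and vary by model):

import tiktoken

def estimate_request_tokens(messages, max_tokens, model="gpt-3.5-turbo"):
    # Count the prompt tokens roughly the way OpenAI's notebook does,
    # then add max_tokens as the worst-case size of the completion.
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # assumed per-message overhead (role, separators)
    num_tokens = 3          # assumed priming tokens for the assistant reply
    for message in messages:
        num_tokens += tokens_per_message
        for value in message.values():
            num_tokens += len(encoding.encode(value))
    return num_tokens + max_tokens

# e.g. estimate_request_tokens([{"role": "user", "content": "Hello!"}], max_tokens=200)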
Thanks for a quick reply!
I see, so from what I understand max_tokens is basically "guess how many output tokens the model will generate and this will count towards the token limit".
So for example, if OpenAI is limited to 200 output tokens, then setting max_tokens to 200 would account for the worst possible case.
Ideally, after the generation is complete, that value (200) would be replaced in the limiter with the actual output token count, I suppose. I know that OpenAI also gives "used tokens" and "available tokens left" in the response headers, but I'm not sure whether other LLM vendors do this too, so tracking the tokens on the client side allows for more portable code.
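For example (assuming the openai Python SDK v1+, where with_raw_response exposes the headers; the x-ratelimit-* header names are OpenAI-specific and other vendors may not send them), reading both could look roughly like this:

from openai import OpenAI

client = OpenAI()

# Raw response access, so both the parsed completion and the headers are available
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=200,
)
completion = raw.parse()

# Portable across vendors: read the output size from the usage object (or count with tiktoken)
actual_output_tokens = completion.usage.completion_tokens

# OpenAI-specific: remaining budget as reported by the server
remaining_tokens = raw.headers.get("x-ratelimit-remaining-tokens")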
Either way, thanks for the library; it provides a quick way to get started 👍 🚀
Redis has incrby. This means that such an approach should theoretically be possible: after the request completes, adjust the counter by the actual - reserved amount of tokens. If the value is positive, then additional tokens are taken into account. If the value is negative, then unused tokens are freed. E.g. (a very rough hacky example):
reserved_tokens = 300
with limiter.limit(messages=..., max_tokens=reserved_tokens):
    response = ...  # make request

# Adjust reserved tokens based on actual consumption
actual_output_tokens = get_output_token_count(response)
adjustment = actual_output_tokens - reserved_tokens
await limiter.redis.incrby(f"{limiter.model_name}_api_tokens", adjustment)
Of course, this solution has plenty of unaccounted-for edge cases; it was just a back-of-a-napkin example.
--
just a few raw thoughts regarding the limiter 👍
I will add this for future reference. The following draft code seems to work great so far; it only adjusts the counter to the actual consumption if the key hasn't expired yet (as a transaction, to avoid race conditions).
import tiktoken

reserved_tokens = 300
with limiter.limit(messages=..., max_tokens=reserved_tokens):
    response = ...  # make request

result = response.choices[0].message.content
actual_tokens = len(tiktoken.get_encoding("cl100k_base").encode(result))
adjustment = actual_tokens - reserved_tokens
used_tokens = await self.incr_if_exists(adjustment)
async def incr_if_exists(self, adjustment: int) -> int | None:
    # Atomically adjust the token counter, but only if the key still exists
    # (i.e. the current rate-limit window hasn't expired yet).
    lua_script = """
    if redis.call('exists', KEYS[1]) == 1 then
        return redis.call('incrby', KEYS[1], ARGV[1])
    else
        return nil
    end
    """
    key = f"{self._limiter.model_name}_api_tokens"
    return await self._limiter.redis.eval(lua_script, 1, key, str(adjustment))
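One note on why the exists check is there: a plain incrby on a key whose window has already expired would recreate the key with no TTL, so the adjustment would leak into the next window. Doing the check and the increment inside a single Lua script keeps them atomic, which is the race condition mentioned above.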
(Sorry for the code quality, it was a very quick-n-dirty draft.)
Hello,
Firstly, I noticed a discrepancy in the max_tokens value—15 in the source code versus 175 in the documentation examples. Could you advise on how this value is derived?
Thank you for your support.