brainlid / langchain

Elixir implementation of an AI focused LangChain style framework.
https://hexdocs.pm/langchain/

How to figure out rate limits? #31

Closed Calamari closed 4 months ago

Calamari commented 11 months ago

For OpenAI, they specify rate limits here. They add fields to the response headers that show how many tokens are still left. To build something that respects those limits and retries after the limit has been reset, it would be great to have those available in the response somehow. I quickly searched the code but could not find anything. Is there currently a way to handle this?

brainlid commented 11 months ago

Hi @Calamari, no, there is currently no way to surface the rate information or the current token usage. The token limit is a general thing that applies to all LLMs and should be implemented. The token limit is also per model.

I haven't looked into this much at this point.

Calamari commented 11 months ago

I am not quite sure about models other than OpenAI's, but wouldn't a relatively easy solution be to add a struct containing either the raw response or the headers of that response as a fourth element of the result tuple of LLMChain's run method? I could probably conjure up a PR if that is a way you think is viable.
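
Roughly what I imagine (purely a sketch; the four-element tuple and the `meta` value are hypothetical, not the current API):

    # Hypothetical: today run does not return a fourth element; `meta` here
    # stands for a struct carrying the raw response headers.
    case LLMChain.run(chain) do
      {:ok, _chain, _response, meta} ->
        IO.inspect(meta.headers["x-ratelimit-remaining-tokens"])

      {:error, _chain, reason, _meta} ->
        {:error, reason}
    end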

brainlid commented 11 months ago

Here's what I mean by the model limits varying. Heads up: I'm mostly talking out loud here as I think through it too.

Here are the details on ChatGPT's 3.5 models: notice they are 4K or 16K context, with one legacy 8K model.

Then the ChatGPT 4 models are 8K or 32K.

The cost for using the larger limit models is higher too.

The idea of LangChain is to abstract away some of the differences between models so that a config change swaps us to a different one. That makes the token limit information relative; it's more about "how many tokens do I have left?"

I'd like to review how this is managed in the JS or Python LangChain too, since they've had more time to think about it and what's actually helpful.

Calamari commented 11 months ago

I was also thinking along the lines of the meta info of "how many tokens are left and when do they reset". At least for ChatGPT, they say they provide that info as header parameters in the response. I think it would make sense to somehow pass this through to the caller as well, so they can put some form of rate limiting in place. As far as I can see, right now, if you make a call that brings you over the limit, you don't even get to know when it would reset.

brainlid commented 11 months ago

I looked into the JS version and they don't have anything documented, at least. The Python version's docs are much more complete here, and I like their approach.

https://python.langchain.com/docs/modules/model_io/models/llms/token_usage_tracking

The caller can provide a callback to get that information. In an Elixir world, passing in an anonymous callback function could be all that's needed. Then after a call to the LLM, the callback fires with the information in a struct format.

Here's the example of the Python result information:

    Tokens Used: 42
        Prompt Tokens: 4
        Completion Tokens: 38
    Successful Requests: 1
    Total Cost (USD): $0.00084
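
Translating that idea to Elixir, a callback-based report might look roughly like this (the `TokenUsage` struct and the `callback` option are hypothetical, just to show the shape):

    # Hypothetical sketch: neither TokenUsage nor a callback option to run
    # exists in the library today; this only illustrates the shape.
    defmodule TokenUsage do
      defstruct prompt_tokens: 0, completion_tokens: 0, total_tokens: 0
    end

    report_usage = fn %TokenUsage{} = usage ->
      IO.puts("Tokens used: #{usage.total_tokens} " <>
                "(prompt: #{usage.prompt_tokens}, completion: #{usage.completion_tokens})")
    end

    {:ok, _chain, _response} = LLMChain.run(chain, callback: report_usage)
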
brainlid commented 11 months ago

But this still doesn't tell me what I want to know, which is, "given the model I'm using, how many tokens do I have left?"

That is left up to me, the caller, to figure out.

Calamari commented 11 months ago

A callback sounds like a nice idea. Looking at the API docs, at least for OpenAI this information about how many tokens are left is returned in the headers as x-ratelimit-remaining-tokens, along with x-ratelimit-reset-requests, so one can know when to schedule the next call if the remaining tokens are too low. That is information we could surface through that callback.
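
For illustration, pulling those two values out of the response headers could look something like this (just a sketch; the header names come from OpenAI's docs, the module itself is made up):

    # Sketch only: extract OpenAI's rate-limit headers from a header list
    # like [{"x-ratelimit-remaining-tokens", "149000"}, ...].
    defmodule RateLimitInfo do
      defstruct remaining_tokens: nil, reset_requests: nil

      def from_headers(headers) do
        lookup = Map.new(headers, fn {key, value} -> {String.downcase(key), value} end)

        %__MODULE__{
          remaining_tokens: lookup["x-ratelimit-remaining-tokens"],
          reset_requests: lookup["x-ratelimit-reset-requests"]
        }
      end
    end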

brainlid commented 11 months ago

There are two different types of limits being talked about here.

  1. The max tokens allowed for the conversation (what the callback output above covers)
  2. The rate-limit tokens (what your API link talks about)

The rate-limit tokens are separate and cover the number of tokens per minute that the caller's account is allowed to use. That count and limit reset based on time. The first one is a fixed count based on the conversation size.

The time-based rate limits are something a server might want to track so it can enforce its own limits on its users across requests. The size-based limits are hard limits, and those force the need to summarize or start new conversations.
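
To make the time-based case concrete, a server-side throttle could be as simple as this (illustrative only; it assumes the reset header has already been parsed into milliseconds):

    # Illustrative only: sleep out the time-based limit when the remaining
    # token budget reported by the API is too low for the next request.
    defmodule Throttle do
      def maybe_wait(remaining_tokens, reset_after_ms, needed_tokens) do
        if remaining_tokens < needed_tokens do
          Process.sleep(reset_after_ms)
        end

        :ok
      end
    end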

Calamari commented 11 months ago

Yes, I am currently interested in the time-based rate limits. Since there is no way to track them right now, the server cannot throttle anything and doesn't know when to retry.

krrishdholakia commented 10 months ago

Hey @Calamari, I'm the maintainer of LiteLLM. Our Router (used for load balancing across different openai/azure/etc. endpoints) uses time-based limits as a way to time out and retry requests: https://github.com/BerriAI/litellm/blob/9b5f52ae635594aeba3cb6f2a3f81dd3da03e169/litellm/router.py#L190

Let me know how our implementation can be improved. Attaching sample code for a quick start below.

    import asyncio
    import os

    from litellm import Router

    model_list = [{  # list of model deployments
        "model_name": "gpt-3.5-turbo",  # model alias
        "litellm_params": {  # params for litellm completion/embedding call
            "model": "azure/chatgpt-v-2",  # actual model name
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    }, {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-functioncalling",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    }, {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    }]

    router = Router(model_list=model_list)

    async def main():
        # openai.ChatCompletion.create replacement
        response = await router.acompletion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
        )
        print(response)

    asyncio.run(main())

brainlid commented 8 months ago

@Calamari Just a quick follow-up. I'm not sure how to best support this feature. I'm also thinking of Bumblebee-based LLMs; I've been in talks with that team about getting token counts from those as well. Just letting you know that I'm tracking this, but not actively working on implementing it myself at this time.