All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More

(feat) Configure fallback LLMs in case of rate limit errors #1263

Open dagelf opened 4 months ago

dagelf commented 4 months ago

What problem or use case are you trying to solve? If a rate limit is hit, it just gets stuck in a loop hammering the API.

Describe the UX of the solution you'd like: Show a temporary modal popup saying the rate limit has been hit, with an option to automatically switch to the next API in the list. The popup should go away automatically once the rate limit expires.

Additional context

opendevin:ERROR: agent_controller.py:110 - GroqException - Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama3-8b-8192` in organization `org_xxx` on tokens per minute (TPM): Limit 7500, Used 16673, Requested ~3658. Please try again in 1m42.653s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}

There's more to this too: we could keep track of tokens and avoid wasting expensive models on cheap operations. I will open another issue for that.
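
Purely as an illustration of the request, here is a hypothetical sketch of what a fallback list could look like in config.toml; the fallback_models key and its semantics are invented for this example and are not existing OpenDevin options:

```toml
# Hypothetical sketch only: fallback_models is an invented key, not a real option.
[llm]
model = "groq/llama3-8b-8192"
api_key = "..."

# Models to try, in order, whenever the active one returns HTTP 429.
fallback_models = [
  "anthropic/claude-3-haiku-20240307",
  "gpt-3.5-turbo",
]
```

On a 429 the agent would switch to the next entry and could return to the primary model once its rate limit window expires.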

dagelf commented 4 months ago

Also, it shouldn't treat each error as a new step, i.e. it needs to distinguish between errors from calling the completion API itself and errors resulting from the code being worked on.

barsuna commented 4 months ago

I've also hit this recently with the Anthropic API. They have requests/minute, tokens/minute, and requests/day limits, and OpenDevin quickly (within a minute) hit the tokens/minute limit. Since the rate is known in this case, perhaps we can support configuring these limits and sleeping before the API request until the rate is respected...
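
As a rough illustration of that idea (not OpenDevin code), here is a minimal Python sketch of a tokens-per-minute gate that sleeps before a request until the configured budget would be respected; the 7500 TPM default is an assumption and callers must supply their own token estimate:

```python
import time
from collections import deque

class TokensPerMinuteGate:
    """Sleep before a request until sending it would respect a TPM budget."""

    def __init__(self, tpm_limit: int = 7500):    # assumed budget, e.g. Groq's 7500 TPM
        self.tpm_limit = tpm_limit
        self.window = deque()                     # (timestamp, tokens) pairs from the last 60s

    def acquire(self, tokens_needed: int) -> None:
        while True:
            now = time.monotonic()
            # Drop usage records older than the 60-second window.
            while self.window and now - self.window[0][0] > 60:
                self.window.popleft()
            used = sum(tokens for _, tokens in self.window)
            if used + tokens_needed <= self.tpm_limit or not self.window:
                # Either we fit in the budget, or the request alone exceeds it
                # and waiting would not help; record it and let the API decide.
                self.window.append((now, tokens_needed))
                return
            # Sleep until the oldest record ages out of the window.
            time.sleep(60 - (now - self.window[0][0]) + 0.1)

# Usage: call gate.acquire(estimated_prompt_tokens) right before each completion request.
```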

enyst commented 4 months ago

@barsuna Currently you can configure a few options... whose documentation I cannot find anymore; maybe it got lost somehow, I will fix that. They are, in config.toml:

...
[llm]
num_retries=5
retry_min_wait=3
retry_max_wait=60

You can add them to the config.toml file and tweak them as you want. The minimum and maximum wait are in seconds; they control how long to wait once it hits the rate limit. You may want to make the minimum wait relatively high, unlike the default of 3 seconds, for example.
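
For readers wondering what these settings amount to, a minimal sketch (not the actual OpenDevin code) of a retry loop with capped exponential backoff driven by the same three values; the RateLimitError class below is a stand-in for whatever exception the provider raises on HTTP 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider/litellm exception raised on HTTP 429."""

def completion_with_retries(call, num_retries=5, retry_min_wait=3, retry_max_wait=60):
    """Retry call() on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(num_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == num_retries - 1:
                raise
            # Wait grows exponentially but stays within [retry_min_wait, retry_max_wait].
            wait = min(retry_max_wait, retry_min_wait * (2 ** attempt))
            time.sleep(wait + random.uniform(0, 1))  # small jitter
```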

barsuna commented 4 months ago

Thanks @enyst! I haven't found a way with litellm to shape outgoing call rates (maybe there is a way to make it act smartly on code 429 / 529?), so I have prototyped an external proxy that shapes the calls to respect the rate limits. So rate limiting is not an issue anymore, but the title says 'elegantly', so I guess that is still to be addressed: two proxies are difficult to call elegant.

INFO:     172.17.0.2:43720 - "POST /api/generate HTTP/1.1" 200 OK
2024-05-15T22:52:04.260810 waiting due to excessive TPM (20200)
2024-05-15T22:52:08.262181 excessive TPM cleared
2024-05-15T22:52:08.262292 calling https://api.anthropic.com/v1/messages
2024-05-15T22:52:40.647955 response code 200
tokens_in/out: 2430/924
INFO:     172.17.0.2:36076 - "POST /api/generate HTTP/1.1" 200 OK
2024-05-15T22:52:40.670185 calling https://api.anthropic.com/v1/messages
2024-05-15T22:52:48.975432 response code 200

As usual, this just revealed the next problem: tokens burn really fast. I need to get back to tuning some local model to produce something resembling Claude 3 / GPT-4.
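
For reference, a minimal sketch of this kind of shaping proxy, assuming a FastAPI/httpx stack; the endpoint path, TPM budget, crude token estimate, and headers are assumptions for illustration (not barsuna's actual code), and any translation between the incoming and Anthropic request formats is omitted:

```python
import asyncio
import os
import time

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
TPM_LIMIT = 20000                      # assumed tokens-per-minute budget
usage: list[tuple[float, int]] = []    # (timestamp, tokens) records from the last 60 seconds

async def wait_for_budget(tokens: int) -> None:
    """Block until sending `tokens` keeps the last minute under TPM_LIMIT."""
    global usage
    while True:
        now = time.monotonic()
        usage = [(t, n) for t, n in usage if now - t < 60]
        if sum(n for _, n in usage) + tokens <= TPM_LIMIT or not usage:
            usage.append((now, tokens))
            return
        print("waiting due to excessive TPM")
        await asyncio.sleep(1)

@app.post("/api/generate")
async def generate(request: Request) -> Response:
    body = await request.body()
    await wait_for_budget(len(body) // 4)   # very rough estimate: ~4 bytes per token
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(
            "https://api.anthropic.com/v1/messages",
            content=body,
            headers={
                "x-api-key": os.environ["ANTHROPIC_API_KEY"],
                "anthropic-version": "2023-06-01",
                "content-type": "application/json",
            },
        )
    return Response(content=upstream.content, status_code=upstream.status_code)
```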

enyst commented 4 months ago

@barsuna Interesting!

maybe there is a way to make it act smartly on code 429

That's what it's doing, and it isn't litellm, it's in opendevin. We have those configuration options you can use: on 429, it starts waiting, and you can set for how long. The default is a minimum of 3 seconds, which isn't useful with Anthropic; it's too low. I found that values like

retry_min_wait=20
num_retries=15

are better. It literally makes Claude usable at relatively low tiers, otherwise I couldn't get anything done.
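
Putting that together, an illustrative [llm] section with these more patient values (retry_max_wait left at its default here; tune to your own rate-limit tier):

```toml
[llm]
num_retries=15
retry_min_wait=20
retry_max_wait=60
```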

barsuna commented 3 months ago

got it now, thanks @enyst

0xWick commented 3 months ago

For me, it crashed on the message token limit error. The tool should handle this response from OpenAI so it doesn't crash and instead just shows a warning that the request exceeded the limit.

krism142 commented 2 months ago

(Quoting @enyst's earlier comment above about the num_retries, retry_min_wait, and retry_max_wait options in config.toml.)

Any idea if these can be modified when running the docker container? Or would I just need to include a config.toml file in my workspace directory?

barsuna commented 2 months ago

Apologies for the lag @krism142, I haven't gone this route to set rate limits, but according to https://docs.all-hands.dev/modules/usage/llms one could set them using environment variables, which can be passed when starting the container. That should help avoid persistence issues or the need to patch containers.

enyst commented 2 months ago

Yes, you can set them as environment variables, using -e just like the others are set in the docker command. They just need the uppercase variant then, like this: -e LLM_NUM_RETRIES=15 and -e LLM_RETRY_MIN_WAIT=20.
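
For example, assuming a typical docker run invocation (the image tag, port mapping, and model shown below are placeholders; keep whatever flags you normally use and just add the -e variables), it could look like this:

```bash
docker run -it \
    -e LLM_MODEL="anthropic/claude-3-5-sonnet-20240620" \
    -e LLM_API_KEY="..." \
    -e LLM_NUM_RETRIES=15 \
    -e LLM_RETRY_MIN_WAIT=20 \
    -e LLM_RETRY_MAX_WAIT=60 \
    -p 3000:3000 \
    ghcr.io/all-hands-ai/openhands:latest
```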

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

tobitege commented 1 week ago

Just for info: yesterday's merge of #3678 fixes the issue mentioned above by @0xWick, so that upon an error the agent no longer ends up broken (unrelated to the retry/timing values discussed here).

tobitege commented 1 week ago

This should be fixed now by #3729, closing this for now.

tobitege commented 1 week ago

Oops, I overlooked the detail about specifying fallback LLMs for this issue. Reopened it. :)