tinco opened this issue 2 months ago
Agreed, it would be nice to make this more flexible. async_openai already implements backoff: https://docs.rs/async-openai/latest/async_openai/struct.Client.html#method.with_backoff.
We could certainly provide a nicer API for it. With GPT-3.5 it's probably a given that the backoff needs to be tuned, unless you're on the highest rate-limit tier.
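A minimal sketch of what wiring this up looks like today, assuming `with_backoff` accepts a `backoff::ExponentialBackoff` as the linked docs suggest:

```rust
use async_openai::Client;
use backoff::ExponentialBackoff;

fn main() {
    // Attach the default exponential backoff policy; async_openai will then
    // retry rate-limited requests on its own.
    let client = Client::new().with_backoff(ExponentialBackoff::default());
    let _ = client;
}
```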
@tinco backoff has a builder: https://docs.rs/backoff/0.4.0/backoff/exponential/struct.ExponentialBackoffBuilder.html
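So tuning could look roughly like the sketch below; the concrete durations are placeholders for illustration, not recommendations:

```rust
use backoff::ExponentialBackoffBuilder;
use std::time::Duration;

fn main() {
    // Tune the retry policy via the builder before handing it to the client.
    let policy = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(500)) // first retry after ~500ms
        .with_multiplier(2.0)                              // double the wait each retry
        .with_max_interval(Duration::from_secs(30))        // cap individual waits
        .with_max_elapsed_time(Some(Duration::from_secs(300))) // give up after 5 minutes
        .build();
    // e.g. Client::new().with_backoff(policy);
    let _ = policy;
}
```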
A challenge with any rate-limit pattern is that there is also a limit on tokens, so something like tiktoken would need to be used per request, and this would differ per LLM. There's https://github.com/boinkor-net/governor/tree/master/governor however, so a lot of the hard basics of rate limiting could be skipped.
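To illustrate, a rough sketch of a token-aware limiter on top of governor, using its direct rate limiter and `until_n_ready`. The 90,000 tokens-per-minute quota and the per-request token estimate are made-up numbers; in practice the estimate would come from tiktoken (or whatever tokenizer matches the model), and it assumes a tokio runtime:

```rust
use std::num::NonZeroU32;
use governor::{Quota, RateLimiter};

#[tokio::main]
async fn main() {
    // Hypothetical TPM budget; real limits vary per model and account tier.
    let tpm = NonZeroU32::new(90_000).unwrap();
    let limiter = RateLimiter::direct(Quota::per_minute(tpm));

    // Estimated token cost of the next request (e.g. counted with tiktoken).
    let request_tokens = NonZeroU32::new(1_200).unwrap();

    // Wait until the bucket has room for this many tokens before sending.
    limiter.until_n_ready(request_tokens).await.unwrap();
    // ... issue the LLM call here ...
}
```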
Is your feature request related to a problem? Please describe.
Transformers that make LLM calls can overload their endpoints, resulting in errors like these:
Describe the solution you'd like
The LLM client needs to maintain a connection pool and apply adequate backpressure so that the pipeline does not overload the LLM endpoint.
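As a sketch of the backpressure half, a tokio `Semaphore` capping the number of in-flight requests would already keep the pipeline from flooding the endpoint; the cap of 8 here is an arbitrary illustration:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Cap concurrent LLM requests at 8 (arbitrary value for illustration).
    let permits = Arc::new(Semaphore::new(8));

    let mut handles = Vec::new();
    for i in 0..32 {
        let permits = permits.clone();
        handles.push(tokio::spawn(async move {
            // Acquire a permit before calling the endpoint; tasks beyond the
            // cap wait here, which propagates backpressure up the pipeline.
            let _permit = permits.acquire_owned().await.unwrap();
            // ... call the LLM endpoint for item `i` here ...
            let _ = i;
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
}
```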
Describe alternatives you've considered
Reducing concurrency or adding sleeps would be suboptimal and would not adapt to changes in rate limits or hardware.
Additional context
After about 45 minutes of hammering OpenAI, they close the connection: