bosun-ai / swiftide

Fast, streaming indexing and query library for AI (RAG) applications, written in Rust
https://swiftide.rs
MIT License

Ergonomic way of dealing with LLM rate limits #142

Open tinco opened 2 months ago

tinco commented 2 months ago

Is your feature request related to a problem? Please describe.
Transformers that use LLM calls can overload their endpoints, resulting in errors like these:

2024-07-10T17:47:10.993661Z  WARN ingestion_pipeline.run:transformers.metadata_qa_code:prompt: async_openai::client: Rate limited: Rate limit reached for gpt-3.5-turbo in organization org-Gna8CW74JAUnoOFeI6Ivvn03 on tokens per min (TPM): Limit 80000, Used 79866, Requested 258. Please try again in 93ms. Visit https://platform.openai.com/account/rate-limits to learn more.

Describe the solution you'd like
The LLM client needs to maintain a connection pool and apply adequate backpressure so that the pipeline does not overload the LLM endpoint.

Describe alternatives you've considered
Reducing concurrency or adding sleeps would be suboptimal and not adaptive to changes in rate limits or hardware.

Additional context
After about 45 minutes of hammering OpenAI, they close the connection:

2024-07-10T17:47:14.520091Z DEBUG ingestion_pipeline.run{total_nodes=283}:transformers.metadata_qa_text:prompt:Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(19999), flags: (0x4: END_HEADERS) }
2024-07-10T17:47:14.520282Z DEBUG ingestion_pipeline.run{total_nodes=283}:transformers.metadata_qa_text:prompt:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(19999), flags: (0x1: END_STREAM) }
2024-07-10T17:47:14.520366Z DEBUG ingestion_pipeline.run:transformers.metadata_qa_code:prompt: hyper_util::client::legacy::pool: reuse idle connection for ("https", api.openai.com)
2024-07-10T17:47:14.521003Z DEBUG ingestion_pipeline.run{total_nodes=283}:transformers.metadata_qa_text:prompt:Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(20001), flags: (0x4: END_HEADERS) }
2024-07-10T17:47:14.521151Z DEBUG ingestion_pipeline.run{total_nodes=283}:transformers.metadata_qa_text:prompt:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(20001), flags: (0x1: END_STREAM) }
2024-07-10T17:47:14.532274Z DEBUG ingestion_pipeline.run{total_nodes=283}:transformers.metadata_qa_text:prompt:Connection{peer=Client}: h2::codec::framed_read: received frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(19999) }
2024-07-10T17:47:14.532655Z DEBUG ingestion_pipeline.run:transformers.metadata_qa_code:prompt: hyper_util::client::legacy::pool: reuse idle connection for ("https", api.openai.com)
2024-07-10T17:47:14.532940Z ERROR ingestion_pipeline.run:transformers.metadata_qa_code:prompt: swiftide::integrations::openai::simple_prompt: error=http error: error sending request for url (https://api.openai.com/v1/chat/completions)
2024-07-10T17:47:14.534130Z DEBUG ingestion_pipeline.run{total_nodes=283}:transformers.metadata_qa_text:prompt:Connection{peer=Client}: h2::codec::framed_write: send frame=Reset { stream_id: StreamId(19999), error_code: CANCEL }
thread 'main' panicked at src/main.rs:37:10:
Could not load documentation: http error: error sending request for url (https://api.openai.com/v1/chat/completions)
timonv commented 2 months ago

Agreed, it would be nice to make this more flexible. async_openai already implements backoff: https://docs.rs/async-openai/latest/async_openai/struct.Client.html#method.with_backoff.

We could certainly provide a nicer API for it. With gpt-3.5-turbo, unless you're on the highest tier, it's probably a given that the backoff needs to be tuned.
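For reference, a minimal sketch of what using that hook could look like (this is not swiftide's current API; it assumes the swiftide OpenAI integration could be handed a preconfigured async_openai client):

```rust
use async_openai::Client;
use backoff::ExponentialBackoff;

fn main() {
    // async_openai retries rate-limited requests according to the supplied
    // backoff policy before surfacing an error.
    let _client = Client::new().with_backoff(ExponentialBackoff::default());

    // Wiring this into swiftide would require the OpenAI integration to accept
    // a preconfigured async_openai client; exposing that is the "nicer API".
}
```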

timonv commented 2 months ago

@tinco backoff has a builder https://docs.rs/backoff/0.4.0/backoff/exponential/struct.ExponentialBackoffBuilder.html
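A tuned policy could look something like this (a sketch; the numbers are illustrative, not recommendations):

```rust
use std::time::Duration;

use async_openai::Client;
use backoff::ExponentialBackoffBuilder;

fn main() {
    // Illustrative values; the right settings depend on the account's tier limits.
    let backoff = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(500))
        .with_multiplier(2.0)
        .with_max_interval(Duration::from_secs(30))
        // Give up after five minutes of retrying instead of retrying forever.
        .with_max_elapsed_time(Some(Duration::from_secs(300)))
        .build();

    let _client = Client::new().with_backoff(backoff);
}
```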

timonv commented 2 months ago

A challenge with any rate-limiting pattern is that there is also a limit on tokens, so something like tiktoken would need to be used to estimate the cost of each request, and this would differ per LLM. There's https://github.com/boinkor-net/governor/tree/master/governor however, so a lot of the hard basics of rate limiting could be skipped.
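A rough sketch of what a token-aware limiter on top of governor could look like (not an existing swiftide API; the use of tiktoken-rs's cl100k_base tokenizer and the 80,000 TPM figure from the log above are assumptions):

```rust
use std::num::NonZeroU32;

use governor::{Quota, RateLimiter};
use tiktoken_rs::cl100k_base;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Budget of 80_000 tokens per minute, matching the TPM limit in the log.
    let tpm = NonZeroU32::new(80_000).unwrap();
    let limiter = RateLimiter::direct(Quota::per_minute(tpm));

    let bpe = cl100k_base()?;
    let prompt = "Summarize this chunk of code ...";

    // Estimate the cost of this request in tokens; the tokenizer differs per LLM.
    let cost = NonZeroU32::new(bpe.encode_with_special_tokens(prompt).len() as u32)
        .unwrap_or(NonZeroU32::new(1).unwrap());

    // Wait until the token budget allows this request, then send it.
    limiter
        .until_n_ready(cost)
        .await
        .expect("request larger than the per-minute token budget");
    // ... perform the actual completion call here ...

    Ok(())
}
```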