Closed by pmeier 9 months ago
> thank you everyone that took the time to get in touch!
In my experience, contributions and engagement are tightly linked to how open the core dev team is to feedback. The `ragna` team has been extremely receptive IMO (this issue being a perfect example), thank you for sowing the seeds of a good culture! :)
I'm going to start a branch soon to remove the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.
Thanks for making this public. IMO It's the right approach to have two branches in parallel - there are so many 'breakthroughs' right now in Gen AI that it feels like a good idea to explore both paths in parallel for as long as possible.
That being said, I'm not fully sure I understand the rationale for removing the queue only from the Python API.
My current understanding is that a queue is useful for CPU-bound tasks in Python. The most common CPU-bound tasks that I am familiar with for RAG applications are (i) loading and chunking the documents, (ii) computing the embeddings, and (iii) performing a vector-similarity or lexical search. There are other, less common (for now!) tasks that I haven't mentioned, such as re-ranking, but I'll ignore these. These three sub-tasks are currently (mostly) abstracted behind the `prepare()` interface in `ragna`. (Of course, all of these become I/O-bound tasks if they are computed on a server elsewhere, e.g. using OpenAI to compute the embeddings.)
For I/O-bound tasks and for an application that already uses async methods, e.g. a FastAPI web server, I don't really see the benefit of using a queue (at least for workloads that I have experienced in the past). When interacting with an external API like OpenAI, I don't think there is much benefit to waiting for a response on a worker, as opposed to waiting for a response on the main thread. Disadvantages include issues like #185.
So, my point is: why restrict the queue-experimentation branch only to the Python API? Surely the web server/API would also benefit from this. The key for me is distinguishing between I/O-bound and CPU-bound tasks, and then executing them in the right place, i.e. on the main thread or on a worker.
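To make the I/O-bound point concrete, here is a minimal sketch (not Ragna code; `call_llm_api` is a made-up stand-in for an external API call) of how an async server already overlaps many in-flight requests on the main thread, with no worker queue involved:

```python
import asyncio

async def call_llm_api(prompt: str) -> str:
    # stands in for awaiting an external service such as the OpenAI API
    await asyncio.sleep(0.01)
    return f"answer to {prompt!r}"

async def main() -> list[str]:
    # 100 concurrent "requests" overlap while awaiting; no workers needed
    return await asyncio.gather(*(call_llm_api(f"q{i}") for i in range(100)))

answers = asyncio.run(main())
```

Because all the waiting happens inside `await`, total wall time stays close to a single call rather than 100 sequential ones.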
Specific Questions:
- Is there some pattern/trick that could be adopted to only execute CPU-bound tasks on the queue, and everything else on the main thread? For example, FastAPI executes `def` functions on a separate thread (I think this is a reasonable analogy).

@nenb You are correct that the task queue is mostly useful for CPU-bound tasks. But since we allow generic extensions, we cannot make any assumptions about the nature of the task. And since I/O-bound tasks still work correctly with a task queue, this is just the common denominator. Indeed we lose nice-to-have features like #185, but having only one way to do everything lowers the maintenance burden.
Yes, that could work, but we would need a bigger design for this. Let's say, for example, I have a CPU-bound `SourceStorage.store`, but an I/O-bound `Assistant.answer`. I could mark the former with `def` and the latter with `async def`. But what are we going to do for `Chat.answer`? Should we mark it sync and block over a mostly I/O-bound operation, or do we mark it async and accept the fact that users might be confused why awaiting the result blocks for a long time?
Answering my own question here:

> I could mark the former with `def` and the latter with `async def`. But what are we going to do for `Chat.answer`?
Well, I guess we can have both as discussed in https://github.com/Quansight/ragna/issues/183#issuecomment-1804834416.
Just "vomiting" my thoughts here for later:
`def answer` and `async def aanswer`. We leave it to the user to select the right method for the given components.
Meaning, if we have a CPU bound task and the user chooses the async endpoint, there won't be any async behavior and the execution will just block until the result is computed. On the flip side, if a user selects the sync endpoint, but the task is async, we still block and thus lose the async advantage.
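A minimal sketch of that dual-method idea (the names `answer`/`aanswer` follow the comment above; this is not Ragna's actual implementation): the async method holds the single implementation, and the sync method is a blocking convenience wrapper.

```python
import asyncio

class Chat:
    async def aanswer(self, prompt: str) -> str:
        # single async implementation; a CPU-bound component would still
        # block the event loop here unless it is offloaded to a thread
        await asyncio.sleep(0.01)  # stands in for awaiting an LLM API
        return f"answer to {prompt}"

    def answer(self, prompt: str) -> str:
        # sync convenience wrapper: blocks until the async core finishes
        return asyncio.run(self.aanswer(prompt))

result = Chat().answer("hello")
```

This illustrates the trade-off described above: the sync entry point always blocks, and the async entry point only pays off when the underlying components are genuinely I/O-bound.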
Maybe we can find a way to warn in these suboptimal cases, but I don't want this to get annoying. We can't automate it.

I think this can work :slightly_smiling_face:
We had a lengthy internal discussion about this last night and came to the following conclusion: the task queue is a premature optimization for an architecture that has changed quite a bit since we started on Ragna. As such, it has very little going for it. In addition, it causes real issues for users. And even if those issues are just nice-to-haves from a functionality perspective, that doesn't mean users won't turn away because of them. Thus, our current plan is to remove the task queue completely and only revisit this later if the need arises.
@nenb I think the FastAPI analogy is pretty neat here. Sync functions will be run on a separate thread and async ones are just awaited in the main thread. That makes the Python API fully async and thus directly usable by the REST API as well.
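The analogy above can be sketched like this (an illustration of the dispatch rule only, not Ragna's code; the component functions are made-up stand-ins):

```python
import asyncio
import inspect
import time

def sync_store(chunk: str) -> str:
    # stands in for a CPU-bound component such as a SourceStorage.store
    time.sleep(0.01)
    return f"stored {chunk}"

async def async_answer(prompt: str) -> str:
    # stands in for an I/O-bound component such as an Assistant.answer
    await asyncio.sleep(0.01)
    return f"answer to {prompt}"

async def call_component(fn, *args):
    if inspect.iscoroutinefunction(fn):
        return await fn(*args)  # async: awaited directly on the main thread
    # sync: offloaded to a worker thread, like FastAPI's `def` routes
    return await asyncio.to_thread(fn, *args)

async def main():
    return await asyncio.gather(
        call_component(sync_store, "doc"),
        call_component(async_answer, "hi"),
    )

results = asyncio.run(main())
```

With this rule, the public API can be fully async while CPU-bound components no longer block the event loop.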
One thing that I worry about is DX while debugging multithreaded code. Not sure if that is an issue, but I would prefer being able to use a debugger if possible. Still, this is not blocking for the design.
Development for this is happening in #205.
@pmeier Thanks for the updates, this sounds great. I'll follow #205 with considerable interest, and will post here if I have any further comments.
PS I do think that the task queue has a place for heavier workloads, or if someone is using `ragna` to orchestrate RAG across many concurrent users (which I think is an intended use case). So, I do think it could still have a role to play in the future, especially as `huey` is so lightweight. I think you are aware of this already though!
Hi @nenb @pmeier, I noticed your team removed the LLM requests queue? I'm curious why.
I'm the maintainer of LiteLLM (https://github.com/BerriAI/litellm) and we have a queue implementation that handles 100+ requests/second for any LLM. Would it be helpful to use our request queue here? If not, I'd love your feedback on why not.
Here's the quick start on the LiteLLM request queue (docs: https://docs.litellm.ai/docs/routing#queuing-beta):
```shell
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default"           # [OPTIONAL] if self-hosted

$ litellm --config /path/to/config.yaml --use_queue
```
Here's an example `config.yaml` for `gpt-3.5-turbo` (this will load balance between OpenAI + Azure endpoints):

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2 # actual model name
      api_key:
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
```
```shell
$ litellm --test_async --num_requests 100
```
- `/queue/request` - Queues a `/chat/completions` request. Returns a job id.
- `/queue/response/{id}` - Returns the status of a job. If completed, returns the response as well. Potential statuses are `queued` and `finished`.

> I'm going to start a branch soon to remove the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.

@pmeier I'd love to help here - we implemented a queue for LLM requests to handle 100+ requests/second. If this is not a good enough solution, I'd love to know what's missing.
@ishaan-jaff
> I noticed your team removed the LLM requests queue?
We didn't remove an "LLM requests queue", but rather the task queue that handled everything inside the RAG workflow from extracting text, embedding it, retrieving it, and finally sending it to an LLM.
> I'm curious why?
You can read this thread for details. TL;DR: it was confusing to users and limited our ability to implement things like streaming API responses. I see that LiteLLM supports streaming, but that is likely only possible because you hard-depend on Redis as the queue. We didn't want to take a hard dependency on that, since it is not `pip install`able.
Since hitting an LLM API is heavily I/O bound, we just went vanilla async execution instead.
> Would it be helpful to use our request queue here - if not I'd love your feedback on why not?
I don't want to have a mechanism only for the LLMs (assistants in Ragna lingo) unless absolutely necessary. Whatever abstraction we build should work for everything. Right now we have that.
What I think would be totally possible is to implement a `class LiteLlmAssistant(ragna.core.Assistant)`. That way we are no longer bound to the requirement that the queue needs to handle more than just the LLMs. Still, this would be an extension, i.e. not land in Ragna core, since it needs extra infrastructure (`redis-server`) that cannot be installed by `pip`. Happy to give pointers or help out if you want to give it a shot.
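For illustration, such an extension could look roughly like the sketch below. Note that the base class and `Source` here are simplified stand-ins, not the real `ragna.core` classes, whose signatures may differ; the actual network call to LiteLLM is stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Source:
    """Stand-in for a retrieved source; the real class lives in ragna.core."""
    content: str

class Assistant:
    """Stand-in for ragna.core.Assistant; real method signatures may differ."""
    def answer(self, prompt: str, sources: list[Source]) -> str:
        raise NotImplementedError

class LiteLlmAssistant(Assistant):
    """Hypothetical extension that would delegate to a LiteLLM deployment."""
    def answer(self, prompt: str, sources: list[Source]) -> str:
        context = "\n".join(source.content for source in sources)
        # A real implementation would POST prompt + context to the LiteLLM
        # /queue/request endpoint and poll /queue/response/{id} until the
        # job status is "finished". Stubbed out here.
        return f"answer to {prompt} ({len(context)} chars of context)"

reply = LiteLlmAssistant().answer("hi", [Source("chunk one"), Source("chunk two")])
```

Keeping this as a subclass confines the Redis requirement to the one component that wants it, instead of making it a core dependency.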
While responding to all issues since the launch (thank you everyone that took the time to get in touch!), it became somewhat evident that the queue backend is confusing/limiting.
In general, the task queue is used to scale deployments of Ragna. When auto-scaling worker nodes based on queue length, that is all there is to do to load balance Ragna's backend, while keeping the REST API nodes constant (of course, they also need to be auto-scaled at some point if you have a lot of concurrent users). Other options like a service mesh were a lot more complicated given that we want one consistent REST API the user can hit and extend with plugins.
Still, this doesn't answer why the queue is part of the Python API. Let me give you some historical context on how it came into being. In the very beginning, Ragna had no public Python API (#17). The REST API was the only real entrypoint for users. After I rewrote the project in #28, I failed to cleanly separate the two. Back then, the Python API even still had the database baked in (#65). I removed that in #67.
Meaning, right now, I can't come up with a good reason why we have a task queue as part of the Python API. Not having it could potentially simplify the implementation and enable us to implement a few nice-to-have features, such as #185.
The only downside I can currently come up with is that parallelizing tasks in the Python API will be harder in the future. Right now, you can simply select a non-memory task queue, start a worker, and be done with it. Without a task queue, users have to implement their own parallelization scheme. That being said, since the Python API is mostly meant for experimentation, I'm not sure this is a strong argument.
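For illustration, without the built-in queue, a user could still parallelize e.g. document preparation themselves with standard-library tools (`prepare_document` here is a hypothetical stand-in, not a Ragna function):

```python
from concurrent.futures import ThreadPoolExecutor

def prepare_document(path: str) -> str:
    # imagine chunking + embedding a document here
    return f"prepared {path}"

# fan the work out over a small thread pool, no task queue required
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(prepare_document, ["a.pdf", "b.pdf", "c.pdf"]))
```

For truly CPU-bound work, `ProcessPoolExecutor` would be the drop-in alternative to sidestep the GIL.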
I'm going to start a branch soon to remove the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.