Quansight / ragna

RAG orchestration framework ⛵️
https://ragna.chat
BSD 3-Clause "New" or "Revised" License

queueless Ragna? #204

Closed pmeier closed 9 months ago

pmeier commented 9 months ago

While responding to all the issues since the launch (thank you to everyone who took the time to get in touch!), it became somewhat evident that the queue backend is confusing and limiting.

In general, the task queue is used to scale deployments of Ragna. When auto-scaling worker nodes based on the queue length, that is all there is to do to load balance Ragna's backend while keeping the number of REST API nodes constant (of course, they also need to be auto-scaled at some point if you have a lot of concurrent users). Other options like a service mesh were a lot more complicated, given that we want one consistent REST API that the user can hit and extend with plugins.

Still, this doesn't answer why the queue is part of the Python API. Let me give you some historical context on how it came into being. In the very beginning, Ragna had no public Python API (#17). The REST API was the only real entrypoint for users. After I rewrote the project in #28, I failed to cleanly separate the two. Back then, the Python API even still had the database baked in (#65). I removed that in #67.

Meaning, right now, I can't come up with a good reason for why we have a task queue as part of the Python API. Not having it could potentially simplify the implementation and enable us to implement a few nice-to-have features, such as #185.

The only downside I can currently come up with is that parallelizing tasks in the Ragna API will be harder in the future. Right now, you can simply select a non-memory task queue, start a worker, and be done with it. Without a task queue, the user has to implement their own parallelization scheme. That being said, since the Python API is mostly meant for experimentation, I'm not sure if this is a strong argument.

I'm going to start a branch soon to remove the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.

nenb commented 9 months ago

thank you to everyone who took the time to get in touch!

In my experience, contributions and engagement are tightly linked to how open the core dev team is to feedback. The ragna team has been extremely receptive IMO (this issue being a perfect example). Thank you for sowing the seeds of a good culture! :)

I'm going to start a branch soon to remove the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.

Thanks for making this public. IMO it's the right approach to have two branches in parallel - there are so many 'breakthroughs' right now in Gen AI that it feels like a good idea to explore both paths for as long as possible.

That being said, I'm not fully sure I understand the rationale for removing the queue only from the Python API.

My current understanding is that a queue is useful for CPU-bound tasks in Python. The most common CPU-bound tasks that I am familiar with for RAG applications are i) loading and chunking the documents, ii) computing the embeddings, and iii) performing a vector-similarity or lexical search. There are other, less common (for now!) tasks that I haven't mentioned, such as re-ranking, but I'll ignore these. These three sub-tasks are currently (mostly) abstracted behind the prepare() interface in ragna. (Of course, all of these become I/O-bound tasks if they are computed on a server elsewhere, e.g. using OpenAI to compute the embeddings.)

For I/O-bound tasks and for an application that already uses async methods, e.g. a FastAPI web server, I don't really see the benefit of using a queue (at least for workloads that I have experienced in the past). When interacting with an external API like OpenAI, I don't think there is much benefit to waiting for a response on a worker as opposed to waiting for a response on the main thread. Disadvantages include issues like #185.

So, my point is: why restrict the queue experimentation branch only to the Python API? Surely the web server/API would also benefit from this. The key for me is distinguishing between I/O-bound and CPU-bound tasks, and then executing them in the right place, i.e. on the main thread or on a worker.

Specific Questions:

  1. Is there any benefit to having the interaction with the LLM take place via the queue/on a worker, rather than on the main thread? It prevents features like #185, and it's taking up resources idling on the queue.
  2. Is there some pattern/trick that could be adopted to only execute CPU-bound tasks on the queue, and everything else on the main thread? For example, FastAPI executes def functions on a separate thread (I think this is a reasonable analogy; a minimal sketch of this pattern follows below).
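To illustrate question 2, here is a minimal sketch of the FastAPI analogy. This is not Ragna code: the helper functions and the API URL are made up for illustration. CPU-bound work is pushed to a worker thread, while I/O-bound work is awaited on the event loop.

import asyncio

import httpx


def chunk_and_embed(document: str) -> list[list[float]]:
    # hypothetical CPU-bound work: chunking the document and computing embeddings locally
    return [[0.0] * 384 for _ in document.split("\n\n")]


async def call_llm(prompt: str) -> str:
    # hypothetical I/O-bound work: waiting on a remote LLM API
    async with httpx.AsyncClient() as client:
        response = await client.post("https://llm.example.com/answer", json={"prompt": prompt})
        return response.json()["answer"]


async def answer(document: str, prompt: str) -> str:
    # the CPU-bound step runs on a worker thread (what FastAPI does for plain `def` endpoints) ...
    embeddings = await asyncio.to_thread(chunk_and_embed, document)
    # ... while the I/O-bound step is simply awaited on the main event loop
    return await call_llm(f"{prompt} (retrieved over {len(embeddings)} chunks)")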
pmeier commented 9 months ago

@nenb You are correct that the task queue is mostly useful for CPU-bound tasks. But since we allow generic extensions, we cannot make any assumptions about the nature of the task. And since I/O-bound tasks still work correctly with a task queue, it is just the common denominator. Indeed, we lose nice-to-have features like #185, but it lowers the maintenance burden by only having one way to do everything.

  2. Is there some pattern/trick that could be adopted to only execute CPU-bound tasks on the queue, and everything else on the main thread? For example, FastAPI executes def functions on a separate thread (I think this is a reasonable analogy).

Yes, that could work, but we would need a bigger design for this. Let's say, for example, I have a CPU-bound SourceStorage.store, but an I/O-bound Assistant.answer. I could mark the former with def and the latter with async def. But what are we going to do for Chat.answer? Should we mark it sync and block on a mostly I/O-bound operation, or do we mark it async and accept the fact that users might be confused why awaiting the result blocks for a long time?
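One way such a dispatch could be implemented, purely as a sketch and not the actual Ragna implementation: let extension authors write either def or async def and decide at call time which execution path to take. The helper name is invented for illustration.

import asyncio
import inspect
from typing import Any, Callable


async def call_component(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    # async def components (e.g. an I/O-bound Assistant.answer) are awaited directly ...
    if inspect.iscoroutinefunction(fn):
        return await fn(*args, **kwargs)
    # ... while plain def components (e.g. a CPU-bound SourceStorage.store) run on a worker thread
    return await asyncio.to_thread(fn, *args, **kwargs)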

pmeier commented 9 months ago

Answering my own question here:

I could mark the former with def and the latter async def. But what are we going to do for Chat.answer?

Well, I guess we can have both as discussed in https://github.com/Quansight/ragna/issues/183#issuecomment-1804834416.
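For illustration only, "having both" could look roughly like the sketch below: an async core method plus a thin blocking convenience wrapper. This is just one possible shape, not necessarily what was settled on in #183.

import asyncio


class Chat:
    async def aanswer(self, prompt: str) -> str:
        # hypothetical async core; the real method would run the RAG pipeline
        await asyncio.sleep(0)  # placeholder for the actual I/O
        return f"answer to {prompt!r}"

    def answer(self, prompt: str) -> str:
        # blocking convenience wrapper for synchronous scripts
        # (note: asyncio.run cannot be used from inside an already running event loop)
        return asyncio.run(self.aanswer(prompt))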

pmeier commented 9 months ago

Just "vomiting" my thoughts here for later:

I think this can work :slightly_smiling_face:

pmeier commented 9 months ago

We had a lengthy internal discussion about this last night and came to the following conclusion: the task queue is a premature optimization for an architecture that has changed quite a bit since we started on Ragna. As such, it has very little going for it. In addition, it causes real issues for users. And even if the blocked features are just nice-to-haves from a functionality perspective, that doesn't mean users won't turn away because of them. Thus, our current plan is to remove the task queue completely and only revisit this later if the need arises.

@nenb I think the FastAPI analogy is pretty neat here. Sync functions will be run on a separate thread and async ones are just awaited in the main thread. That makes the Python API fully async and thus directly usable by the REST API as well.
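To make the "directly usable by the REST API" point concrete, here is a hedged sketch of what consuming a fully async Python API from a FastAPI endpoint could look like. The DummyChat class stands in for a real Ragna chat object and is not actual Ragna code.

import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class DummyChat:
    # stand-in for a fully async Ragna chat object
    async def answer(self, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for the real RAG pipeline
        return f"answer to {prompt!r}"


class AnswerRequest(BaseModel):
    prompt: str


@app.post("/answer")
async def answer(request: AnswerRequest) -> dict[str, str]:
    # no task queue involved: the endpoint simply awaits the async Python API
    chat = DummyChat()
    message = await chat.answer(request.prompt)
    return {"message": message}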

One thing that I worry about is the developer experience (DX) while debugging multithreaded code. Not sure if that is an issue, but I would prefer being able to use a debugger if possible. Still, this is not blocking for the design.

pmeier commented 9 months ago

Development for this is happening in #205.

nenb commented 9 months ago

@pmeier Thanks for the updates, this sounds great. I'll follow #205 with considerable interest, and will post here if I have any further comments.

PS: I do think that the task queue has a place for heavier workloads, or if someone is using ragna to orchestrate RAG across many concurrent users (which I believe is an intended use case). So it could still have a role to play in the future, especially as huey is so lightweight. I suspect you are aware of this already though!

ishaan-jaff commented 9 months ago

Hi @nenb @pmeier I noticed your team removed the LLM requests queue? I'm curious why?

I'm the maintainer of LiteLLM https://github.com/BerriAI/litellm and we have a queue implementation that handles 100+ requests/second for any LLM. Would it be helpful to use our request queue here? If not, I'd love your feedback on why not.

Here's the quick start on the litellm request queue

docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

  1. Add Redis credentials in a .env file
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
  2. Start the litellm server with your model config
$ litellm --config /path/to/config.yaml --use_queue

Here's an example config for gpt-3.5-turbo

config.yaml (This will load balance between OpenAI + Azure endpoints)

model_list: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: gpt-3.5-turbo
      api_key: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: azure/chatgpt-v-2 # actual model name
      api_key: 
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
  3. Test (in another window) → sends 100 simultaneous requests to the queue
$ litellm --test_async --num_requests 100


ishaan-jaff commented 9 months ago

I'm going to start a branch soon to remove the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.

@pmeier I'd love to help here - we implemented a queue for LLM requests to handle 100+ requests/second. If this is not a good enough solution, I'd love to know what's missing.

pmeier commented 9 months ago

@ishaan-jaff

I noticed your team removed the LLM requests queue?

We didn't remove an "LLM requests queue", but rather the task queue that handled everything inside the RAG workflow from extracting text, embedding it, retrieving it, and finally sending it to an LLM.

I'm curious why ?

You can read this thread for details. TL;DR: it was confusing to users and limited our ability to implement things like streaming API responses. I see that LiteLLM supports streaming, but that is likely only possible because you hard-depend on Redis as the queue. We didn't want to take a hard dependency on that since it is not pip-installable.

Since hitting an LLM API is heavily I/O bound, we just went with vanilla async execution instead.

Would it be helpful to use our request queue here - if not I'd love your feedback on why not ?

I don't want to have a mechanism only for the LLMs (assistants in Ragna lingo) unless absolutely necessary. Whatever abstraction we build should work for everything. Right now we have that.

What I think would be totally possible is to implement a class LiteLlmAssistant(ragna.core.Assistant). That way we are no longer bound to the requirement that the queue needs to handle more than just the LLMs. Still, this would be an extension, i.e. it would not land in Ragna core, since it requires extra infrastructure (a Redis server) that cannot be installed by pip. Happy to give pointers or help out if you want to give it a shot.
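For anyone who wants to give it a shot, here is a rough sketch of what such an extension could look like. The exact ragna.core.Assistant interface (method name, signature, Source attributes) is assumed here for illustration and may differ between Ragna versions; litellm.acompletion is LiteLLM's async completion call.

import litellm
from ragna.core import Assistant, Source


class LiteLlmAssistant(Assistant):
    """Hypothetical assistant that routes answers through LiteLLM."""

    # NOTE: the method name and signature are assumptions based on the Assistant
    # interface at the time of writing; adjust to the actual abstract base class.
    async def answer(self, prompt: str, sources: list[Source]) -> str:
        context = "\n\n".join(source.content for source in sources)
        response = await litellm.acompletion(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": f"Answer the question using only these sources:\n{context}"},
                {"role": "user", "content": prompt},
            ],
        )
        return response.choices[0].message.content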