SciPhi-AI / R2R

The Elasticsearch for RAG. Build, scale, and deploy state-of-the-art Retrieval-Augmented Generation applications
https://r2r-docs.sciphi.ai/

R2R ollama Docker GPU Support #770

Open GMGassner opened 1 month ago

GMGassner commented 1 month ago

Thank you so much for this project and your efforts to make GraphRAG accessible for the masses!

Is your feature request related to a problem? Please describe.

Systems with an appropriate GPU (or GPUs) might prefer to run the models of the local ollama Docker deployment with GPU support, and users can already do so by editing their compose file. However, implementing this in R2R directly could be a valuable addition for non-Mac users.

Describe the solution you'd like

Enable R2R to run its Docker containers with native CUDA GPU support by passing a GPU flag. This could be implemented with the --gpus=all flag, e.g.

docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

or, preferably, within the compose file, e.g.

networks:
  r2r-network:
    name: r2r-network

services:
  r2r:
    depends_on:
      ollama:
        condition: service_healthy

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - r2r-network
    healthcheck:
      test: ["CMD", "ollama", "ps"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

By setting count: all, you're allowing the container to access all available GPUs on your system. If you want to limit it to a specific number of GPUs, you can replace all with a number, like 2 to use only two GPUs.

From my understanding, this could also be exposed as a flag, e.g. use-gpu-all or use-gpu-n (where n is the number of GPUs), that passes the GPU argument through to the compose file or the docker commands.

Describe alternatives you've considered

Running on CPU and RAM, which is much slower and less efficient.

Thanks again and best wishes.

emrgnt-cmplxty commented 1 month ago

Thanks for taking the time to share that this is an available feature in ollama!

We can certainly add this.

Do you have any idea how ollama with GPU support compares with vLLM in throughput? Perhaps we'd be better off bundling vLLM for users with GPUs? vLLM is optimized for GPU use cases, and last I checked it offered significant speedups.

GMGassner commented 1 month ago

Thanks for the quick reply and also taking the time to address my feature request.

In my feature request I opted for ollama because of its simplicity and wide user base, which makes troubleshooting easy. But if you would consider putting more resources into GPU-enabling the Docker deployment, that would be amazing.

vLLM should be significantly faster and more efficient for local inference. I'm unsure how well embeddings work, since there seemed to be issues in the past.

Adding vLLM as a more efficient option, if it works with the required setup, would be an ideal next step. If there are temporary hurdles to getting vLLM working, a GPU-enabled ollama implementation would still be great, since it doesn't seem to interfere with anything else.

I could give vLLM a try if it would help the R2R community and you devs.

emrgnt-cmplxty commented 1 month ago

Please do so and let us know how it goes; it should be easy to use thanks to LiteLLM.
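
For reference, here is a minimal sketch (not R2R's actual integration) of how a model served by vLLM's OpenAI-compatible server could be called through LiteLLM; the model name, endpoint, and API key below are placeholder assumptions:

# Sketch: calling a vLLM OpenAI-compatible endpoint through LiteLLM.
# Assumes a vLLM server is already running locally with this model loaded;
# the model name, port, and api_key are placeholders, not R2R defaults.
import litellm

response = litellm.completion(
    model="openai/mistralai/Mistral-7B-Instruct-v0.2",  # "openai/" routes to an OpenAI-compatible endpoint
    api_base="http://localhost:8000/v1",                # vLLM's default OpenAI-compatible address
    api_key="EMPTY",                                    # vLLM does not require a real key by default
    messages=[{"role": "user", "content": "In one sentence, what is retrieval-augmented generation?"}],
)
print(response.choices[0].message.content)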

If you'd like to help author the PR to get vLLM into the codebase, we are happy to provide support. Otherwise, it is on our to-do list and we will get back to you when it is fully online =).

ralyodio commented 1 month ago

Can we use R2R outside of Docker on Arch?

underlines commented 1 month ago

vLLM just recently merged embedding support. From what I see, though, they basically only support LLM-based embedding generation, i.e. fine-tuned regular LLMs that take an instruction and then have the vector embeddings extracted from their output.

Advantage: they rank very high on the Massive Text Embedding Benchmark (MTEB) leaderboard, and to my understanding you can run them on any backend that can do regular LLM inference. The trade-off is that they take a ton of resources, since inference is as intensive as a regular LLM, often with a minimum of 2B+ parameters.

Usually we can reach almost-SOTA embedding quality by using SentenceTransformers: there is much wider support for a vast variety of embedding models through the SentenceTransformers interface, and those models are usually much smaller than LLM-based embedding models. They contain fewer than 1B parameters, take up much less RAM, and run reasonably efficiently on CPU, though still slowly.

Many of the most widely used embedding models have SentenceTransformer support:
https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
https://huggingface.co/WhereIsAI/UAE-Large-V1
https://huggingface.co/intfloat/multilingual-e5-large-instruct
https://huggingface.co/mixedbread-ai/mxbai-embed-2d-large-v1
https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
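
As an illustration of how lightweight this is, a minimal sketch of embedding a few chunks with one of the models above via the sentence-transformers package (the chunk texts are made up and the model choice is just an example):

# Sketch: generating embeddings with a SentenceTransformers model.
# Assumes `pip install sentence-transformers`; model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  # uses GPU if available, else CPU

chunks = [
    "R2R ingests documents and splits them into chunks.",
    "Each chunk is embedded and stored for retrieval.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # e.g. (2, 1024) for this model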

Even FlagEmbedding models have a SentenceTransformers implementation, though it doesn't support some of the cool features that FlagEmbedding offers: https://huggingface.co/BAAI/bge-base-en-v1.5

Sadly, afaik even ollama doesn't support regular SentenceTransformers models yet and only has support for a few embedding models.
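
For completeness, the handful of embedding models ollama does support (e.g. nomic-embed-text from the list above) can be queried like this; a sketch assuming the ollama Python client is installed and the model has already been pulled:

# Sketch: embeddings via ollama, assuming `ollama pull nomic-embed-text` has been
# run and the ollama server is reachable on its default port (11434).
import ollama

result = ollama.embeddings(model="nomic-embed-text", prompt="What does R2R do?")
print(len(result["embedding"]))  # dimensionality of the returned vector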

While going the vLLM route is quite cool for those of us who use R2R with enterprise clients (as I mainly do), supporting ollama first is maybe the better strategy: you would go from simple to complex. But consider choosing whatever supports SentenceTransformers now or in the future, as that would give you a broad range of embedding model support.