SciPhi-AI / R2R

The most advanced Retrieval-Augmented Generation (RAG) system, containerized and RESTful
https://r2r-docs.sciphi.ai/

Add support for LMStudio + MLX for maximal speed and efficiency on Apple Silicon #1495

Open AriaShishegaran opened 3 weeks ago

AriaShishegaran commented 3 weeks ago

Is your feature request related to a problem? Please describe. With the recent major improvements to LM Studio, including the headless mode, it is now a powerful alternative to Ollama with a lot of appealing features, including native MLX support for Apple Silicon devices, which offers huge inference-time improvements for local models.

Based on my tests, I achieved a 300-500% speed improvement compared to running the same model (Llama3.2-3B) through Ollama.

NolanTrem commented 3 weeks ago

I've been thinking about this as well… It seems that this is possible with LiteLLM as is, but it would be great if they had a full integration.

https://github.com/BerriAI/litellm/issues/3755

Part of the reason we wouldn't look to support this natively in R2R (similar to how we no longer support Ollama directly, and instead route through LiteLLM) is that we don't consider these integrations a core part of our infrastructure. Rather than focusing on maintaining integrations, we would look to contribute to LiteLLM and other dependencies of ours.

I'll play around with this over the weekend and will add some information into the docs. Let us know if you're able to get it working!
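
For anyone who wants to try it in the meantime, here is a minimal sketch of what calling LM Studio through LiteLLM directly might look like. The model name and port are assumptions; LM Studio exposes an OpenAI-compatible server, by default at http://localhost:1234/v1.

import litellm

# Assumption: a model named "llama-3.2-3b-instruct" is loaded in LM Studio and
# its local server is running on the default port.
response = litellm.completion(
    model="openai/llama-3.2-3b-instruct",  # "openai/" selects LiteLLM's OpenAI-compatible route
    messages=[{"role": "user", "content": "Say hello from LM Studio."}],
    api_base="http://localhost:1234/v1",   # LM Studio's local endpoint
    api_key="lm-studio",                   # LM Studio accepts any non-empty key
    max_tokens=128,
    temperature=0.1,
)
print(response.choices[0].message.content)

The openai/ prefix simply tells LiteLLM to use its OpenAI-compatible code path, which is why it works against LM Studio's local server without a dedicated integration.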

AriaShishegaran commented 3 weeks ago

@NolanTrem The approach makes total sense: if there are 10 other providers, it makes sense to have a higher-level router handle them and for R2R to simply use it. As you said, at least having robust documentation for this could also be very helpful. I'll try to get it working, and if I'm successful I'll share my insights here.

NolanTrem commented 2 weeks ago

Had a chance to play around with LM Studio and was extremely impressed by its performance over Ollama. There were a few changes I had to make to our LiteLLM provider file in order to get embeddings to work (which just involved dropping unsupported parameters). I'll look to make this a permanent change, as I would be inclined to switch over to LM Studio for most of my testing going forward.
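
For context, here is a minimal sketch of the kind of embedding call involved, with unsupported parameters dropped. The model name, port, and the use of drop_params here are assumptions, not the exact change made in the provider file.

import litellm

# Assumptions: LM Studio's OpenAI-compatible server is running on its default
# port and serving the nomic embedding model used in the config below.
response = litellm.embedding(
    model="openai/text-embedding-nomic-embed-text-v1.5",
    input=["R2R embedding smoke test"],
    api_base="http://localhost:1234/v1",  # LM Studio's local endpoint
    api_key="lm-studio",                  # any non-empty key works
    drop_params=True,  # ask LiteLLM to drop parameters the backend doesn't support
)
print(len(response.data[0]["embedding"]))  # expect 768 for this model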

Here's the config that I ended up running with:

[agent]
system_instruction_name = "rag_agent"
tool_names = ["search"]

  [agent.generation_config]
  model = "openai/llama-3.2-3b-instruct"

[completion]
provider = "litellm"
concurrent_request_limit = 1

  [completion.generation_config]
  model = "openai/llama-3.2-3b-instruct"
  base_dimension = 768
  temperature = 0.1
  top_p = 1
  max_tokens_to_sample = 1_024
  stream = false
  add_generation_kwargs = { }

[embedding]
provider = "litellm"
base_model = "openai/text-embedding-nomic-embed-text-v1.5"
base_dimension = 768
batch_size = 128
add_title_as_prefix = true
concurrent_request_limit = 2

[orchestration]
provider = "simple"
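
One detail worth noting if you reproduce this: because the models use the openai/ prefix, LiteLLM needs an OpenAI-style base URL and API key that point at LM Studio before R2R starts. A minimal sketch, assuming the default LM Studio port and that these environment variables are picked up by LiteLLM:

import os

# Assumption: LM Studio's server is on its default port; the key value is a
# placeholder, since LM Studio does not validate it.
os.environ["OPENAI_API_BASE"] = "http://localhost:1234/v1"
os.environ["OPENAI_API_KEY"] = "lm-studio"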