arcee-ai / fastmlx

FastMLX is a high performance production ready API to host MLX models.
Other
220 stars 24 forks source link

Feature Request: Integrate Features from Ollama #18

Open evertjr opened 4 months ago

evertjr commented 4 months ago

Description:
One of the reasons Ollama is so widely adopted as a tool to run local models is its ease of use and seamless integration with other tools. Users can simply install an app that starts a server on the machine along with a terminal CLI to download and manage models. It would be beneficial to integrate several features from Ollama into FastMLX to enhance user experience and functionality.

Suggested Features:

  1. Simple App to Start with the System:

    • Develop a lightweight desktop application that starts with the system and provides easy access to FastMLX's functionalities. This application should have a system tray icon for quick settings and access.
  2. CLI Client to Manage Models:

    • Implement a command-line interface (CLI) for downloading and managing models. This CLI should offer commands similar to those in Ollama, such as:
      • fastmlx run gemma2 - To test the selected model in the terminal
      • fastmlx pull gemma2 - To download a specified model.
      • fastmlx rm gemma2 - To remove a specified model.
      • fastmlx list - To list all downloaded models.
stewartugelow commented 4 months ago

What sorts of things are you thinking about in terms of Ollama’s integration with other tools?

evertjr commented 4 months ago

My main use case for Ollama is for coding assistance. It works with the continue.dev extension on vscode for both chat and tab autocomplete and it works great. There's also native apps like mindmac to chat with local models, there's even a Raycast extension which can quickly receive selected files and text as context.

stewartugelow commented 4 months ago

Here's what I could find on your ideas:

  1. Simple app to start the system

Looks like this can be done with a combination of launchctl and a bundled applescript application, according to ChatGPT

https://chatgpt.com/share/06771026-c28d-4019-9fab-c4dbd5af4a1d

  1. The CLI Client to Manage Models

The code for this essentially exists in the API endpoints, so it's just a matter of whether @Blaizzy wants to mimic the ollama syntax in the CLI, too.

  1. Integration with other tools

I looked at the examples you gave, and, as far as I can tell, those tools are simply including presets of the API endpoints and chat model as a convenience. Maybe there could be a section added to the docs on "integrating FastMLX into your app" that users could send to application developers. Although @Blaizzy might want to pick a port other than 8000 as a default that would be unlikely to conflict with any other FastAPI installs. (Ollama uses port 11434 as a default.)

Hope this helps!

Blaizzy commented 4 months ago

Hey @evertjr and @stewartugelow

Thank you very much for the discussion!

I think I got a high-level idea of use cases.

Regarding number 1, I will make it a part of LisaPro my local coding assistant for Mac offer. We are launching end of this month.

Regarding number 2, as @stewartugelow suggested, the APIs already exist. It’s a matter adding those CLI commands.

Regarding number 3, @stewartugelow could you to add the examples here #19 in the docs/examples directory?

stewartugelow commented 4 months ago

Regarding number 1 — when do you sleep??? :)

Regarding number 3 — I’m happy to, but I’m not sure what you mean?

Blaizzy commented 3 months ago
  1. Never :)

  2. I meant add the code/cookbook example on how to integrate FastMLX with different apps.

Blaizzy commented 3 months ago

@evertjr I will add the features you requested (CLI Client to Manage Models) here after #21. It will come alongside a FastMLX Python client so you can start FastMLX programmatically.

The lightweight app is on the backlog at the moment.

viljark commented 2 months ago

First of all thank you for the awesome work! I would be very interested in the following Ollama features to be available with fastmlx:

  1. keep_alive: controls how long the model will stay loaded into memory following the request (default: 5m). I would like the model to be unloaded if it is not used in a while.
  2. OLLAMA_MAX_LOADED_MODELS: Set to zero for fully dynamic based on VRAM capacity, or a fixed number greater than 1 to limit the total number of loaded models regardless of VRAM capacity. This way model switching would be seamless by just unloading other models if VRAM limits are reaching (more info).
  3. v1/models to return OpenAI api compatible response with all installed local models listed, not just the loaded ones

These features would be really nice QOL improvements when you like to switch models often, run split chats with different models and still keep the system performance optimal. The biggest benefit from the mlx would be model loading speed, which on my M3 mac is more than 2x faster than in Ollama and reduced memory usage..