edgenai / edgen

⚡ Edgen: Local, private GenAI server alternative to OpenAI. No GPU required. Run AI models locally: LLMs (Llama2, Mistral, Mixtral...), Speech-to-text (whisper) and many others.
https://docs.edgen.co/
Apache License 2.0

refactor: agnostic to ML backend #120

Closed by francis2tm 4 months ago

francis2tm commented 5 months ago

Overview

In the future, we want to support multiple ML backends for each endpoint.

Example:

- chat/completions can use: llama.cpp, candle, ...
- audio/transcriptions can use: whisper.cpp, ...
- images/generations can use: ...

toschoo commented 5 months ago

Currently, there is a coupling between endpoints and model types (the model "kind" in our code):

The second issue is trivial to solve: just pass the endpoint explicitly to preload (we already do something similar in the integration tests).
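For illustration, a minimal Rust sketch of that idea; the `Endpoint` enum and `preload` signature are hypothetical, not edgen's actual API:

```rust
/// Hypothetical endpoint enum; edgen's real types may differ.
#[derive(Clone, Copy, Debug)]
enum Endpoint {
    ChatCompletions,
    AudioTranscriptions,
    ImageGenerations,
}

/// The caller states which endpoint it is preloading for, so preload
/// no longer has to infer the endpoint from the model "kind".
fn preload(model_name: &str, endpoint: Endpoint) {
    // Dispatch to whichever backend serves this endpoint.
    println!("preloading {model_name} for {endpoint:?}");
}
```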

The first issue causes more headaches. Possible solutions:

toschoo commented 5 months ago

Here's my solution: provide a file of the form

```
[llama.cpp]
neural-chat
phi
TinyLlama
zephyr
nomic-embed
mistral
...

[whisper.cpp]
distill
whisper
...
```

The file lives in the models directory (i.e., data_dir/models). Expert users can, and are encouraged to, add their own models to the list. We can later extend the format with additional information, but that's for the future.

Whenever we pass a request to a model endpoint, we first find out which backend we need:
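A minimal sketch of that lookup, assuming the file format above and plain substring matching; `parse_backends` and `backend_for` are hypothetical names, not edgen's actual implementation:

```rust
/// Parse the backend file into (backend, patterns) pairs, preserving
/// file order so earlier sections can take precedence over later ones.
fn parse_backends(contents: &str) -> Vec<(String, Vec<String>)> {
    let mut backends: Vec<(String, Vec<String>)> = Vec::new();
    for line in contents.lines().map(str::trim) {
        // Skip blank lines and "..." placeholders.
        if line.is_empty() || line == "..." {
            continue;
        }
        if let Some(name) = line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) {
            // A section header such as [llama.cpp] starts a new backend.
            backends.push((name.to_string(), Vec::new()));
        } else if let Some((_, patterns)) = backends.last_mut() {
            // Every other line is a model-name pattern for that backend.
            patterns.push(line.to_string());
        }
    }
    backends
}

/// Return the first backend whose pattern list matches the model name;
/// here a pattern matches if it is a substring of the requested name.
fn backend_for<'a>(backends: &'a [(String, Vec<String>)], model: &str) -> Option<&'a str> {
    backends
        .iter()
        .find(|(_, patterns)| patterns.iter().any(|p| model.contains(p.as_str())))
        .map(|(name, _)| name.as_str())
}
```

Given the file above, a model named "mistral-7b-instruct" would resolve to `Some("llama.cpp")`, because the name contains the listed pattern "mistral".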

francis2tm commented 5 months ago

It looks promising. However, I have a few concerns with this:

  1. llama.cpp already supports a lot of models (4k+). Could we do something like:

    [llama.cpp]
    *.gguf

    Or accept any model from Hugging Face that has the GGUF tag (https://huggingface.co/models?library=gguf).

  2. Upgrading: what happens if someone has an old version of the file and a new release needs to update it?

toschoo commented 5 months ago
  1. The extension solution (*.gguf) works as a default. But when we also want to use candle, for example, we have to add the patterns that should be handled by that backend on one of the sides; otherwise, all models with the gguf extension will always be handled by the same backend. Note that the strings there are not complete model names but patterns: if a model name contains the substring "chat", for example, then we send it to the llm backend. See the sketch after this list.
  2. I don't understand this comment. Note, again, that the strings in the list are patterns, not complete model names.
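To make the precedence concrete, here is a hedged sketch of these semantics (assumed, not actual edgen code): plain entries match as substrings, an extension entry like *.gguf matches as a fallback, and section order decides who wins overlapping matches:

```rust
/// Assumed pattern semantics: "*.ext" matches by file extension,
/// anything else matches as a substring of the model name.
fn pattern_matches(pattern: &str, model: &str) -> bool {
    if let Some(ext) = pattern.strip_prefix("*.") {
        model.ends_with(&format!(".{ext}"))
    } else {
        model.contains(pattern)
    }
}

fn main() {
    // [candle] lists its explicit patterns before [llama.cpp]'s
    // *.gguf default, so unclaimed gguf models still go to llama.cpp.
    let backends = [
        ("candle", vec!["phi"]),
        ("llama.cpp", vec!["chat", "*.gguf"]),
    ];
    for model in ["phi-2.gguf", "neural-chat-7b.gguf", "zephyr.gguf"] {
        let backend = backends
            .iter()
            .find(|(_, patterns)| patterns.iter().any(|p| pattern_matches(p, model)))
            .map(|(name, _)| *name);
        println!("{model} -> {backend:?}");
    }
}
```

Here phi-2.gguf goes to candle even though it also matches *.gguf; listing the explicit patterns "on one of the sides" is exactly what resolves the overlap.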
toschoo commented 4 months ago

Implementation: