edgenai / edgen

⚡ Edgen: Local, private GenAI server alternative to OpenAI. No GPU required. Run AI models locally: LLMs (Llama2, Mistral, Mixtral...), Speech-to-text (whisper) and many others.
https://docs.edgen.co/
Apache License 2.0

refactor: agnostic to ML backend #120

Closed by francis2tm 4 months ago

francis2tm commented 5 months ago

Overview

In the future, we want to support multiple ML backends for each endpoint.

Example:

- chat/completions can use: llama.cpp, candle, ...
- audio/transcriptions can use: whisper.cpp, ...
- images/generations can use: ...

toschoo commented 5 months ago

Currently, there is a coupling between endpoints and model types (the model "kind" in our code):

The second issue is trivial to solve: just pass the endpoint explicitly to preload (we already do something similar in the integration tests).
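For illustration, a minimal Rust sketch of that idea; the `Endpoint` enum and `preload` signature are hypothetical, not edgen's actual API:

```rust
/// Hypothetical endpoint enum; edgen's real types may differ.
#[derive(Clone, Copy, Debug)]
enum Endpoint {
    ChatCompletions,
    AudioTranscriptions,
    ImageGenerations,
}

/// The caller states which endpoint it is preloading for, so preload
/// no longer has to infer the endpoint from the model "kind".
fn preload(model_name: &str, endpoint: Endpoint) {
    // Dispatch to whichever backend serves this endpoint.
    println!("preloading {model_name} for {endpoint:?}");
}
```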

The first issue causes more headaches. Possible solutions:

toschoo commented 5 months ago

Here's my solution: provide a file of the form

```
[llama.cpp]
neural-chat
phi
TinyLlama
zephyr
nomic-embed
mistral
...

[whisper.cpp]
distill
whisper
...
```

The file lives in the models directory (i.e., data_dir/models). Expert users can, and are encouraged to, add their own models to the list. We can later extend the format with additional information, but that's for the future.

Whenever we pass a request to a model endpoint, we first find out which backend we need:
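A minimal sketch of that lookup, assuming the file format above and plain substring matching; `parse_backends` and `backend_for` are hypothetical names, not edgen's actual implementation:

```rust
/// Parse the backend file into (backend, patterns) pairs, preserving
/// file order so earlier sections can take precedence over later ones.
fn parse_backends(contents: &str) -> Vec<(String, Vec<String>)> {
    let mut backends: Vec<(String, Vec<String>)> = Vec::new();
    for line in contents.lines().map(str::trim) {
        // Skip blank lines and "..." placeholders.
        if line.is_empty() || line == "..." {
            continue;
        }
        if let Some(name) = line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) {
            // A section header such as [llama.cpp] starts a new backend.
            backends.push((name.to_string(), Vec::new()));
        } else if let Some((_, patterns)) = backends.last_mut() {
            // Every other line is a model-name pattern for that backend.
            patterns.push(line.to_string());
        }
    }
    backends
}

/// Return the first backend whose pattern list matches the model name;
/// here a pattern matches if it is a substring of the requested name.
fn backend_for<'a>(backends: &'a [(String, Vec<String>)], model: &str) -> Option<&'a str> {
    backends
        .iter()
        .find(|(_, patterns)| patterns.iter().any(|p| model.contains(p.as_str())))
        .map(|(name, _)| name.as_str())
}
```

Given the file above, a model named "mistral-7b-instruct" would resolve to `Some("llama.cpp")`, because the name contains the listed pattern "mistral".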

francis2tm commented 5 months ago

It looks promising. However, I have a few concerns with this:

  1. llama.cpp already supports a lot of models (4k+). Could we do something like:

    [llama.cpp]
    *.gguf

    Or accept any model from Hugging Face that has the GGUF tag (https://huggingface.co/models?library=gguf).

  2. Upgrading: what happens if someone has an old version of the file and a new release needs to update it?

toschoo commented 5 months ago
  1. The extension solution (*.gguf) works as a default. But when we also want to use candle, for example, we have to add the patterns that should be handled by that backend on one of the sides; otherwise, all models with the gguf extension will always be handled by the same backend. Note that the strings there are not complete model names but patterns: if a model name contains the substring "chat", for example, then we send it to the llm backend. See the sketch after this list.
  2. I don't understand this comment. Note, again, that the strings in the list are patterns, not complete model names.
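To make the precedence concrete, here is a hedged sketch of these semantics (assumed, not actual edgen code): plain entries match as substrings, an extension entry like *.gguf matches as a fallback, and section order decides who wins overlapping matches:

```rust
/// Assumed pattern semantics: "*.ext" matches by file extension,
/// anything else matches as a substring of the model name.
fn pattern_matches(pattern: &str, model: &str) -> bool {
    if let Some(ext) = pattern.strip_prefix("*.") {
        model.ends_with(&format!(".{ext}"))
    } else {
        model.contains(pattern)
    }
}

fn main() {
    // [candle] lists its explicit patterns before [llama.cpp]'s
    // *.gguf default, so unclaimed gguf models still go to llama.cpp.
    let backends = [
        ("candle", vec!["phi"]),
        ("llama.cpp", vec!["chat", "*.gguf"]),
    ];
    for model in ["phi-2.gguf", "neural-chat-7b.gguf", "zephyr.gguf"] {
        let backend = backends
            .iter()
            .find(|(_, patterns)| patterns.iter().any(|p| pattern_matches(p, model)))
            .map(|(name, _)| *name);
        println!("{model} -> {backend:?}");
    }
}
```

Here phi-2.gguf goes to candle even though it also matches *.gguf; listing the explicit patterns "on one of the sides" is exactly what resolves the overlap.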
toschoo commented 4 months ago

Implementation: