Niek / chatgpt-web

ChatGPT web interface using the OpenAI API
https://niek.github.io/chatgpt-web/
GNU General Public License v3.0

Local models #105

mudler opened this issue 1 year ago

mudler commented 1 year ago

Hey :wave: !

Awesome project!

I'm trying to run chatgpt-web with llama.cpp. I've created a project using Go llama.cpp bindings, https://github.com/go-skynet/llama-cli, which mimics the OpenAI API so it is 1:1 compatible, but serves multiple models that run locally instead.

Everything seems to work so far, and I'd like to document how to run the two together so chatgpt-web can be used with local models. However, I'm struggling because chatgpt-web filters the models returned by the API against the models available from OpenAI - llama-cli returns a list of models, but chatgpt-web's filtering prevents selecting models from that list (e.g. alpaca can't be run unless I do some hardwiring on the API).

If you want to test it, you need to run llama-cli from the latest image built from master, like so:

```
./llama-cli api --address 0.0.0.0:8080 --models-path models-path-here --threads 14
```

And set the VITE_API_BASE accordingly in the .env file.
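For reference, a minimal .env sketch, assuming llama-cli is listening on localhost port 8080 as in the command above (whether a path suffix is needed depends on how llama-cli exposes its endpoints):

```
VITE_API_BASE=http://localhost:8080
```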

It would be super-cool if we could work together to add the capability to load local models, maybe even adding an option to run the two side by side with docker-compose (that's what I'm currently doing!). WDYT?
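For illustration, a rough docker-compose sketch of what running the two side by side could look like; the image names, tags, ports and paths here are assumptions, not the published configuration of either project:

```yaml
# Sketch only: image names, tags, ports and volume paths are assumptions.
version: "3"
services:
  api:
    image: quay.io/go-skynet/llama-cli:latest      # assumed llama-cli image
    command: api --address 0.0.0.0:8080 --models-path /models --threads 14
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"
  chatgpt-web:
    image: ghcr.io/niek/chatgpt-web:latest         # assumed chatgpt-web image
    # Note: VITE_API_BASE is a Vite build-time variable, so a prebuilt image
    # may need to be rebuilt with this value rather than reading it at runtime.
    environment:
      - VITE_API_BASE=http://localhost:8080
    ports:
      - "5173:5173"
    depends_on:
      - api
```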

Niek commented 1 year ago

Thanks! llama-cli with the API addition sounds like a great match with ChatGPT-web! The models don't work because we hard-code an explicit list of supported models: https://github.com/Niek/chatgpt-web/blob/1926f7df15b5bf099d1f0ad29740d35c98cfbbdf/src/lib/Types.svelte#L2-L9

This can be quite easily fixed though. I guess we should support everything with ggml and assume a $0 cost for these models. The model selection needs some work in any case. I tested with ggml-vicuna-7b-4bit and it worked well, although the output was gibberish.
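For illustration, the relaxed filtering could look roughly like the sketch below. This is not the actual contents of Types.svelte; the per-token prices are only the published OpenAI rates at the time, and unknown (local/ggml) models fall back to a $0 cost:

```typescript
// Illustrative sketch only, not the real Types.svelte code.
// Keep per-token prices for the known OpenAI models and treat any other
// model reported by the API (e.g. local ggml models) as supported at $0.
const modelPrices: Record<string, { prompt: number; completion: number }> = {
  'gpt-3.5-turbo': { prompt: 0.000002, completion: 0.000002 },
  'gpt-4': { prompt: 0.00003, completion: 0.00006 }
}

const getPrice = (model: string) =>
  modelPrices[model] ?? { prompt: 0, completion: 0 } // unknown/local models are free
```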

Are you planning on adding streaming support to the API as well (using EventSource/SSE)?
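For context, a minimal sketch of what consuming such a stream on the web side could look like, assuming OpenAI-style "data: {json}" chunks terminated by "data: [DONE]". EventSource only supports GET, so fetch plus a ReadableStream is the usual approach for a POST endpoint; the /v1/chat/completions path is an assumption about the server:

```typescript
// Sketch of a client-side consumer for an OpenAI-style streamed completion.
// Endpoint path and chunk format are assumptions about the server.
async function streamChat (apiBase: string, body: object, onToken: (t: string) => void) {
  const res = await fetch(`${apiBase}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ ...body, stream: true })
  })
  const reader = res.body!.getReader()
  const decoder = new TextDecoder()
  let buffer = ''
  while (true) {
    const { value, done } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    const lines = buffer.split('\n')
    buffer = lines.pop() ?? '' // keep any partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue
      const payload = line.slice(6).trim()
      if (payload === '[DONE]') return
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content
      if (delta) onToken(delta)
    }
  }
}
```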

mudler commented 1 year ago

> Thanks! llama-cli with the API addition sounds like a great match with ChatGPT-web! The models don't work because we hard-code an explicit list of supported models:
>
> https://github.com/Niek/chatgpt-web/blob/1926f7df15b5bf099d1f0ad29740d35c98cfbbdf/src/lib/Types.svelte#L2-L9
>
> This can be quite easily fixed though. I guess we should support everything with ggml and assume a $0 cost for these models. The model selection needs some work in any case.

Yup, I managed to find that bit, so I was wondering which direction to take (I don't like forking!), but that sounds good to me! I'd be more than happy then to provide a docker-compose file in llama-cli as well, to point users directly to chatgpt-web!

> I tested with ggml-vicuna-7b-4bit and it worked well, although the output was gibberish.

It needs a prompt template injected into each call; I've just updated the API docs to cover that: https://github.com/go-skynet/llama-cli#web-interface. TL;DR: just add a corresponding "model-file-name.bin.tmpl" file containing the default prompt, for instance:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:
```

(but for vicuna/chat I think it would be slightly different)
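For example, a vicuna-style template might look roughly like this; the exact wording and role labels vary between vicuna releases, so treat this as an assumption to adapt:

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {{.Input}}
ASSISTANT:
```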

> Are you planning on adding streaming support to the API as well (using EventSource/SSE)?

This comes with a high computational cost, so I'm not really going in that direction for now - CGO calls are really expensive, and if we want to stream token by token by calling the underlying C functions directly from Go, that will likely bump response time by quite a lot.

mkellerman commented 1 year ago

Guys, I just wanna say thanks! This is a beautiful collaboration between two amazing projects!

mkellerman commented 1 year ago

Regarding the models, I think we need to let the user add endpoints, instead of a single 'openai' URL.

If you want to use openai/gpt-4, you select the model from the dropdown and hit [+] to add a custom endpoint and a custom return object.

And just give enough info in the docs on how to POST/GET from the custom endpoints.
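As a rough idea of the shape such an entry could take (purely hypothetical, not an existing chatgpt-web type):

```typescript
// Hypothetical shape for a user-defined endpoint entry (illustration only).
interface CustomEndpoint {
  label: string                               // name shown in the model dropdown
  baseUrl: string                             // e.g. http://localhost:8080
  model: string                               // model name sent in the request body
  mapResponse?: (json: unknown) => string     // optional custom return-object mapping
}

const endpoints: CustomEndpoint[] = [
  { label: 'OpenAI gpt-4', baseUrl: 'https://api.openai.com', model: 'gpt-4' },
  { label: 'Local alpaca', baseUrl: 'http://localhost:8080', model: 'alpaca' }
]
```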

mudler commented 1 year ago

Re: token streaming, JFYI it is being tracked in https://github.com/go-skynet/go-llama.cpp/issues/4. I still think it would incur a high computational cost and decrease overall performance, but I'll be glad to take a stab at it next.