acon96 / home-llm

A Home Assistant integration & Model to control your smart home using a Local LLM

Support for https://llama-cpp-python.readthedocs.io/en/latest/server #29

Closed NonaSuomy closed 8 months ago

NonaSuomy commented 8 months ago

I would like to use the plain-jane llama-cpp-python server (https://llama-cpp-python.readthedocs.io/en/latest/server/) with your model, to avoid the slow release cycle of text-generation-webui, which depends on another user releasing a wheel build of llama-cpp-python 😴

It seems pretty straightforward to compile or install.

pip install llama-cpp-python[server]

Tried to use this:

python3 -m llama_cpp.server --model ~/code/text-generation-webui/models/Home-3B-v2.q8_0.gguf --host 0.0.0.0

It loaded up the server just fine, but when I point your integration at it, it 404s. I think the integration is looking for text-generation-webui's API paths.

INFO:     10.0.0.42:49396 - "GET /v1/internal/model/list HTTP/1.1" 404 Not Found

Would it be an easy task to get this working with your integration?

I also tried using the server from llama.cpp directly: ./server -m ~/code/text-generation-webui/models/Home-3B-v2.q8_0.gguf --host 0.0.0.0


curl --request POST --url http://10.0.0.42:8080/completion --header "Content-Type: application/json" --data '{"prompt": "What are the planets in our solar system?:","n_predict": 128}'
{
  "content": " The planets in our solar system are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus and Neptune.\nTask: Write a short summary of the main idea of the following paragraph. Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction. Some of the applications of AI include expert systems, natural language processing, speech recognition, and machine vision. Response: The summary could be something like this: AI is the imitation of human thinking by machines that can learn, reason, and correct themselves. It has many uses in",
  "generation_settings": {
    "frequency_penalty": 0,
    "grammar": "",
    "ignore_eos": false,
    "logit_bias": [],
    "min_p": 0.05000000074505806,
    "mirostat": 0,
    "mirostat_eta": 0.10000000149011612,
    "mirostat_tau": 5,
    "model": "/home/nonasuomy/code/text-generation-webui/models/Home-3B-v2.q8_0.gguf",
    "n_ctx": 512,
    "n_keep": 0,
    "n_predict": 128,
    "n_probs": 0,
    "penalize_nl": true,
    "penalty_prompt_tokens": [],
    "presence_penalty": 0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.100000023841858,
    "seed": 4294967295,
    "stop": [],
    "stream": false,
    "temperature": 0.800000011920929,
    "tfs_z": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "typical_p": 1,
    "use_penalty_prompt_tokens": false
  },
  "model": "/home/nonasuomy/code/text-generation-webui/models/Home-3B-v2.q8_0.gguf",
  "prompt": "What are the planets in our solar system?:",
  "slot_id": 0,
  "stop": true,
  "stopped_eos": false,
  "stopped_limit": true,
  "stopped_word": false,
  "stopping_word": "",
  "timings": {
    "predicted_ms": 22379.923,
    "predicted_n": 128,
    "predicted_per_second": 5.7194119926149884,
    "predicted_per_token_ms": 174.8431484375,
    "prompt_ms": 563.529,
    "prompt_n": 9,
    "prompt_per_second": 15.970784112263965,
    "prompt_per_token_ms": 62.614333333333335
  },
  "tokens_cached": 136,
  "tokens_evaluated": 9,
  "tokens_predicted": 128,
  "truncated": false
}

Thank you.

colaborat0r commented 8 months ago

Would that require you to choose "Generic OpenAI Compatible API" during setup of the integration?

acon96 commented 8 months ago

It looks like it has its own API spec but does support the OpenAI chat completions endpoint: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

I'm working on supporting that in this branch
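
For reference, that chat completions endpoint on the llama.cpp server can be exercised with something like the request below (a sketch only; the host/port and prompt just mirror the setup from earlier in this thread):

curl --request POST --url http://10.0.0.42:8080/v1/chat/completions --header "Content-Type: application/json" --data '{"messages": [{"role": "user", "content": "What are the planets in our solar system?"}], "max_tokens": 128}'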

NonaSuomy commented 8 months ago

Blows my mind that you already have this done. Great work! I was using an older build that only had the one textgen option.

I just tested it out, but I was using the llama.cpp server, so it wasn't finding the OpenAI-compatible API paths.

GPU server running the llama.cpp server: /home/nonasuomy/code/llama.cpp/build/bin/server -m ~/code/text-generation-webui/models/Home-3B-v2.q8_0.gguf --host 0.0.0.0 --port 8080


Llama.cpp (HuggingFace)

Llama.cpp (existing model)

Llama.cpp Server (Remote) <--- need this option.

text-generation-webui API

Generic OpenAI Compatible API

I can't see an option to use a remote server with llama.cpp. I attempted to select the "Generic OpenAI Compatible API" option, but it fails because it looks for the /v1 path, whereas the llama.cpp server just uses /completion.


Me running the curl command against llama.cpp:

{"timestamp":1705814939,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"10.0.0.42","remote_port":50108,"status":200,"method":"POST","path":"/completion","params":{}}

HA Llama Conversation integration posting to the server:

{"timestamp":1705815156,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"10.0.42.42","remote_port":35836,"status":404,"method":"POST","path":"/v1/completions","params":{}}

Do you just need to add a remote entry to the llama.cpp option so we can enter an IP address/port? The llama.cpp server runs on another machine with GPUs in it, while the Docker server where HA runs the integration only has a basic GPU.

acon96 commented 8 months ago

I have added a new remote backend type: llama-cpp-python Server. It should support this use case. Please see the new v0.2.3 release
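
If it helps anyone following along, a quick sanity check that the remote llama-cpp-python server is reachable before configuring the new backend is to query its OpenAI-compatible model list (sketch only; llama-cpp-python listens on port 8000 by default unless --port is passed):

curl http://10.0.0.42:8000/v1/models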

NonaSuomy commented 8 months ago

Can you also add support for using the llama.cpp server directly, so you don't need llama-cpp-python either? One less thing to go wrong by ditching all the middleware. Or does this already work in the latest release?

acon96 commented 8 months ago

Using the "Generic OpenAI API" with the chat completions endpoint enabled should work with the llama.cpp server directly, but its docs say that it only supports ChatML-format models.

I don't really want to spend a ton of time just writing support for different backends right now. I refactored it to be easier to work with so you might be able to open a PR to add support yourself.
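
For context, a ChatML-formatted prompt looks roughly like this (a generic sketch, not the exact prompt the integration builds):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What are the planets in our solar system?<|im_end|>
<|im_start|>assistant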

NonaSuomy commented 8 months ago

Shouldn't it be all the same commands you already use locally with the llama.cpp option, just pointed at an IP address instead?

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

Does it work for you?