Add llamacpp as cpu accelerate service

Aisuko commented 2 weeks ago

Description

This PR

fixed: #153 fixed: #162

And depends on https://github.com/SkywardAI/containers/pull/32

Notes for Reviewers

I'm trying to add customise llama.cpp image to the backend as the inference engine. It has some advantages:

FP16 CPU/GPU accelerate
Support arm64

And also a new API version was added in this PR

In the future, I have a plan to bounding or replacing it with Rust, but currently it can provides us smoothly inference experience on CPU.

Test commands

For more parameters: https://github.com/SkywardAI/llama.cpp/tree/master/examples/server

docker run -p 8080:8080 -v ./models:/models gclub/llama.cpp:server--b1-7a6db5c -m models/gpt2-117m-Q4_K_M-v2.gguf -c 512 --port 8080 --host 0.0.0.0

curl --request POST     --url http://0.0.0.0:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

Inference on 8 CPU with a reasonable speed

~/workspace/chat-backend/models/gguf$ curl --request POST     --url http://0.0.0.0:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
{"content":"\n\n2) Create a simple-to-use template and a content template\n\nIf you get a little flaky and forget about the more difficult stuff, you're correct. Creating a template is much more difficult.\n\nTo succeed, you need to perform the hardest (and the best) work. Because of the inefficiency of a template, a more difficult and less convenient setup may be in your interest. These are the tasks that we can work on instead of doing.\n\nSetting up a site is like the two-part \"solution\" of building a blog. Each of these is difficult and quite expensive.","id_slot":0,"stop":true,"model":"models/gpt2-117m-Q4_K_M-v2.gguf","tokens_predicted":128,"tokens_evaluated":11,"generation_settings":{"n_ctx":512,"n_predict":-1,"model":"models/gpt2-117m-Q4_K_M-v2.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typical_p","top_p","min_p","temperature"]},"prompt":"Building a website can be done in 10 simple steps:","truncated":false,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":"","tokens_cached":138,"timings":{"prompt_n":11,"prompt_ms":15.658,"prompt_per_token_ms":1.4234545454545453,"prompt_per_second":702.5162856048026,"predicted_n":128,"predicted_ms":383.604,"predicted_per_token_ms":2.99690625,"predicted_per_second":333.6774381914683}}

Models address:

Question

Shall we use volume? I don't see any pre-defined docker volume. I'd like to add a small model 110MB size to the repo. It can be tracked with git lfs. And after execution the make up command, docker compose will load the models from local. What is your opinion? @Micost

Signed commits

[x] Yes, I signed my commits.

Micost commented 2 weeks ago

I would like to have the volumes/models/ folder which we used for local models. Please take a look to the docker-compose.yaml Here are the lines where we saved model volumes:

./backend/:/app/
${DOCKER_VOLUME_DIRECTORY:-.}/volumes/models:/models

Aisuko commented 2 weeks ago

We will have two type of models. .safetensors and .gguf in the future. The first one will be the customised model and it can be used for CPT. gguf for inference on CPU.

We should add them to the folder under volumes

Aisuko commented 2 weeks ago

The chat API currently use llamacpp. However, I comment the code like download models and score. We need to discuss these features, let them work well with new inference method.

Below is the test log, I use dev container development environment on AWS. The model I am using is gpt2-minimal

prompt below

{
  "accountID": 0,
  "sessionId": 0,
  "message": "Building a website can be done in 10 simple steps:"
}

Response

2024-06-27 05:29:38.539 | INFO     | src.repository.rag.chat:inference:199 - inference answer value:{'content': '\n\n1) Request the .html on the web\n\n2) Add an HTML element and edit the "Content-Type" for it.\n\n3) Type in a new "Content-Type" that is the same as the "Content-Image" you downloaded from the server.\n\n4) Within a few seconds, the "Content-Type" that you clicked on is the same as the "Content-Image" that is on your desktop or a Google Sheet.\n\nIn the future you can recreate the content and edit the content using this template.\n\nI hope this has helped, and if not', 'id_slot': 0, 'stop': True, 'model': 'models/gpt2-minimal-Q4_K_M-v2.gguf', 'tokens_predicted': 128, 'tokens_evaluated': 11, 'generation_settings': {'n_ctx': 512, 'n_predict': -1, 'model': 'models/gpt2-minimal-Q4_K_M-v2.gguf', 'seed': 4294967295, 'temperature': 0.800000011920929, 'dynatemp_range': 0.0, 'dynatemp_exponent': 1.0, 'top_k': 40, 'top_p': 0.949999988079071, 'min_p': 0.05000000074505806, 'tfs_z': 1.0, 'typical_p': 1.0, 'repeat_last_n': 64, 'repeat_penalty': 1.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'penalty_prompt_tokens': [], 'use_penalty_prompt_tokens': False, 'mirostat': 0, 'mirostat_tau': 5.0, 'mirostat_eta': 0.10000000149011612, 'penalize_nl': False, 'stop': [], 'n_keep': 0, 'n_discard': 0, 'ignore_eos': False, 'stream': False, 'logit_bias': [], 'n_probs': 0, 'min_keep': 0, 'grammar': '', 'samplers': ['top_k', 'tfs_z', 'typical_p', 'top_p', 'min_p', 'temperature']}, 'prompt': 'Building a website can be done in 10 simple steps:', 'truncated': False, 'stopped_eos': False, 'stopped_word': False, 'stopped_limit': True, 'stopping_word': '', 'tokens_cached': 138, 'timings': {'prompt_n': 11, 'prompt_ms': 15.344, 'prompt_per_token_ms': 1.3949090909090909, 'prompt_per_second': 716.8925964546403, 'predicted_n': 128, 'predicted_ms': 444.658, 'predicted_per_token_ms': 3.473890625, 'predicted_per_second': 287.8616824615772}}
INFO:     127.0.0.1:55970 - "POST /api/chat HTTP/1.1" 201 Created

Aisuko commented 2 weeks ago

Screenshot 2024-06-28 at 11 12 33 AM

Currently, the RAG feature is unable to use. I'm thinking we need an efficient encoder and decoder that support both CPU and GPU. And I'm working on it.

@Micost Is ok we merge this one first? (we do not release version only merge the PR) Currently, the deployment process without download any model. And we can focus on the CRM features with @cbh778899 I mean it can be easier for him deploy backend to his local env.

SkywardAI / kirin