Closed Aisuko closed 2 weeks ago
I would like to have a volumes/models/ folder for the local models. Please take a look at docker-compose.yaml. Here are the lines where we save the model volumes:
We will have two types of models in the future: .safetensors and .gguf. The first will be the customised model and can be used for CPT; .gguf is for inference on CPU.
We should add them to the folder under volumes.
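To make the two-format layout concrete, here is a minimal sketch of how the backend could tell the two kinds of files apart once they sit under volumes/models/. The function name `list_models` and the labels in `MODEL_KINDS` are hypothetical and only illustrate the split described above; nothing like this exists in the PR yet.

```python
from pathlib import Path

# Assumed mapping from file extension to intended use, taken from the comment above.
# The label strings are illustrative only.
MODEL_KINDS = {
    ".safetensors": "customised (CPT)",
    ".gguf": "CPU inference",
}


def list_models(models_dir):
    """Group model file names found under models_dir by their intended use."""
    found = {kind: [] for kind in MODEL_KINDS.values()}
    for path in sorted(Path(models_dir).rglob("*")):
        kind = MODEL_KINDS.get(path.suffix.lower())
        if path.is_file() and kind:
            found[kind].append(path.name)
    return found
```

Calling `list_models("volumes/models")` would return a dict with one list per model kind; files with any other extension are ignored.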
The chat API currently uses llamacpp. However, I have commented out code such as model download and scoring. We need to discuss these features and make them work well with the new inference method.
Below is the test log. I use a dev container development environment on AWS. The model I am using is gpt2-minimal.
Prompt:
{
"accountID": 0,
"sessionId": 0,
"message": "Building a website can be done in 10 simple steps:"
}
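For anyone who wants to reproduce this, the prompt can be sent with a few lines of standard-library Python. This is only a sketch: the route /api/chat is the one that shows up in the access log of this test, but `BASE_URL` (host and port) is an assumption, and `build_chat_request` is a hypothetical helper, not part of the backend.

```python
import json
import urllib.request

# Hypothetical base URL: the host and port are not stated in this thread.
BASE_URL = "http://127.0.0.1:8000"


def build_chat_request(message, account_id=0, session_id=0, base_url=BASE_URL):
    """Build (but do not send) a POST request carrying the chat payload."""
    payload = {
        "accountID": account_id,
        "sessionId": session_id,
        "message": message,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, assuming the backend is reachable at that address.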
Response
2024-06-27 05:29:38.539 | INFO | src.repository.rag.chat:inference:199 - inference answer value:{'content': '\n\n1) Request the .html on the web\n\n2) Add an HTML element and edit the "Content-Type" for it.\n\n3) Type in a new "Content-Type" that is the same as the "Content-Image" you downloaded from the server.\n\n4) Within a few seconds, the "Content-Type" that you clicked on is the same as the "Content-Image" that is on your desktop or a Google Sheet.\n\nIn the future you can recreate the content and edit the content using this template.\n\nI hope this has helped, and if not', 'id_slot': 0, 'stop': True, 'model': 'models/gpt2-minimal-Q4_K_M-v2.gguf', 'tokens_predicted': 128, 'tokens_evaluated': 11, 'generation_settings': {'n_ctx': 512, 'n_predict': -1, 'model': 'models/gpt2-minimal-Q4_K_M-v2.gguf', 'seed': 4294967295, 'temperature': 0.800000011920929, 'dynatemp_range': 0.0, 'dynatemp_exponent': 1.0, 'top_k': 40, 'top_p': 0.949999988079071, 'min_p': 0.05000000074505806, 'tfs_z': 1.0, 'typical_p': 1.0, 'repeat_last_n': 64, 'repeat_penalty': 1.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'penalty_prompt_tokens': [], 'use_penalty_prompt_tokens': False, 'mirostat': 0, 'mirostat_tau': 5.0, 'mirostat_eta': 0.10000000149011612, 'penalize_nl': False, 'stop': [], 'n_keep': 0, 'n_discard': 0, 'ignore_eos': False, 'stream': False, 'logit_bias': [], 'n_probs': 0, 'min_keep': 0, 'grammar': '', 'samplers': ['top_k', 'tfs_z', 'typical_p', 'top_p', 'min_p', 'temperature']}, 'prompt': 'Building a website can be done in 10 simple steps:', 'truncated': False, 'stopped_eos': False, 'stopped_word': False, 'stopped_limit': True, 'stopping_word': '', 'tokens_cached': 138, 'timings': {'prompt_n': 11, 'prompt_ms': 15.344, 'prompt_per_token_ms': 1.3949090909090909, 'prompt_per_second': 716.8925964546403, 'predicted_n': 128, 'predicted_ms': 444.658, 'predicted_per_token_ms': 3.473890625, 'predicted_per_second': 287.8616824615772}}
INFO: 127.0.0.1:55970 - "POST /api/chat HTTP/1.1" 201 Created
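As a quick sanity check on the speed, the throughput figures in the log follow directly from the token counts and elapsed times in the `timings` block. The sketch below recomputes them; `tokens_per_second` is a hypothetical helper, and the numbers are copied from the log above.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens / elapsed_ms * 1000.0


# Values copied from the 'timings' entry of the test log.
timings = {"prompt_n": 11, "prompt_ms": 15.344, "predicted_n": 128, "predicted_ms": 444.658}

prompt_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
predicted_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])
print(f"prompt: {prompt_tps:.1f} tok/s, generation: {predicted_tps:.1f} tok/s")
# prints "prompt: 716.9 tok/s, generation: 287.9 tok/s"
```

These match the `prompt_per_second` and `predicted_per_second` values reported by the server. Note that generation ended because the token limit was reached (`stopped_limit` is True at 128 predicted tokens), not because the model emitted an end-of-sequence token.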
Currently, the RAG feature is unusable. I think we need an efficient encoder and decoder that support both CPU and GPU, and I'm working on it.
@Micost Is it OK if we merge this one first? (We would not release a version, only merge the PR.) Currently, the deployment process does not download any model, so we can focus on the CRM features with @cbh778899. I mean it would be easier for him to deploy the backend to his local env.
Description
This PR fixes #153 and #162.
It depends on https://github.com/SkywardAI/containers/pull/32
Notes for Reviewers
I'm trying to add a customised llama.cpp image to the backend as the inference engine. It has some advantages:
A new API, version, was also added in this PR.
In the future, I plan to bind it to Rust or replace it with Rust, but currently it gives us a smooth inference experience on CPU.
Test commands
For more parameters: https://github.com/SkywardAI/llama.cpp/tree/master/examples/server
Inference on 8 CPUs at a reasonable speed
Models address:
Question
Shall we use volume? I don't see any pre-defined docker volume. I'd like to add a small model (110 MB) to the repo. It can be tracked with gitlfs. After executing the make up command, docker compose will load the models from the local folder. What is your opinion? @Micost