ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Server /v1/chat/completions API response's model info is wrong #10056

Open RifeWang opened 1 month ago

RifeWang commented 1 month ago

What happened?

When starting the server through a Docker image, the model must be specified; otherwise, it defaults to models/7B/ggml-model-f16.gguf, and if this is not present locally, the server will exit with an error.
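For reference, specifying the model explicitly at startup avoids the default-path error shown in the log below. A minimal sketch (the GGUF filename is a placeholder; the rest mirrors the command from the log output):

docker run -v /ai-models:/models -p 8000:8000 \
    ghcr.io/ggerganov/llama.cpp:server \
    -m /models/your-model.gguf \
    --port 8000 --host 0.0.0.0 -n 512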

However, the POST /v1/chat/completions API also accepts a model parameter, yet this parameter is not validated in any way: the response simply echoes whatever the user sends. Moreover, if the user omits the model parameter, the response defaults to gpt-3.5-turbo-0613, which is clearly incorrect.

It is recommended to keep the model information consistent: the model reported in the response should match the model that is actually loaded.
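Until then, a client that needs the real model name can query the OpenAI-compatible models endpoint instead of trusting the echoed field. A sketch (response shape assumed from the OpenAI-style API, not verified against this build):

curl http://127.0.0.1:8000/v1/models

The id of the entry returned should correspond to the loaded GGUF file rather than whatever was sent in the request.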

Name and Version

REPOSITORY                    TAG      IMAGE ID       CREATED        SIZE
ghcr.io/ggerganov/llama.cpp   server   cd43d22f4e97   14 hours ago   203MB

What operating system are you seeing the problem on?

No response

Relevant log output

$ docker run -v /ai-models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server --port 8000 --host 0.0.0.0 -n 512
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 3978 (ff252ea4) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 3, n_threads_batch = 3, total_threads = 3

system_info: n_threads = 3 (n_threads_batch = 3) / 3 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 8000, http threads: 3
main: loading model
gguf_init_from_file: failed to open 'models/7B/ggml-model-f16.gguf': 'No such file or directory'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/7B/ggml-model-f16.gguf

llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model 'models/7B/ggml-model-f16.gguf'
srv    load_model: failed to load model, 'models/7B/ggml-model-f16.gguf'
main: exiting due to model loading error

-----
$ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "hgfojhhu",
    "messages": [
        {
            "role": "system",
            "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
        },
        {
            "role": "user",
            "content": "Write a limerick about python exceptions"
        }
    ],"stream":true
}'

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"There"}}],"created":1729925470,"id":"chatcmpl-rRm3fvZvHjokJua300KMiTHfXNOtBlsj",
"model":"hgfojhhu","object":"chat.completion.chunk"}

-----
$ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "",
    "messages": [
        {
            "role": "system",
            "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
        },
        {
            "role": "user",
            "content": "Write a limerick about python exceptions"
        }
    ],"stream":true
}'

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"There"}}],"created":1729925508,"id":"chatcmpl-OGzkuWEqa5zbzu5sGAQaclg1XlPHLo6l",
"model":"","object":"chat.completion.chunk"}

------
$ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{

    "messages": [
        {
            "role": "system",
            "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
        },
        {
            "role": "user",
            "content": "Write a limerick about python exceptions"
        }
    ],"stream":true
}'

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"There"}}],"created":1729925530,"id":"chatcmpl-LocVsoKFlnpeUar1OdCbOzZnZ3VoQa4c",
"model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk"}
RifeWang commented 1 month ago

Another question is whether the server supports dynamically switching between different models after startup.

ngxson commented 1 month ago

The "model" field is solely for being openai-compatible and it does not reflect the real value.

Another question is whether the server supports dynamically switching between different models after startup.

No, we don't support this, as we aim to keep the code simple. Some other wrappers like ollama do support it by maintaining multiple instances of llama.cpp under the hood.
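If the goal is only to have the REST API report a meaningful name, the server's -a / --alias option sets an alias for the model name used by the API. A sketch building on the command above (placeholder model file and alias; it is not verified here whether the alias is also echoed in chat completion chunks):

docker run -v /ai-models:/models -p 8000:8000 \
    ghcr.io/ggerganov/llama.cpp:server \
    -m /models/your-model.gguf -a my-local-model \
    --port 8000 --host 0.0.0.0 -n 512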