alabulei1 opened 4 months ago
According to the log message below, the model name in the request was incorrectly set to `internlm2_5-7b-chat`. The correct name is `glm-4-9b-chat`, which was set via the `--model-name` option in the command.
```
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
```
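One quick way to confirm which model names the server actually registered (assuming the server exposes the OpenAI-compatible model listing endpoint) is to query it directly; a minimal sketch:

```bash
# Assumption: the API server exposes GET /v1/models (OpenAI-compatible).
# The names returned here are the ones the chat graphs are registered under,
# so the "model" field in chat requests must match one of them.
curl http://localhost:8080/v1/models
```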
In case there was something wrong with the chatbot UI, I sent an API request to the model directly:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Plan me a two day trip in Paris"}], "model":"glm-4-9b-chat"}'
```

```
curl: (52) Empty reply from server
```
The log is as follows:
```
[2024-07-11 09:59:04.506] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA0 compute buffer size = 304.00 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph nodes = 1606
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph splits = 2
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: CUDA error: an illegal memory access was encountered
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: current device: 0, in function launch_mul_mat_q at /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2602
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
GGML_ASSERT: /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Aborted (core dumped)
```
A CUDA error was triggered. It looks like a memory issue. @hydai Could you please help with this issue? Thanks!
Is this model supported by llama.cpp?
Yes.
Could you please try llama.cpp with CUDA enabled to run this model? Since this is an internal error in the llama.cpp CUDA backend, I would like to know if this is an upstream issue.
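For reference, a rough way to test the same GGUF file against upstream llama.cpp with the CUDA backend; the build flag, binary name, and paths below are assumptions for a recent llama.cpp checkout, so adjust them to your setup:

```bash
# Build upstream llama.cpp with the CUDA backend enabled (flag name assumed for recent checkouts)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run the same model fully offloaded to the GPU and send a prompt.
# If this also aborts with an illegal memory access, the bug is upstream.
./build/bin/llama-cli -m /path/to/glm-4-9b-chat-Q5_K_M.gguf -ngl 99 \
  -p "Plan me a two day trip in Paris"
```

If the upstream binary handles the prompt cleanly, the problem is more likely in the wasi-nn-ggml plugin integration than in llama.cpp itself.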
When I run glm-4-9b-chat-Q5_K_M.gguf on the CUDA 12 machine, the API server starts successfully. However, when I send a question, the API server crashes.
The command I used to start the API server is as follows:
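(The flags below are a representative LlamaEdge setup and are assumptions rather than the exact original command.)

```bash
# Representative llama-api-server invocation: the model file is preloaded via WASI-NN,
# and --model-name sets the name that the "model" field in chat requests must match.
# Prompt template and context size are assumed values.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:glm-4-9b-chat-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template glm-4-chat \
  --model-name glm-4-9b-chat \
  --ctx-size 4096
```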
Here is the error message:
Versions