morgen52 opened 6 days ago
common was recently changed a lot, so this probably has something to do with that.
I think you are right. When I start the server with:
./llama.cpp-b3985/build_gpu/bin/llama-server -m ../artifact/models/Mistral-7B-Instruct-v0.3.Q4_1.gguf -ngl 31 -c 8192
it works properly.
However, when I add the --no-warmup flag:
./llama.cpp-b3985/build_gpu/bin/llama-server -m ../artifact/models/Mistral-7B-Instruct-v0.3.Q4_1.gguf -ngl 31 --no-warmup
it tells me that --no-warmup is not a valid argument:
error: invalid argument: --no-warmup
So I think this hint should be updated:
common_init_from_params : warming up the model with an empty run - please wait ... (--no-warmup to disable)
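For what it's worth, a quick way to check whether a particular build actually recognizes the flag is to grep its own help output, for example:
./llama.cpp-b3985/build_gpu/bin/llama-server --help 2>&1 | grep -i warmup
If nothing comes back, the binary does not accept the flag even though the warmup hint mentions it.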
May I ask how the context size affects GPU memory allocation? My understanding was that the context size is just a sliding window over the context length. Is memory pre-allocated based on the context size?
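For a rough sense of scale, here is a back-of-envelope sketch (in Python) of the KV cache that gets pre-allocated for the whole context window. The Mistral-7B shape numbers below (32 layers, 8 KV heads, head dim 128) and the f16 cache type are assumptions on my part, not values read from this exact GGUF:

# Rough estimate of the pre-allocated KV cache size (assumed model shape).
n_layers   = 32     # transformer layers (assumed for Mistral-7B)
n_kv_heads = 8      # grouped-query attention KV heads (assumed)
head_dim   = 128    # dimension per attention head (assumed)
elem_bytes = 2      # f16 cache entries (assumed default cache type)
n_ctx      = 8192   # context size passed with -c

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * elem_bytes  # K and V
print(f"KV cache ~ {kv_bytes / 1024**3:.2f} GiB for n_ctx={n_ctx}")  # ~1.0 GiB

So under these assumptions the cache is reserved up front for the full window rather than grown as the context fills: roughly 1 GiB at -c 8192, and about four times that if the effective context ends up at the model's full 32768 training length, on top of the weights and compute buffers.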
What happened?
Hi there.
My llama-server works well with the following command:
However, when I keep only the -ngl parameter, my server crashes with a confusing error message. I got a CUDA error: CUBLAS_STATUS_NOT_INITIALIZED.
Maybe it is a resource issue? I am not sure, because when I try to set -ngl to 32, the server crashes with a clearer error message: "cudaMalloc failed: out of memory".
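In case it helps to narrow this down, here is what I would run right before launching to see how much VRAM is actually free (assuming the standard NVIDIA driver tools are installed):
nvidia-smi --query-gpu=memory.used,memory.total --format=csv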
Name and Version
./llama.cpp-b3985/build_gpu/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response