Open mostlygeek opened 11 hours ago
This is a tricky issue and not likely to be fixed soon, but you can still use -ts to skip a GPU with -sm row.
It works. Thanks.
./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0 \
-ts 0,1,1,1 -sm row
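The key is presumably the -ts 0,1,1,1 tensor split: it gives CUDA0 a zero share, so the row split of the main model only covers CUDA1-CUDA3 (the P40s), while CUDA0 still serves the draft model via --device-draft.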
With this:
$ curl --url http://localhost:9999/v1/chat/completions \
-d '{"messages": [{"role": "system", "content": "you only write code."}, {"role": "user", "content": "write snake game in js"}], "temperature": 0.1}'
Eval speed went from 16.32 tok/sec to 30.82 tok/sec!
Name and Version
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
The new --device flag does not work with -sm row.

Devices:
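Based on the rest of the report, the system appears to have four GPUs: CUDA0 (the 3090, used for the draft model) and CUDA1-CUDA3 (the P40s, used for the main model).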
When running with this command:
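A representative invocation, assuming the same model paths and flags as the working command above, but without -sm row and without the -ts workaround, would look something like this:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0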
The main model gets split across the P40s as expected, with the draft model on the 3090. However, when adding -sm row, the main model gets split across all 4 GPUs instead of just the P40s.

First Bad Commit
Likely introduced with #10497, which introduced --device and --device-draft.
Relevant log output
No response