ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Misc. bug: -sm row does not work with --device #10533

Open · mostlygeek opened this issue 11 hours ago

mostlygeek commented 11 hours ago

Name and Version

$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4187 (be0e350c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

The new --device flag does not work with -sm row.

Devices:

$ ./llama-server --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 23892 MiB free)
  CUDA1: Tesla P40 (24438 MiB, 24290 MiB free)
  CUDA2: Tesla P40 (24438 MiB, 24290 MiB free)
  CUDA3: Tesla P40 (24438 MiB, 24290 MiB free)

When running with this command:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0

The main model gets split across the P40s as expected, and the draft model runs on the 3090. However, with -sm row added, the main model gets split across all 4 GPUs instead of just the P40s.

First Bad Commit

Likely introduced by #10497, which added --device and --device-draft.

Relevant log output

No response

slaren commented 11 hours ago

This is a tricky issue and not likely to be fixed soon, but you can still use -ts to skip a GPU with -sm row.
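A rough sketch of the workaround (only the relevant flags shown; the rest of the command stays as above, and this assumes the devices enumerate in the CUDA0..CUDA3 order from --list-devices — a 0 entry in -ts gives that GPU no share of the split):

./llama-server ... -sm row -ts 0,1,1,1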

mostlygeek commented 7 hours ago

It works. Thanks.

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 1 \
--device CUDA1,CUDA2,CUDA3 --device-draft CUDA0 \
-ts 0,1,1,1 -sm row
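Here -ts 0,1,1,1 gives CUDA0 (the 3090) a zero share of the tensor split, so with -sm row the main model lands only on the three P40s, while --device-draft CUDA0 keeps the draft model on the 3090.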

With this:

$ curl --url http://localhost:9999/v1/chat/completions \
-d '{"messages": [{"role": "system", "content": "you only write code."}, {"role": "user", "content": "write snake game in js"}], "temperature": 0.1}'

Eval speed went from 16.32 tok/sec to 30.82 tok/sec!