Sorry, I didn't understand the issue correctly. Since you have 3 GPUs with 24 GB each, there should be 24*3 = 72 GB of total memory to work with. There's probably a problem with HIPBLAS.
@renbuarl have you tried using the --flash-attn option?
Thank you! Great advice to use the '--flash-attn' option.
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn
Maximum VRAM consumption is 68.88 GB with a real context of 32k, and there is no 'CUDA error: out of memory'.
Here is what we get:
When launching llama-server without the --flash-attn option:
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99
The average VRAM consumption is 68.40 GB, but it crashes with 'CUDA error: out of memory' at a relatively small actual context.
When launching llama-server with the --flash-attn option, it works perfectly:
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn
The average VRAM consumption is 58.56 GB.
@renbuarl Forgot to mention, have you tried changing batch size and ubatch size?
The average VRAM consumption is 68.40 GB
According to my calculation above, you have 72 GB in total, so it's quite reasonable that it crashes when 68.40 GB is filled (due to overhead).
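[Editor's note: a rough per-GPU budget makes this concrete. The KV and compute buffer sizes below are taken from the log in the report further down; the ~47 GB model size and the even split across cards are assumptions, so treat this as a sketch, not an exact accounting.]
# Rough per-GPU memory budget in MiB.
MIB_PER_GIB = 1024
model_total_mib = 47 * MIB_PER_GIB      # ~47 GiB for Qwen2-72B Q4_K_M (assumed, not from the log)
model_per_gpu   = model_total_mib / 3   # assuming a roughly even split across the 3 cards
kv_per_gpu      = [3375, 3375, 3250]    # llama_kv_cache_init: ROCm0/1/2 KV buffer size (MiB)
compute_per_gpu = [4378, 4378, 4378]    # ROCm0/1/2 compute buffer size (MiB)
for i, (kv, comp) in enumerate(zip(kv_per_gpu, compute_per_gpu)):
    used = model_per_gpu + kv + comp
    print(f"GPU{i}: ~{used / MIB_PER_GIB:.1f} GiB of 24 GiB before HIP runtime overhead")
# Each card sits around 23.2-23.3 GiB, so the HIP context, allocator overhead and the
# pipeline-parallel copies (n_copies=4) can push a single device past its 24 GiB.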
No, I haven't tried that yet. Could you please provide an example?
For example: -b 1024 -ub 64
Default value:
int32_t n_batch = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_ubatch = 512; // physical batch size for prompt processing (must be >=32 to use BLAS)
The memory usage when running an LLM (in general, not just llama.cpp) consists of the model weights + KV cache + overhead for graph computation (which depends on batch size), so 68/72 GB may not be enough, since you're not counting the overhead.
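[Editor's note: to make the KV-cache term concrete, here is a sketch of the calculation. The Qwen2-72B architecture numbers (80 layers, 8 KV heads, head dim 128) are assumptions rather than values from the log, but the result matches the "KV self size = 10000.00 MiB" reported in the log further down.]
# KV cache size estimate for Qwen2-72B at a 32000-token context with an f16 cache.
n_layers   = 80      # transformer layers (assumed)
n_kv_heads = 8       # KV heads, grouped-query attention (assumed)
head_dim   = 128     # dimension per head (assumed)
bytes_f16  = 2       # f16 element size
n_ctx      = 32000   # context length (-c 32000)
# K and V are each stored per layer, per KV head, per head dimension.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_f16   # 327,680 B = 320 KiB
kv_total_mib = bytes_per_token * n_ctx / (1024 * 1024)
print(f"{kv_total_mib:.0f} MiB")   # 10000 MiB, matching the log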
@ngxson thanks, that helped too!
Here are the measurements for different options:
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 -> VRAM 68.40 GB
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn -> VRAM 58.56 GB
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 -b 1024 -ub 64 -> VRAM 59.04 GB
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 -b 1024 -ub 64 --flash-attn -> VRAM 57.12 GB
The best option is -b 1024 -ub 64 --flash-attn.
However, it seemed to me that it runs slower with -b 1024 -ub 64.
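[Editor's note: a likely reason for the slowdown, as a back-of-the-envelope sketch. Assuming the prompt is processed in chunks of the physical batch size (-ub), a smaller ubatch means more graph evaluations for the same prompt, trading prompt-processing speed for a smaller compute buffer.]
# Approximate number of prompt-processing passes: ceil(prompt_tokens / n_ubatch)
import math
prompt_tokens = 32768           # a full 32k prompt (illustrative)
for n_ubatch in (512, 64):      # default vs. the -ub 64 setting above
    passes = math.ceil(prompt_tokens / n_ubatch)
    print(f"ub={n_ubatch}: {passes} passes")
# ub=512: 64 passes; ub=64: 512 passes, i.e. 8x more graph launches,
# which is consistent with slower prompt processing at -ub 64.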
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
When using multiple AMD Radeon RX 7900 XTX (ROCm) graphics cards with different models, an out-of-memory error occurs even though the actual context is significantly smaller than the configured maximum. There is enough memory, yet the error can occur even with the llama3.1-8b model.
In the example for the Qwen2-72B-Instruct-Q4_K_M.gguf model, the error occurs when the real context is larger than 10k.
The (old) Qwen2-72B-Instruct-Q4_K_M.gguf model was chosen to rule out errors that might be specific to the new llama3.1 models; the errors are the same either way.
AMD Radeon RX 7900 XTX (24 GiB VRAM)
AMD Radeon RX 7900 XTX (24 GiB VRAM)
AMD Radeon RX 7900 XTX (24 GiB VRAM)
ROCm module version: 6.7.0
amdgpu-install_6.1.60103-1_all.deb
Model: Qwen2-72B-Instruct-Q4_K_M.gguf
I built the latest release of llama.cpp #b3488 following the methodology described in https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu (Thanks to the author!)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean && make -j4 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32000 --host '192.168.0.5' --port 8081 -ngl 99
When the real context is more than 10k, the following error occurs:
llama_new_context_with_model: n_ctx      = 32000
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  3375.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =  3375.00 MiB
llama_kv_cache_init:      ROCm2 KV buffer size =  3250.00 MiB
llama_new_context_with_model: KV self size  = 10000.00 MiB, K (f16): 5000.00 MiB, V (f16): 5000.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      ROCm0 compute buffer size =  4378.01 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =  4378.01 MiB
llama_new_context_with_model:      ROCm2 compute buffer size =  4378.02 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =   266.02 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 4
CUDA error: out of memory
  current device: 2, in function alloc at ggml/src/ggml-cuda.cu:291
  ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
ggml/src/ggml-cuda.cu:101: CUDA error
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
38      ./posix/waitpid.c: No such file or directory.
Aborted (core dumped)
Name and Version
llama.cpp$ ~/llama.cpp/llama-server --version
version: 3489 (c887d8b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output