Hm, the option should already be supported by the batched-bench
example. What makes you think it does not work?
With --no-kv-offload specified, I see the following memory-related prints in the log:
#cmd: ./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 8192 -ub 8192 -c 8192 -npp 256 -ntg 256 -npl 16 -ngl 0 -t 16 --no-kv-offload
Device 0: NVIDIA GH200 120GB, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1230.54 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1072.09 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 196
There is no notable difference in the memory allocation (in particular, the reported CUDA compute buffer size) compared to the default run:
#cmd: ./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 8192 -ub 8192 -c 8192 -npp 256 -ntg 256 -npl 16 -ngl 0 -t 16
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1230.54 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1072.09 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 196
This makes me believe that the flag is not actually changing the memory allocation policy for the KV cache. Furthermore, sweeping different CPU/GPU offloading configurations on a GH200 shows no perceivable difference (especially with 0 GPU layers offloaded, where CPU-local memory should outperform GPU memory), even though the GPU and CPU memories have distinct performance characteristics.
Additionally, params.no_kv_offload is not being parsed in batched-bench.cpp like it is in llama-bench.cpp.
The no_kv_offload option is parsed in batched-bench.cpp indirectly, via the call to llama_context_params_from_gpt_params():
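For context, here is a paraphrased sketch of that plumbing (not a verbatim excerpt; exact field and function names can differ between llama.cpp versions):

```cpp
// Paraphrased sketch, not a verbatim excerpt: names may differ between versions.

// common/common.cpp: --no-kv-offload sets params.no_kv_offload, and the helper
// translates it into the context parameter that controls KV/KQV offloading.
struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params) {
    llama_context_params cparams = llama_context_default_params();
    // ...
    cparams.offload_kqv = !params.no_kv_offload; // the flag ends up here
    // ...
    return cparams;
}

// examples/batched-bench/batched-bench.cpp: the example builds its context
// parameters through the same helper, so it picks up the flag without any
// batched-bench-specific parsing code.
llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
```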
AFAIK, the KV cache is allocated in "pinned" host memory; this is what the CUDA_Host KV buffer conveys. Such pinned memory supports very fast asynchronous memory transfers. This is always done when CUDA is enabled, regardless of whether the KV cache is offloaded or how many layers are offloaded.
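To illustrate why pinned (page-locked) host memory matters, here is a minimal standalone sketch using the plain CUDA runtime API (illustrative only, not llama.cpp code): copies from a cudaMallocHost allocation can overlap with other work on a stream, which is what makes a host-resident KV cache practical.

```cpp
// Minimal illustration of pinned host memory + async transfer.
// Compile with, e.g.: nvcc pinned.cu -o pinned
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t size = 64 * 1024 * 1024; // 64 MiB

    float * h_pinned = nullptr;
    float * d_buf    = nullptr;
    cudaMallocHost(&h_pinned, size); // page-locked ("pinned") host allocation
    cudaMalloc(&d_buf, size);        // device allocation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous H2D copy: truly asynchronous only from pinned memory.
    cudaMemcpyAsync(d_buf, h_pinned, size, cudaMemcpyHostToDevice, stream);

    // ... kernels operating on d_buf could be enqueued on the same stream ...

    cudaStreamSynchronize(stream);
    printf("async copy of %zu bytes completed\n", size);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```

With pageable (regular malloc) host memory, the same cudaMemcpyAsync call falls back to a staged, effectively synchronous copy.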
It is normal for the CUDA compute buffer not to be affected. It indicates the extra VRAM necessary to store intermediate results during graph computation, so it is not related to the amount of memory required by the model weights or the KV cache.
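As a rough illustration of that point, the sketch below (assuming a recent ggml checkout; the ggml_gallocr_* and buffer-type APIs may differ between versions) reserves a compute buffer for a toy graph: its size is derived from the graph's intermediate tensors, not from where the weights or the KV cache live.

```cpp
// Standalone ggml sketch (assumed API of a recent ggml/llama.cpp checkout).
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <cstdio>

int main() {
    // no_alloc = true: tensors are metadata only, data will live in backend buffers
    ggml_init_params ip = { /*.mem_size   =*/ 16 * 1024 * 1024,
                            /*.mem_buffer =*/ nullptr,
                            /*.no_alloc   =*/ true };
    ggml_context * ctx = ggml_init(ip);

    // a toy graph: one matrix multiplication followed by a GELU
    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
    ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 512);
    ggml_tensor * c = ggml_gelu(ctx, ggml_mul_mat(ctx, a, b));

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // reserve a compute buffer for the graph on the CPU backend and report its size;
    // this is the analogue of the "compute buffer size" lines in the llama.cpp log
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
    ggml_gallocr_reserve(galloc, gf);
    printf("compute buffer size = %zu bytes\n", ggml_gallocr_get_buffer_size(galloc, 0));

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    return 0;
}
```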
To inspect which operations are performed on the CPU and which on the GPU, with and without the -nkvo argument, you can set the GGML_SCHED_DEBUG environment variable and compare the resulting logs, e.g.:
GGML_SCHED_DEBUG=1 ./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf
In your specific case, you probably don't observe much difference in performance because this looks like a Q4_0_4_8 quantization, which is currently only implemented for Arm CPUs. So regardless of how many layers or how much of the KV cache you offload, a huge part of the computation runs on the CPU.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Feature Description
The KV buffers are always allocated on the GPU by default (even when no layers are offloaded, i.e. -ngl 0). This can be disabled with the --no-kv-offload option as discussed here, but this option is currently not implemented in batched-bench.cpp. Doing so would help with profiling without the use of the GPU-side buffers.
Motivation
llama-bench already supports this feature. Keeping llama-batched-bench consistent with it would improve the utility of batched-bench on heterogeneous systems.
Possible Implementation
No response