Hm, the option should already be supported by the batched-bench
example. What makes you think it does not work?
With --no-kv-offload specified, I see the following memory-related prints in the log:
#cmd: ./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 8192 -ub 8192 -c 8192 -npp 256 -ntg 256 -npl 16 -ngl 0 -t 16 --no-kv-offload
Device 0: NVIDIA GH200 120GB, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1230.54 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1072.09 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 196
There is no notable difference in the memory allocation (in particular, the reported CUDA compute buffer size) compared to the default run:
#cmd: ./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 8192 -ub 8192 -c 8192 -npp 256 -ntg 256 -npl 16 -ngl 0 -t 16
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1230.54 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1072.09 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 196
This makes me believe that the flag is not actually changing the memory allocation policy for the KV cache. Furthermore, sweeping different CPU/GPU offloading configurations on a GH200 shows no perceivable difference (especially with 0 GPU layers offloaded, where CPU-local memory should outperform GPU memory), even though the GPU and CPU memories have distinct performance characteristics.
Additionally, params.no_kv_offload is not being parsed in batched-bench.cpp like it is in llama-bench.cpp.
The no_kv_offload option is parsed in batched-bench.cpp indirectly, via the call to llama_context_params_from_gpt_params():
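For context, here is a paraphrased sketch of that plumbing (not a verbatim excerpt; exact field and function names can differ between llama.cpp versions):

```cpp
// Paraphrased sketch, not a verbatim excerpt: names may differ between versions.

// common/common.cpp: --no-kv-offload sets params.no_kv_offload, and the helper
// translates it into the context parameter that controls KV/KQV offloading.
struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params) {
    llama_context_params cparams = llama_context_default_params();
    // ...
    cparams.offload_kqv = !params.no_kv_offload; // the flag ends up here
    // ...
    return cparams;
}

// examples/batched-bench/batched-bench.cpp: the example builds its context
// parameters through the same helper, so it picks up the flag without any
// batched-bench-specific parsing code.
llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
```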
AFAIK, the KV cache is allocated in "pinned" host memory; this is what the CUDA_Host KV buffer conveys. Such pinned memory supports very fast asynchronous memory transfers. This is always done when CUDA is enabled, regardless of whether the KV cache is offloaded or how many layers are offloaded.
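To illustrate why pinned (page-locked) host memory matters, here is a minimal standalone sketch using the plain CUDA runtime API (illustrative only, not llama.cpp code): copies from a cudaMallocHost allocation can overlap with other work on a stream, which is what makes a host-resident KV cache practical.

```cpp
// Minimal illustration of pinned host memory + async transfer.
// Compile with, e.g.: nvcc pinned.cu -o pinned
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t size = 64 * 1024 * 1024; // 64 MiB

    float * h_pinned = nullptr;
    float * d_buf    = nullptr;
    cudaMallocHost(&h_pinned, size); // page-locked ("pinned") host allocation
    cudaMalloc(&d_buf, size);        // device allocation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous H2D copy: truly asynchronous only from pinned memory.
    cudaMemcpyAsync(d_buf, h_pinned, size, cudaMemcpyHostToDevice, stream);

    // ... kernels operating on d_buf could be enqueued on the same stream ...

    cudaStreamSynchronize(stream);
    printf("async copy of %zu bytes completed\n", size);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```

With pageable (regular malloc) host memory, the same cudaMemcpyAsync call falls back to a staged, effectively synchronous copy.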
It is normal for the CUDA compute buffer not to be affected. It indicates the extra VRAM necessary to store intermediate results during graph computation, so it is not related to the amount of memory required by the model weights or the KV cache.
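As a rough illustration of that point, the sketch below (assuming a recent ggml checkout; the ggml_gallocr_* and buffer-type APIs may differ between versions) reserves a compute buffer for a toy graph: its size is derived from the graph's intermediate tensors, not from where the weights or the KV cache live.

```cpp
// Standalone ggml sketch (assumed API of a recent ggml/llama.cpp checkout).
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <cstdio>

int main() {
    // no_alloc = true: tensors are metadata only, data will live in backend buffers
    ggml_init_params ip = { /*.mem_size   =*/ 16 * 1024 * 1024,
                            /*.mem_buffer =*/ nullptr,
                            /*.no_alloc   =*/ true };
    ggml_context * ctx = ggml_init(ip);

    // a toy graph: one matrix multiplication followed by a GELU
    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
    ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 512);
    ggml_tensor * c = ggml_gelu(ctx, ggml_mul_mat(ctx, a, b));

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // reserve a compute buffer for the graph on the CPU backend and report its size;
    // this is the analogue of the "compute buffer size" lines in the llama.cpp log
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
    ggml_gallocr_reserve(galloc, gf);
    printf("compute buffer size = %zu bytes\n", ggml_gallocr_get_buffer_size(galloc, 0));

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    return 0;
}
```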
To inspect which operations are performed on the CPU and which on the GPU, with and without the -nkvo argument, you can set the GGML_SCHED_DEBUG environment variable and compare the resulting logs, e.g.:
GGML_SCHED_DEBUG=1 ./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf
In your specific case, you probably don't observe much difference in performance because this looks like a Q4_0_4_8 quantization, which is currently only implemented for Arm CPUs. So regardless of how many layers or how much of the KV cache you offload, a huge part of the computation runs on the CPU.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Feature Description
The KV buffers are always allocated on the GPU by default (even when no layers are offloaded, i.e. -ngl 0). This can be disabled with the --no-kv-offload option as discussed here, but this option is currently not implemented in batched-bench.cpp. Doing so would help with profiling without the use of the GPU-side buffers.
Motivation
llama-bench already supports this feature. Keeping llama-batched-bench consistent with it would improve the utility of batched-bench on heterogeneous systems.
Possible Implementation
No response