ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: ggml_vulkan: Failed to allocate pinned memory. #9271

Closed · yurivict closed this issue 3 weeks ago

yurivict commented 2 months ago

What happened?

llama.cpp prints this error when larger models are loaded:

ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory

The complete log is:

$ llama-server -m llama-2-7b-chat.Q4_K_M.gguf --host 0.0.0.0 --port 9011
INFO [                    main] build info | tid="0x368162012000" timestamp=1725256043 build=0 commit="unknown"
INFO [                    main] system info | tid="0x368162012000" timestamp=1725256043 n_threads=4 n_threads_batch=4 total_threads=8 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
INFO [                    main] HTTP server is listening | tid="0x368162012000" timestamp=1725256043 port="9011" n_threads_http="7" hostname="0.0.0.0"
INFO [                    main] loading model | tid="0x368162012000" timestamp=1725256043 port="9011" n_threads_http="7" hostname="0.0.0.0"
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 2060 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256058 remote_addr="127.0.0.1" remote_port=40365 status=503 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256059 remote_addr="127.0.0.1" remote_port=40365 status=503 method="GET" path="/favicon.ico" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256065 remote_addr="127.0.0.1" remote_port=16106 status=503 method="GET" path="/api" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256065 remote_addr="127.0.0.1" remote_port=16106 status=503 method="GET" path="/favicon.ico" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256076 remote_addr="127.0.0.1" remote_port=24803 status=503 method="GET" path="/chat" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256076 remote_addr="127.0.0.1" remote_port=24803 status=503 method="GET" path="/favicon.ico" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256092 remote_addr="127.0.0.1" remote_port=20437 status=503 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256092 remote_addr="127.0.0.1" remote_port=20437 status=503 method="GET" path="/favicon.ico" params={}
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.24 MiB
llama_new_context_with_model: NVIDIA GeForce RTX 2060 compute buffer size =   353.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356
INFO [                    init] initializing slots | tid="0x368162012000" timestamp=1725256219 n_slots=1
INFO [                    init] new slot | tid="0x368162012000" timestamp=1725256219 id_slot=0 n_ctx_slot=4096
INFO [                    main] model loaded | tid="0x368162012000" timestamp=1725256219
INFO [                    main] chat template | tid="0x368162012000" timestamp=1725256219 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [            update_slots] all slots are idle | tid="0x368162012000" timestamp=1725256219
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256262 remote_addr="127.0.0.1" remote_port=14218 status=200 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256262 remote_addr="127.0.0.1" remote_port=14218 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256262 remote_addr="127.0.0.1" remote_port=20125 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256262 remote_addr="127.0.0.1" remote_port=14218 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256263 remote_addr="127.0.0.1" remote_port=20125 status=404 method="GET" path="/favicon.ico" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=53098 status=200 method="GET" path="/index-new.html" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=53098 status=200 method="GET" path="/style.css" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=45530 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=45530 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=53098 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=53098 status=200 method="GET" path="/system-prompts.js" params={}
INFO [      log_server_request] request | tid="0x368162a09c00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=52779 status=200 method="GET" path="/prompt-formats.js" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=45530 status=200 method="GET" path="/colorthemes.css" params={}
INFO [      log_server_request] request | tid="0x368162a0aa00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=45530 status=200 method="GET" path="/theme-snowstorm.css" params={}
INFO [      log_server_request] request | tid="0x368162a09c00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=52779 status=200 method="GET" path="/theme-polarnight.css" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=53098 status=200 method="GET" path="/theme-ketivah.css" params={}
INFO [      log_server_request] request | tid="0x368162a0a300" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=37369 status=200 method="GET" path="/theme-mangotango.css" params={}
INFO [      log_server_request] request | tid="0x368162a08e00" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=64090 status=200 method="GET" path="/theme-playground.css" params={}
INFO [      log_server_request] request | tid="0x368162a09500" timestamp=1725256270 remote_addr="127.0.0.1" remote_port=33068 status=200 method="GET" path="/theme-beeninorder.css" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="0x368162012000" timestamp=1725256293 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="0x368162012000" timestamp=1725256293 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =   37733.89 ms /    57 tokens (  662.00 ms per token,     1.51 tokens per second) | tid="0x368162012000" timestamp=1725256440 id_slot=0 id_task=0 t_prompt_processing=37733.892 n_prompt_tokens_processed=57 t_token=661.9981052631579 n_tokens_second=1.5105783416139527
INFO [           print_timings] generation eval time =  108879.04 ms /    16 runs   ( 6804.94 ms per token,     0.15 tokens per second) | tid="0x368162012000" timestamp=1725256440 id_slot=0 id_task=0 t_token_generation=108879.043 n_decoded=16 t_token=6804.9401875 n_tokens_second=0.14695206312568343
INFO [           print_timings]           total time =  146612.93 ms | tid="0x368162012000" timestamp=1725256440 id_slot=0 id_task=0 t_prompt_processing=37733.892 t_token_generation=108879.043 t_total=146612.935
INFO [            update_slots] slot released | tid="0x368162012000" timestamp=1725256440 id_slot=0 id_task=0 n_ctx=4096 n_past=72 n_system_tokens=0 n_cache_tokens=72 truncated=false
INFO [            update_slots] all slots are idle | tid="0x368162012000" timestamp=1725256440
INFO [      log_server_request] request | tid="0x368162a09500" timestamp=1725256440 remote_addr="127.0.0.1" remote_port=29397 status=200 method="POST" path="/completion" params={}
INFO [            update_slots] all slots are idle | tid="0x368162012000" timestamp=1725256440

INFO [      log_server_request] request | tid="0x368162a09500" timestamp=1725260525 remote_addr="127.0.0.1" remote_port=43916 status=200 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="0x368162a09500" timestamp=1725260526 remote_addr="127.0.0.1" remote_port=43916 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="0x368162a09500" timestamp=1725260526 remote_addr="127.0.0.1" remote_port=43916 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="0x368162a08e00" timestamp=1725260526 remote_addr="127.0.0.1" remote_port=17145 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="0x368162012000" timestamp=1725260540 id_slot=0 id_task=18
INFO [            update_slots] kv cache rm [p0, end) | tid="0x368162012000" timestamp=1725260540 id_slot=0 id_task=18 p0=47
INFO [           print_timings] prompt eval time     =   65096.22 ms /    10 tokens ( 6509.62 ms per token,     0.15 tokens per second) | tid="0x368162012000" timestamp=1725261159 id_slot=0 id_task=18 t_prompt_processing=65096.218 n_prompt_tokens_processed=10 t_token=6509.6218 n_tokens_second=0.1536187555473653
INFO [           print_timings] generation eval time =  554611.70 ms /    39 runs   (14220.81 ms per token,     0.07 tokens per second) | tid="0x368162012000" timestamp=1725261159 id_slot=0 id_task=18 t_token_generation=554611.703 n_decoded=39 t_token=14220.812897435897 n_tokens_second=0.07031946817754042
INFO [           print_timings]           total time =  619707.92 ms | tid="0x368162012000" timestamp=1725261159 id_slot=0 id_task=18 t_prompt_processing=65096.218 t_token_generation=554611.703 t_total=619707.921
INFO [      log_server_request] request | tid="0x368162a08e00" timestamp=1725261160 remote_addr="127.0.0.1" remote_port=32324 status=200 method="POST" path="/completion" params={}
INFO [            update_slots] slot released | tid="0x368162012000" timestamp=1725261160 id_slot=0 id_task=18 n_ctx=4096 n_past=95 n_system_tokens=0 n_cache_tokens=95 truncated=false
INFO [            update_slots] all slots are idle | tid="0x368162012000" timestamp=1725261160

Name and Version

$ llama-cli --version
version: 0 (unknown)
built with FreeBSD clang version 18.1.5 (https://github.com/llvm/llvm-project.git llvmorg-18.1.5-0-g617a15a9eac9) for x86_64-unknown-freebsd14.1

FreeBSD 14.1

What operating system are you seeing the problem on?

No response

Relevant log output

No response

0cc4m commented 2 months ago

Vulkan attempts to pin the allocated CPU memory (basically, allocate it in a way that lets it be transferred into VRAM without an extra memcpy to a staging buffer). This can fail if the driver doesn't provide enough host memory.

You can check how much host memory your driver provides with vulkaninfo; look at the memory heaps.
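
For example, something along these lines shows the heaps (exact section names and layout vary between vulkaninfo versions and drivers):

vulkaninfo | grep -A 6 memoryHeaps

Heaps flagged DEVICE_LOCAL are VRAM; heaps without that flag are the host memory the driver exposes, which is typically where these pinned allocations come from.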

When there is not enough memory available, the allocation fails, this error message is printed, and the code falls back to a regular CPU allocation. That makes a small difference in performance, but you can safely ignore the error.
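
For context, here is a minimal, hypothetical sketch of that allocate-pinned-then-fall-back pattern using the Vulkan C++ bindings. It is illustrative only, not the actual ggml_vulkan code; the function names and structure are made up for this example.

#include <vulkan/vulkan.hpp>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>

// Pick a host-visible + host-coherent memory type (the kind used for "pinned" buffers).
static uint32_t find_host_mem_type(vk::PhysicalDevice phys) {
    vk::PhysicalDeviceMemoryProperties props = phys.getMemoryProperties();
    const vk::MemoryPropertyFlags want =
        vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent;
    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
        if ((props.memoryTypes[i].propertyFlags & want) == want) {
            return i;
        }
    }
    throw std::runtime_error("no host-visible memory type found");
}

// Try to allocate pinned (device-mappable) host memory; fall back to plain malloc.
static void * alloc_pinned_or_plain(vk::Device dev, vk::PhysicalDevice phys, size_t size,
                                    vk::DeviceMemory & out_mem, bool & out_pinned) {
    try {
        vk::MemoryAllocateInfo info(size, find_host_mem_type(phys));
        out_mem    = dev.allocateMemory(info);   // may throw, e.g. ErrorOutOfDeviceMemory
        out_pinned = true;
        return dev.mapMemory(out_mem, 0, size);  // CPU-visible pointer into pinned memory
    } catch (const vk::SystemError & e) {
        // This is the situation behind the message in this issue: the driver could not
        // provide enough pinned host memory, so fall back to a regular allocation
        // (transfers then go through a staging buffer instead).
        fprintf(stderr, "Failed to allocate pinned memory: %s\n", e.what());
        out_pinned = false;
        return malloc(size);
    }
}

The key point is the catch branch: the error is reported, but the data still gets a regular CPU buffer, so loading continues and only transfer performance is affected.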

avdg commented 2 months ago

I had a similar issue, but it was with a model (Llama 3.1, to be exact) that accepts a large context, and I assume that large context size was also being used as the default.

All I had to do was pass --ctx_size=16384 to fix the issue (the model's default context size is much larger, and yes, 16384 is still larger than what the models above need).

Hopefully you can find other tips on how to reduce memory consumption.
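
As a back-of-the-envelope check grounded in the log above: the KV cache grows linearly with the context size. For the Llama 2 7B run shown earlier (32 layers, f16 K and V, n_embd_k_gqa = n_embd_v_gqa = 4096, n_ctx = 4096):

32 layers × 4096 tokens × (4096 + 4096) K/V values × 2 bytes (f16) = 2 GiB

which matches the reported "KV self size = 2048.00 MiB". A model whose default context window is many times larger needs correspondingly more memory for the KV cache alone (all else being equal), which is why capping the context as above helps.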

CompulsiveComputing commented 2 months ago

I tried building llama.cpp today and ended up with the same issue yuri mentioned (on Windows, 64 GB RAM and a GPU with 16 GB VRAM).

After trying to track down the issue myself, it seems related to Vulkan (sometimes?) having a per-allocation limit of 4 GB, which is what vulkaninfo reports on my machine.

It is only when the combined model + context (seemingly) comes in under 4 GB that I do not get said error.

I tried models of 2.5 GB, 5.5 GB, 12 GB, and 17 GB. Only the 2.5 GB model with a small context length of 1024 produced no memory-related errors.

Edit: now I am less sure. I can push the context up as high as 20000 without getting the memory error, as long as I specify the number of layers to offload.

These will work:
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 --gpu-layers 33
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 20000 --gpu-layers 33
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 1024
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 3072

Meanwhile these won't:
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 4096

Edit 2: It seems that explicitly telling it to offload X layers (even when that is all of the layers) and specifying the context length works, as long as the total comes to less than VRAM. I assume it makes one allocation per layer when told to offload explicitly, but I'm not sure how to check.

For instance, with a 17.5 GB model (43 offload-able layers) on a 16 GB card, these work:
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 30 -c 4096
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 38 -c 512

These fail:
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 43 -c 512
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 38 -c 1024
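
For what it's worth, the 4 GB per-allocation cap mentioned above corresponds to Vulkan's maxMemoryAllocationSize property (the number vulkaninfo shows under VkPhysicalDeviceMaintenance3Properties). Below is a small stand-alone sketch to query it, assuming a Vulkan 1.1 capable loader and the C++ bindings; it is just an illustration, not part of llama.cpp.

// build e.g.: c++ -std=c++17 alloc-limit-check.cpp -lvulkan
#include <vulkan/vulkan.hpp>
#include <cstdio>

int main() {
    // Create a minimal Vulkan 1.1 instance just to query device properties.
    vk::ApplicationInfo app("alloc-limit-check", 1, nullptr, 0, VK_API_VERSION_1_1);
    vk::UniqueInstance instance = vk::createInstanceUnique(vk::InstanceCreateInfo({}, &app));

    for (vk::PhysicalDevice dev : instance->enumeratePhysicalDevices()) {
        // Chain the Maintenance3 properties onto the base properties query.
        auto chain = dev.getProperties2<vk::PhysicalDeviceProperties2,
                                        vk::PhysicalDeviceMaintenance3Properties>();
        const auto & base = chain.get<vk::PhysicalDeviceProperties2>().properties;
        const auto & m3   = chain.get<vk::PhysicalDeviceMaintenance3Properties>();
        printf("%s: maxMemoryAllocationSize = %llu MiB\n",
               base.deviceName.data(),
               (unsigned long long)(m3.maxMemoryAllocationSize / (1024ull * 1024ull)));
    }
    return 0;
}

Offloading layer by layer keeps each individual allocation smaller, which would be consistent with the behaviour observed above.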

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.