What happened?

2. Logs of rpc-server

```bash
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
create_backend: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 154618.82 MB
Starting RPC server on 0.0.0.0:50051, backend memory: 154000 MB
```
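For context, the server above was presumably launched along these lines; the host and port match the log, but the exact flags are an assumption based on the rpc-server example, not taken from this report:

```bash
# Assumed launch command (not shown in the original report):
# -H binds the listen address (0.0.0.0 here, which triggers the warning above),
# -p sets the port shown in "Starting RPC server on 0.0.0.0:50051".
./build-rpc/bin/rpc-server -H 0.0.0.0 -p 50051
```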
3. Logs of llama-cli
```bash
(base) ➜ build-rpc ~/Softwares/llama.cpp-b3720/build-rpc/bin/llama-cli -m ~/Softwares/llm-models/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/Hermes-3-Llama-3.1-8B.Q8_0.gguf -p "Hello ,my name is" --repeat-penalty 1.0 -n 64 --rpc 169.254.151.33:50051 -ngl 99
...
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 532.31 MiB
llm_load_tensors: RPC[169.254.151.33:50051] buffer size = 7605.34 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 154618.82 MB
llama_kv_cache_init: RPC[169.254.151.33:50051] KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_backend_metal_buffer_type_alloc_buffer: error: failed to allocate buffer, size = 0.00 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 0
llama_new_context_with_model: failed to allocate compute buffers
ggml_metal_free: deallocating
llama_init_from_gpt_params: error: failed to create context with model '/Users/mac527a/Softwares/llm-models/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/Hermes-3-Llama-3.1-8B.Q8_0.gguf'
main: error: unable to load model
```
Further information:
When I remove '--rpc 169.254.151.33:50051' or set '-ngl 0', llama-cli runs correctly and prints the generated text (see the reconstructed commands below).
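For comparison, these are the two working variants, reconstructed from the failing command above (same model, prompt, and sampling flags; only the RPC/offload options differ):

```bash
# Works: RPC flag removed, all layers offloaded to the local Metal GPU
./bin/llama-cli -m ~/Softwares/llm-models/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/Hermes-3-Llama-3.1-8B.Q8_0.gguf \
  -p "Hello ,my name is" --repeat-penalty 1.0 -n 64 -ngl 99

# Works: RPC backend kept, but no layers offloaded (-ngl 0)
./bin/llama-cli -m ~/Softwares/llm-models/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/Hermes-3-Llama-3.1-8B.Q8_0.gguf \
  -p "Hello ,my name is" --repeat-penalty 1.0 -n 64 --rpc 169.254.151.33:50051 -ngl 0
```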
Name and Version

```bash
(base) ➜ build-rpc ./bin/llama-cli --version
version: 0 (unknown)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
```
llama-cli was built from the source code of llama.cpp b3720.
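For reference, a minimal sketch of how the RPC-enabled build was presumably produced; the CMake flags are an assumption based on the b3720 build options, not taken from this report (Metal is enabled by default on Apple silicon):

```bash
# Assumed build steps for llama.cpp b3720 with the RPC backend enabled;
# the build directory name matches the one used in the logs above.
cmake -B build-rpc -DGGML_RPC=ON
cmake --build build-rpc --config Release
```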
What operating system are you seeing the problem on?

Mac
Relevant log output
No response