ggerganov / llama.cpp


Segmentation Fault 11 on M2 Ultra 192GB when offloading more than 110GB into Metal #5541

Closed SomeOddCodeGuy closed 5 months ago

SomeOddCodeGuy commented 7 months ago

My Issue:

Up until today, I was using a version of Oobabooga from mid-December, and was successfully running 120b models with full metal offload. Today I updated Oobabooga to the latest version, and with it came a newer version of Llama.cpp.

Up until now, Llama.cpp on the Mac treated ngl as effectively binary: 0 for off, 1 for on. This version respects the ngl value fully, so a 120b model can now manually offload up to 141 layers on the Mac.

On the previous version of Llama.cpp, and all versions up until now, I've been able to load a 120b completely into the metal working space without issue. As of some recent version released since mid-December, I am now unable to increase the Metal memory usage past ~110GB.

Something odd happens when I attempt to offload more layers after hitting 110GB of usage: the console output spits out one more Metal buffer line and then crashes. On a 120b model, this means the cut-off is 127 layers; if I go past that to 128, it crashes. I have attempted going up to 141 and even 256 layers, with the same result.
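
For context, here is a rough sketch of what the newer per-layer offload behavior corresponds to at llama.cpp's C API level. This is just an illustration, not the exact code path Oobabooga uses, and the model path is a placeholder:

// Rough sketch (not the actual Oobabooga code path): load a GGUF model with
// an explicit number of layers offloaded to the GPU via llama.cpp's C API.
// The model path is a placeholder for the 120b Q8_0 file described above.
#include <stdio.h>
#include "llama.h"

int main(void) {
    struct llama_model_params mparams = llama_model_default_params();

    // On this machine, 127 offloaded layers loads fine; 128 or more crosses
    // the ~110GB Metal allocation and segfaults, as shown in the logs below.
    mparams.n_gpu_layers = 127;

    struct llama_model * model =
        llama_load_model_from_file("models/120b-q8_0.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_free_model(model);
    return 0;
}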

Example of Successful Load on older Llama.cpp from mid-December, with ngl set to 1

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 121920.51 MiB
llm_load_tensors: mem required  = 121920.51 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  = 8960.00 MiB, K (f16): 4480.00 MiB, V (f16): 4480.00 MiB
llama_build_graph: non-view tensors processed: 2944/2944
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/user/text-generation-webui-main-2/installer_files/env/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 178257.92 MB

Example of Failed Load on new Llama.cpp, offloading 128 out of 141 layers.

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110508.00 MiB, (110508.06 / 170000.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   476.00 MiB, (110984.06 / 170000.00)
/bin/sh: line 1: 77807 Segmentation fault: 11  

Example of Successful Load on new Llama.cpp, offloading 127 out of 141 layers.

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110116.94 MiB, (110117.00 / 170000.00)
llm_load_tensors: offloading 127 repeating layers to GPU
llm_load_tensors: offloaded 127/141 layers to GPU
llm_load_tensors:        CPU buffer size = 11803.09 MiB
llm_load_tensors:      Metal buffer size = 110116.94 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/user/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 178257.92 MB
dreambottle commented 7 months ago

Assuming you use the same command as I do, sudo sysctl iogpu.wired_limit_mb=29500 (with your specific number), you have to run it again after every reboot; it does not persist.
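
If you'd rather check the current value from code than from the shell, here is a rough sketch using sysctlbyname() on macOS. I'm assuming the sysctl is exposed as a 64-bit integer, and setting it (like the sudo command above) still requires root:

// Rough sketch: read the current iogpu.wired_limit_mb value on macOS via
// sysctlbyname(). Assumes the value is a 64-bit integer; adjust if not.
#include <stdio.h>
#include <stdint.h>
#include <sys/sysctl.h>

int main(void) {
    int64_t limit_mb = 0;
    size_t  len      = sizeof(limit_mb);

    if (sysctlbyname("iogpu.wired_limit_mb", &limit_mb, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }

    // A value of 0 means macOS falls back to its default Metal working set
    // limit (roughly 75% of unified memory, i.e. the 147456 MiB figure that
    // shows up in the logs after a reboot).
    printf("iogpu.wired_limit_mb = %lld\n", (long long)limit_mb);
    return 0;
}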

SomeOddCodeGuy commented 7 months ago

heh, I might not reboot as often as I should... it's a headless Mac that I use as a server in my house, so it can go more than a week or two without a reboot. I know I should reboot more often, but I honestly haven't experienced performance issues from skipping it.

Just to confirm, though, I did just now reboot it and you are correct: I'm back to 147GB.

Also to confirm, I retried the scenario described in the ticket above, and the issue persists even with the original working set size.

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110508.00 MiB, (110508.38 / 147456.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   476.00 MiB, (110984.38 / 147456.00)
/bin/sh: line 1:   822 Segmentation fault: 11 
SomeOddCodeGuy commented 7 months ago

Note: This error does not occur in Koboldcpp version 1.58, which was 36 commits behind llama.cpp the last time I looked. There I am able to load the full set of layers on models up to 155b (when using the command to increase the VRAM limit to 170GB) without issue.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.