Assuming you use the same command as I do, sudo sysctl iogpu.wired_limit_mb=29500
(with your specific number), you have to re-run it every time after a reboot; it does not persist.
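If you want the limit to come back automatically after a reboot, one possible workaround (a sketch only, not something tested in this thread; the label, filename, and the 29500 value below are examples you would replace with your own) is a small LaunchDaemon that re-runs the sysctl at boot. Save something like this as /Library/LaunchDaemons/com.example.iogpu-wired-limit.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Unique identifier for this job; the name is just an example -->
  <key>Label</key>
  <string>com.example.iogpu-wired-limit</string>
  <!-- Re-apply the wired memory limit at boot; use your own value -->
  <key>ProgramArguments</key>
  <array>
    <string>/usr/sbin/sysctl</string>
    <string>iogpu.wired_limit_mb=29500</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>

Then load it once with sudo launchctl load /Library/LaunchDaemons/com.example.iogpu-wired-limit.plist (or just reboot), and you can verify the current value at any time with sysctl iogpu.wired_limit_mb.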
Heh, I might not reboot as often as I should... it's a headless Mac that I use as a server in my house, so it can go more than a week or two without a reboot. I know I should reboot more often, but I honestly haven't experienced performance issues from leaving it up that long.
Just to confirm, though, I did just now reboot it and you are correct: I'm back to 147GB.
Also to confirm, I retried the scenario described in the ticket above, and the issue persists even with the original working set size.
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32764
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 140
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32764
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 120.32 B
llm_load_print_meta: model size = 119.06 GiB (8.50 BPW)
llm_load_print_meta: general.name = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110508.00 MiB, (110508.38 / 147456.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 476.00 MiB, (110984.38 / 147456.00)
/bin/sh: line 1: 822 Segmentation fault: 11
Note: This error does not occur in Koboldcpp version 1.58, which was showing as 36 commits behind llama.cpp the last time I looked. With that version I am able to load all layers of models up to 155b (when using the command to increase the VRAM limit to 170GB) without issue.
My Issue:
Up until today, I was using a version of Oobabooga from mid-December and was successfully running 120b models with full Metal offload. Today I updated Oobabooga to the latest version, and with it came a newer version of Llama.cpp.
Up until now, Llama.cpp on the Mac treated ngl as essentially a binary switch: 0 for off, 1 for on. This version now respects the ngl value fully, so a 120b model's 141 layers can be offloaded individually on the Mac.
On the previous version of Llama.cpp, and every version before it, I've been able to load a 120b completely into the Metal working space without issue. As of some version released since mid-December, I am unable to push Metal memory usage past ~110GB.
Something odd happens when I attempt to offload more layers after hitting 110GB of usage: the console output looks like it spits out an extra Metal buffer line and then crashes. On a 120b model, this means the cut-off is 127 layers. If I go past that to 128, it crashes. I have tried going up to 141 and even 256 layers; same result.
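For anyone who wants to try reproducing this outside of Oobabooga, a roughly equivalent standalone llama.cpp invocation would look something like the sketch below (the model path and context size are placeholders for illustration, not the exact command from this report):

# 127 offloaded layers stays under the ~110GB Metal allocation described above
./main -m ./models/example-120b.Q8_0.gguf -c 4096 -ngl 127

# 128 offloaded layers pushes past it and crashes during buffer allocation
./main -m ./models/example-120b.Q8_0.gguf -c 4096 -ngl 128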
Example of Successful Load on older Llama.cpp from mid-December, with ngl set to 1
Example of Failed Load on new Llama.cpp, offloading 128 out of 141 layers.
Example of Successful Load on new Llama.cpp, offloading 127 out of 141 layers.