jphme commented 1 year ago

I own a MacBook Pro M2 with 32GB memory and am trying to run inference with a 33B model. Without Metal (that is, without the -ngl 1 flag) this works fine, and 13B models work fine both with and without Metal. There is sufficient free memory available.

Inference always fails with the error: ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
[X] I carefully followed the README.md.
[X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
[X] I reviewed the Discussions, and have a new bug or useful enhancement to share. -> same problem as https://github.com/abetlen/llama-cpp-python/discussions/361 , but no Issue here as of yet?

Expected Behavior

I own a MacBook Pro M2 with 32GB memory and am trying to run inference with a 33B model. Without Metal this works fine, and 13B models work fine both with and without Metal. There is sufficient free memory available.

Current Behavior

> llama.cpp git:(master) ./main -m ~/dev2/text-generation-webui/models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 1

main: build = 661 (fa84c4b)
main: seed  = 1686556467
llama.cpp: loading model from /Users/jp/dev2/models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0,13 MB
llama_model_load_internal: mem required  = 19756,66 MB (+ 3124,00 MB per state)
.
llama_init_from_file: kv self size  =  780,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/jp/dev2/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x139809af0
ggml_metal_init: loaded kernel_mul                            0x13980a210
ggml_metal_init: loaded kernel_mul_row                        0x138f05fa0
ggml_metal_init: loaded kernel_scale                          0x138f06430
ggml_metal_init: loaded kernel_silu                           0x13980a610
ggml_metal_init: loaded kernel_relu                           0x13980ac50
ggml_metal_init: loaded kernel_gelu                           0x138f06830
ggml_metal_init: loaded kernel_soft_max                       0x126204210
ggml_metal_init: loaded kernel_diag_mask_inf                  0x126204c70
ggml_metal_init: loaded kernel_get_rows_f16                   0x138f07030
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x138f07830
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13980b4b0
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x13980bcb0
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x138f07ed0
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x138f08690
ggml_metal_init: loaded kernel_rms_norm                       0x138f08dd0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x138f09870
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13980c750
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x13980d050
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x138f09fd0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x138f0a8d0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x13980da40
ggml_metal_init: loaded kernel_rope                           0x126205440
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x13980e6b0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x13980f130
ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/jp/dev2/models/guanaco-33B.ggmlv3.q4_0.bin'
main: error: unable to load model

Is this known/expected, and are there any workarounds? The reported "buffer maximum" of 17179869184 stays the same regardless of how much memory is free.
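
For anyone comparing their own logs against this report, here is a minimal Python sketch that pulls the requested size and the limit out of the failure line. The regular expression is written against the exact message shown above, and `parse_buffer_error` is a hypothetical helper name, not part of llama.cpp.

```python
import re

# Pattern for the failure line exactly as it appears in the logs in this thread.
PATTERN = re.compile(
    r"ggml_metal_add_buffer: buffer '(?P<name>[^']+)' size (?P<size>\d+) "
    r"is larger than buffer maximum of (?P<limit>\d+)"
)

def parse_buffer_error(line):
    """Return (name, size, limit, excess) in bytes, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    size = int(m.group("size"))
    limit = int(m.group("limit"))
    return m.group("name"), size, limit, size - limit

line = ("ggml_metal_add_buffer: buffer 'data' size 18300780544 "
        "is larger than buffer maximum of 17179869184")
name, size, limit, excess = parse_buffer_error(line)
# Report in binary units so the numbers are comparable with the limit.
print(f"{name}: {size / 2**30:.2f} GiB requested, "
      f"{limit / 2**30:.2f} GiB allowed, {excess / 2**20:.0f} MiB over")
```

Lines that do not contain the error (for example the `main: build = ...` line) simply return `None`, so the helper can be run over a whole log.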

ymcui commented 1 year ago

You might be interested to check this PR. TL;DR: It's something related to system limits on maximum buffer size. Currently no workarounds, but they are actively looking into this case.

TheBloke commented 1 year ago

FYI I have the same problem on Intel macOS with AMD 6900XT GPU.

Except in my case, it happens on all models: even q4_0 7B can't be loaded when -ngl 1 is added

tomj@Eddie ~/src/llama.cpp (master●)$ time ./main -ngl 1  -t 12 -m ~/src/huggingface/Wizard-Vicuna-7B-Uncensored-GGML/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: write a story about llamas ### Response:"
main: build = 661 (fa84c4b)
main: seed  = 1686559968
llama.cpp: loading model from /Users/tomj/src/huggingface/Wizard-Vicuna-7B-Uncensored-GGML/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  = 1024.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/tomj/src/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                         0x7f7c97805500
ggml_metal_init: loaded kernel_mul                         0x7f7c97806430
ggml_metal_init: loaded kernel_mul_row                     0x7f7c978071e0
ggml_metal_init: loaded kernel_scale                       0x7f7c97807f90
ggml_metal_init: loaded kernel_silu                        0x7f7c96f07ce0
ggml_metal_init: loaded kernel_relu                        0x7f7c96f08a10
ggml_metal_init: loaded kernel_gelu                        0x7f7c96f097c0
ggml_metal_init: loaded kernel_soft_max                    0x7f7c97808d40
ggml_metal_init: loaded kernel_diag_mask_inf               0x7f7c97809af0
ggml_metal_init: loaded kernel_get_rows_f16                0x7f7c9780a8a0
ggml_metal_init: loaded kernel_get_rows_q4_0               0x7f7c9780b650
ggml_metal_init: loaded kernel_get_rows_q4_1               0x7f7c9780c570
ggml_metal_init: loaded kernel_get_rows_q2_k               0x7f7c77806080
ggml_metal_init: loaded kernel_get_rows_q4_k               0x7f7c77806e10
ggml_metal_init: loaded kernel_get_rows_q6_k               0x7f7c77807d40
ggml_metal_init: loaded kernel_rms_norm                    0x7f7c77808af0
ggml_metal_init: loaded kernel_mul_mat_f16_f32             0x7f7c778098a0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32            0x7f7c7780a650
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32            0x7f7c7780b400
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32            0x7f7c7780c440
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32            0x7f7c7780d1f0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32            0x7f7c7780dfa0
ggml_metal_init: loaded kernel_rope                        0x7f7c7780ed50
ggml_metal_init: loaded kernel_cpy_f32_f16                 0x7f7c7780fc80
ggml_metal_init: loaded kernel_cpy_f32_f32                 0x7f7c77810a30
ggml_metal_add_buffer: buffer 'data' size 3791728640 is larger than buffer maximum of 3758096384
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/tomj/src/huggingface/Wizard-Vicuna-7B-Uncensored-GGML/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin'
main: error: unable to load model
./main -ngl 1 -t 12 -m  --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1  0.58s user 0.66s system 93% cpu 1.338 total

My system:

jphme commented 1 year ago

> You might be interested to check this PR. TL;DR: It's something related to system limits on maximum buffer size. Currently no workarounds, but they are actively looking into this case.

Ah thanks, I hadn't found this (and the discussion is apparently not about the merged PR but about a future, necessary PR). Apparently the problem is a limit at 1/2 of total memory, which would explain why everything up to 13B works and 33B (needing 20GB of 32GB) does not: https://github.com/ggerganov/llama.cpp/pull/1696#issuecomment-1579049142 .

Happy to test solutions if someone is working on this.

@TheBloke Then your error is probably unrelated. The 7B should fit into GPU memory, but I don't know anything at all about non-Apple GPUs and Metal. Thanks for your great work btw :).

ikawrakow commented 1 year ago

Does #1817 fix your issue?

jphme commented 1 year ago

> Does #1817 fix your issue?

No, it stays the same.

ymcui commented 1 year ago

> Does #1817 fix your issue?

@ikawrakow Not working in my case. It still shows a similar error message.

ggml_metal_add_buffer: buffer 'data' size 18435457024 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model 'zh-alpaca-models/33B/ggml-model-q4_0.bin'
main: error: unable to load model

ymcui commented 1 year ago

Update: Q2_K seems to be the only one that works with -ngl 1 for 33B models. Tested under M1 Max @ 32GB RAM for Chinese Alpaca 33B models.

| Model | Size | `-t 8` | `-t 8 -ngl 1` |
|---|---|---|---|
| q4_0 | 18.4 GB | 170ms/tok | failed |
| q3_K_M | 15.7 GB | 216ms/tok | not implemented |
| q2_K | 13.7 GB | 178ms/tok | 128ms/tok |

Note: Reported numbers are based on eval_time.
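
To put the q2_K row in more familiar terms, the following small Python sketch converts the two reported eval times into tokens per second and a speedup factor. It only restates the numbers from the table above; the function name is illustrative.

```python
def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds into throughput."""
    return 1000.0 / ms_per_token

# q2_K row from the table above: CPU only (-t 8) vs. Metal (-t 8 -ngl 1).
cpu_ms, metal_ms = 178.0, 128.0
speedup = cpu_ms / metal_ms

print(f"CPU:   {tokens_per_second(cpu_ms):.2f} tok/s")
print(f"Metal: {tokens_per_second(metal_ms):.2f} tok/s")
print(f"Speedup with -ngl 1: {speedup:.2f}x")
```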

ikawrakow commented 1 year ago

It looks like Apple has decided to limit the maximum length of a buffer to some fraction of the available memory. In my case, on a 64 GB M2 Max laptop, the maximum buffer length is reported as 36 GiB. The 17179869184 bytes given in your error message as maximum are exactly 16 GiB, so I guess you are running on a system with 32 GB RAM?
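
The arithmetic behind that guess can be checked with a short Python sketch. It assumes that the advertised "32 GB" and "64 GB" of RAM are really 32 GiB and 64 GiB, which is an assumption made here and not something stated in the thread; the helper names are illustrative.

```python
GIB = 2**30

def limit_in_gib(limit_bytes):
    """Convert a reported buffer maximum from bytes to GiB."""
    return limit_bytes / GIB

def fraction_of_ram(limit_bytes, ram_gib):
    """Fraction of total RAM that the buffer maximum represents.

    Assumes the advertised RAM size is in GiB (an assumption, see above).
    """
    return limit_bytes / (ram_gib * GIB)

# 17179869184 is the maximum from the error message on the 32 GB machines.
print(limit_in_gib(17179869184), fraction_of_ram(17179869184, 32))

# 36 GiB is the maximum reported on the 64 GB M2 Max.
print(limit_in_gib(36 * GIB), fraction_of_ram(36 * GIB, 64))
```

Under that assumption the two machines do not come out at the same fraction, which fits the wording "some fraction" better than a fixed one half.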

ymcui commented 1 year ago

> It looks like Apple has decided to limit the maximum length of a buffer to some fraction of the available memory. In my case, on a 64 GB M2 Max laptop, the maximum buffer length is reported as 36 GiB. The 17179869184 bytes given in your error message as maximum are exactly 16 GiB, so I guess you are running on a system with 32 GB RAM?

Yes. M1 Max with 32GB RAM. Does that mean there is no way to load a 33B q4_0 model with 32GB RAM?

ikawrakow commented 1 year ago

Looking at the code, it seems the model is being passed to Metal as a single, no-copy buffer. My guess is that the change required to split the model into 2 or more buffers to circumvent the maxBufferLength restriction is significant, but perhaps @ggerganov can chime in.

I just tried the 33B model on my laptop and it works fine:

./bin/main -m q4k_30B.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -n 256 -s 1234 -t 8 -ngl 1
main: build = 677 (a6812a1)
main: seed  = 1234
llama.cpp: loading model from q4k_30B.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 21015.59 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  = 3120.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
XXXXXXXXXXXXXXXXXXXX Device max buffer length is 36 GiB
ggml_metal_init: loading '/Users/iwan/other/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x124b05e20
ggml_metal_init: loaded kernel_mul                            0x124b06560
ggml_metal_init: loaded kernel_mul_row                        0x124b06b10
ggml_metal_init: loaded kernel_scale                          0x124b06fb0
ggml_metal_init: loaded kernel_silu                           0x124b07450
ggml_metal_init: loaded kernel_relu                           0x124b078f0
ggml_metal_init: loaded kernel_gelu                           0x124b07d90
ggml_metal_init: loaded kernel_soft_max                       0x124b083c0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x124b089a0
ggml_metal_init: loaded kernel_get_rows_f16                   0x124b08fa0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x124b095a0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x124b09d10
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x124b0a310
ggml_metal_init: loaded kernel_get_rows_q3_k                  0x124b0a910
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x124b0af10
ggml_metal_init: loaded kernel_get_rows_q5_k                  0x124b0b510
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x124b0bb10
ggml_metal_init: loaded kernel_rms_norm                       0x124b0c140
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x124b0c920
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x124b0d0f0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x124b0d750
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x124b0ddb0
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32               0x124b0e430
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x124b0ec10
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32               0x124b0f270
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x124b0f8d0
ggml_metal_init: loaded kernel_rope                           0x124b10140
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x124b10b50
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x124b11360
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 18711.91 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1280.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  3122.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 256, n_keep = 0

 I believe the meaning of life is to learn how to be happy. Happiness is a choice. It’s not something that you will fall into or that you get by accident. So, I choose to be happy!
“It took me four years to paint like Raphael, but a lifetime to paint like a child.” Pablo Picasso
I am so very excited about my newest book; it comes out August 2014! It is called “The Happy Book” and it’s all about how happiness is our birthright. As the title implies, there are no sad or scary stories in this book (which is unusual for me!)
This book begins with a little girl who is not happy because she doesn’t like rainbows, sunshine, butterflies… and other things that are supposed to make you happy. But then her grandmother tells her about the happiness seed that lives inside her heart—and the story blossoms into an inspiring and enlightening tale of how to “grow” your own happiness.
I had a wonderful time creating this book. I got so excited when one page was finished, I couldn’t wait to begin the next! It will be available online and at fine books
llama_print_timings:        load time =  1796.08 ms
llama_print_timings:      sample time =   169.78 ms /   256 runs   (    0.66 ms per token)
llama_print_timings: prompt eval time =  1274.83 ms /     8 tokens (  159.35 ms per token)
llama_print_timings:        eval time = 26968.65 ms /   255 runs   (  105.76 ms per token)
llama_print_timings:       total time = 28962.71 ms

TheBloke commented 1 year ago

> @TheBloke Then your error is probably unrelated, the 7B should fit into GPU memory, but I don't know anything at all about non-apple GPUs and metal. Thanks for your great work btw :).

You're right, I should raise a separate issue!

ggerganov commented 1 year ago

The fix to ggml_metal_add_buffer: buffer 'data' size 18435457024 is larger than buffer maximum of 17179869184 is discussed here: https://github.com/ggerganov/llama.cpp/pull/1696#issuecomment-1585667275

I think the proposed solution there has to work, but not 100% sure. I'll push a fix soon - it's just low prio atm + want to see if someone would figure it out

ikawrakow commented 1 year ago

@ggerganov Isn't the splitting somewhat tricky? I mean, we cannot just randomly split the model because some tensors may end up split in the middle, which will lead to garbage results. But if we attempt to split at tensor boundaries, those may not be page aligned. But the buffer being given to the Metal framework must be page aligned. We cannot have overlapping buffers either, which would be needed to have a tensor completely within a buffer and the buffer be page aligned, unless the ggml_metal_get_buffer() and ggml_metal_add_buffer() functions are changed to deal with overlapping buffers.

I'm asking because I thought I could do this quickly, but it turns out to be trickier than it might seem from the discussion in #1696
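
The constraints described here (no tensor split across buffers, page-aligned buffer starts, and therefore overlapping buffers) can be illustrated with a minimal Python sketch. This is not the llama.cpp implementation and not the code from any PR; `plan_views` and `find_view` are hypothetical names, tensors are modelled as `(offset, size)` pairs inside one page-aligned mapping, and the sketch assumes the maximum buffer length is a multiple of the page size.

```python
def align_down(x, page):
    return x - (x % page)

def align_up(x, page):
    return -(-x // page) * page

def plan_views(tensors, max_len, page):
    """Greedily cover tensors with page-aligned views of at most max_len bytes.

    tensors: list of (offset, size) pairs sorted by offset, sizes > 0.
    Returns a list of (start, length) views. Views may overlap, because each
    one starts at the page boundary at or before its first tensor.
    """
    if max_len % page != 0:
        raise ValueError("max_len must be a multiple of the page size")
    views = []
    i = 0
    while i < len(tensors):
        start = align_down(tensors[i][0], page)
        limit = start + max_len
        end = start
        j = i
        # Take as many whole tensors as fit before the limit.
        while j < len(tensors) and tensors[j][0] + tensors[j][1] <= limit:
            end = tensors[j][0] + tensors[j][1]
            j += 1
        if j == i:
            raise ValueError(f"tensor {i} does not fit in a single view")
        views.append((start, align_up(end, page) - start))
        i = j
    return views

def find_view(views, offset, size):
    """Return (view index, offset inside that view) for a tensor, or None."""
    for idx, (start, length) in enumerate(views):
        if start <= offset and offset + size <= start + length:
            return idx, offset - start
    return None

# Toy numbers: 4 KiB pages and an 8 KiB maximum, far smaller than real values.
tensors = [(0, 5000), (5000, 3000), (8000, 3000), (11000, 2000)]
views = plan_views(tensors, max_len=8192, page=4096)
print(views)
print(find_view(views, 8000, 3000))
```

In this toy case every view overlaps its neighbour, since tensor boundaries fall in the middle of pages. A lookup like `find_view` is the kind of change that the buffer lookup would need in order to cope with such overlaps; how the real functions should be changed is left open here.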

jacobfriedman commented 1 year ago

@kiltyj if you've made any progress, please upload it or make a branch so we can continue the work.

kiltyj commented 1 year ago

I just created #1825 to capture the code I've written so far. As I note there, buffer splitting seems to be working for smaller models, but there's still something I'm missing that is causing issues when I try to split up (e.g.) a 65B model.

I'll keep poking at this after hours as I can, but if anyone spots anything, let me know!

CyborgArmy83 commented 1 year ago

Forgive my naivety here. Is this the reason why my M2 Max 64GB system cannot load models larger than 30B and use them with GPU acceleration?

ggerganov commented 1 year ago

@CyborgArmy83 - very likely

CyborgArmy83 commented 1 year ago

That's crazy! So the whole concept of using a large amount of unified memory for GPU/Metal is flawed? Or is there something we can do? Maybe people should also send this to Apple to see if they can update something in the Metal framework, or am I missing a key piece of understanding here?

frankandrobot commented 1 year ago

Has this been fixed? Because I'm still seeing this.

llama-cpp-python         0.1.67

ggml_metal_add_buffer: buffer 'data' size 10678861824 is larger than buffer maximum of 8589934592
llama_init_from_file: failed to add buffer

when trying to load LLaMa-13B-GGML/llama-13b.ggmlv3.q6_K.bin

kiltyj commented 1 year ago

FYI, I've been hacking around with some ideas related to this issue in #2069. I don't think it's quite ready for merging / there's still a lot to figure out, but I'd be happy for more eyes/ideas.

ggerganov / llama.cpp

[METAL] GPU Inference fails due to buffer error (buffer "data" size is larger than buffer maximum) #1815
