ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Out of Memory Error when using Qwen2-72B-Instruct-Q4_K_M.gguf model with ROCm with num_ctx=32000 #8766

Closed: renbuarl closed this issue 1 month ago

renbuarl commented 3 months ago

What happened?

When running models across multiple AMD Radeon RX 7900 XTX (ROCm) graphics cards, an out-of-memory error occurs even though the context actually in use is significantly smaller than the configured maximum. There appears to be enough VRAM; the same error can also occur with the llama3.1-8b model.

In the example below, with the Qwen2-72B-Instruct-Q4_K_M.gguf model, the error occurs once the actual context grows beyond about 10k tokens.

The (older) Qwen2-72B-Instruct-Q4_K_M.gguf model was chosen to rule out issues specific to the new llama3.1 models, which show the same behavior.

3x AMD Radeon RX 7900 XTX (24 GiB VRAM each)
ROCm module version: 6.7.0
amdgpu-install_6.1.60103-1_all.deb

Model: Qwen2-72B-Instruct-Q4_K_M.gguf

I built the latest release of llama.cpp, b3488, following the method described in https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu (thanks to the author!):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean && make -j4 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32000 --host '192.168.0.5' --port 8081 -ngl 99

When the actual context grows beyond roughly 10k tokens, the following error occurs:

llama_new_context_with_model: n_ctx = 32000
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 3375.00 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 3375.00 MiB
llama_kv_cache_init: ROCm2 KV buffer size = 3250.00 MiB
llama_new_context_with_model: KV self size = 10000.00 MiB, K (f16): 5000.00 MiB, V (f16): 5000.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: ROCm0 compute buffer size = 4378.01 MiB
llama_new_context_with_model: ROCm1 compute buffer size = 4378.01 MiB
llama_new_context_with_model: ROCm2 compute buffer size = 4378.02 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 266.02 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 4
CUDA error: out of memory
  current device: 2, in function alloc at ggml/src/ggml-cuda.cu:291
  ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
ggml/src/ggml-cuda.cu:101: CUDA error
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
38      ./posix/waitpid.c: No such file or directory.
Aborted (core dumped)

Name and Version

llama.cpp$ ~/llama.cpp/llama-server --version
version: 3489 (c887d8b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

2731-0574. S2CID 259713140. Retrieved 2 July 2023.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
INFO [            update_slots] kv cache rm [p0, end) | tid="133749754233792" timestamp=1722330273 id_slot=0 id_task=0 p0=0
VERB [            update_slots] prompt processing progress | tid="133749754233792" timestamp=1722330273 id_slot=0 n_past=2048 n_ctx=32000 n_tokens=2048 progress=0.08170104026794434
VERB [            update_slots] decoding batch | tid="133749754233792" timestamp=1722330273 n_tokens=2048
VERB [            update_slots] run slots completed | tid="133749754233792" timestamp=1722330278
VERB [              start_loop] wait for new task | tid="133749754233792" timestamp=1722330278
VERB [              start_loop] new task may arrive | tid="133749754233792" timestamp=1722330278
VERB [              start_loop] callback_new_task | tid="133749754233792" timestamp=1722330278 id_task=1
VERB [              start_loop] update_multitasks | tid="133749754233792" timestamp=1722330278
VERB [              start_loop] callback_update_slots | tid="133749754233792" timestamp=1722330278
VERB [            update_slots] posting NEXT_RESPONSE | tid="133749754233792" timestamp=1722330278
VERB [                    post] new task id | tid="133749754233792" timestamp=1722330278 new_id=2
INFO [            update_slots] kv cache rm [p0, end) | tid="133749754233792" timestamp=1722330278 id_slot=0 id_task=0 p0=2048
VERB [            update_slots] prompt processing progress | tid="133749754233792" timestamp=1722330278 id_slot=0 n_past=4096 n_ctx=32000 n_tokens=2048 progress=0.16340208053588867
VERB [            update_slots] decoding batch | tid="133749754233792" timestamp=1722330278 n_tokens=2048
VERB [            update_slots] run slots completed | tid="133749754233792" timestamp=1722330283
VERB [              start_loop] wait for new task | tid="133749754233792" timestamp=1722330283
VERB [              start_loop] new task may arrive | tid="133749754233792" timestamp=1722330283
VERB [              start_loop] callback_new_task | tid="133749754233792" timestamp=1722330283 id_task=2
VERB [              start_loop] update_multitasks | tid="133749754233792" timestamp=1722330283
VERB [              start_loop] callback_update_slots | tid="133749754233792" timestamp=1722330283
VERB [            update_slots] posting NEXT_RESPONSE | tid="133749754233792" timestamp=1722330283
VERB [                    post] new task id | tid="133749754233792" timestamp=1722330283 new_id=3
INFO [            update_slots] kv cache rm [p0, end) | tid="133749754233792" timestamp=1722330283 id_slot=0 id_task=0 p0=4096
VERB [            update_slots] prompt processing progress | tid="133749754233792" timestamp=1722330283 id_slot=0 n_past=6144 n_ctx=32000 n_tokens=2048 progress=0.245103120803833
VERB [            update_slots] decoding batch | tid="133749754233792" timestamp=1722330283 n_tokens=2048
VERB [            update_slots] run slots completed | tid="133749754233792" timestamp=1722330289
VERB [              start_loop] wait for new task | tid="133749754233792" timestamp=1722330289
VERB [              start_loop] new task may arrive | tid="133749754233792" timestamp=1722330289
VERB [              start_loop] callback_new_task | tid="133749754233792" timestamp=1722330289 id_task=3
VERB [              start_loop] update_multitasks | tid="133749754233792" timestamp=1722330289
VERB [              start_loop] callback_update_slots | tid="133749754233792" timestamp=1722330289
VERB [            update_slots] posting NEXT_RESPONSE | tid="133749754233792" timestamp=1722330289
VERB [                    post] new task id | tid="133749754233792" timestamp=1722330289 new_id=4
INFO [            update_slots] kv cache rm [p0, end) | tid="133749754233792" timestamp=1722330289 id_slot=0 id_task=0 p0=6144
VERB [            update_slots] prompt processing progress | tid="133749754233792" timestamp=1722330289 id_slot=0 n_past=8192 n_ctx=32000 n_tokens=2048 progress=0.32680416107177734
VERB [            update_slots] decoding batch | tid="133749754233792" timestamp=1722330289 n_tokens=2048
VERB [            update_slots] run slots completed | tid="133749754233792" timestamp=1722330296
VERB [              start_loop] wait for new task | tid="133749754233792" timestamp=1722330296
VERB [              start_loop] new task may arrive | tid="133749754233792" timestamp=1722330296
VERB [              start_loop] callback_new_task | tid="133749754233792" timestamp=1722330296 id_task=4
VERB [              start_loop] update_multitasks | tid="133749754233792" timestamp=1722330296
VERB [              start_loop] callback_update_slots | tid="133749754233792" timestamp=1722330296
VERB [            update_slots] posting NEXT_RESPONSE | tid="133749754233792" timestamp=1722330296
VERB [                    post] new task id | tid="133749754233792" timestamp=1722330296 new_id=5
INFO [            update_slots] kv cache rm [p0, end) | tid="133749754233792" timestamp=1722330296 id_slot=0 id_task=0 p0=8192
VERB [            update_slots] prompt processing progress | tid="133749754233792" timestamp=1722330296 id_slot=0 n_past=10240 n_ctx=32000 n_tokens=2048 progress=0.4085052013397217
VERB [            update_slots] decoding batch | tid="133749754233792" timestamp=1722330296 n_tokens=2048
[New LWP 4093]
[New LWP 4194]
[New LWP 4195]
[New LWP 4196]
[New LWP 4197]
[New LWP 4198]
[New LWP 4199]
[New LWP 4200]
[New LWP 4201]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000079a4d5eea42f in __GI___wait4 (pid=4212, stat_loc=stat_loc@entry=0x7ffc150bb744, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
#0  0x000079a4d5eea42f in __GI___wait4 (pid=4212, stat_loc=stat_loc@entry=0x7ffc150bb744, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30  in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000079a4d5eea3ab in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7ffc150bb744, options=options@entry=0) at ./posix/waitpid.c:38
#2  0x000063a052aad41a in ggml_print_backtrace () at ggml/src/ggml.c:179
179         waitpid(pid, &wstatus, 0);
#3  ggml_abort (file=0x63a052cedc25 "ggml/src/ggml-cuda.cu", line=101, fmt=0x63a052cedc3b "CUDA error") at ggml/src/ggml.c:206
206     ggml_print_backtrace();
#4  0x000063a05292d302 in ggml_cuda_error (stmt=0x63a052cede5a "ggml_cuda_device_malloc(&ptr, look_ahead_size, device)", func=0x63a052cee58b "alloc", file=0x63a052cedc25 "ggml/src/ggml-cuda.cu", line=line@entry=291, msg=0x79a50ad9c9d2 "out of memory") at ggml/src/ggml-cuda.cu:101
101     GGML_ABORT("CUDA error");
#5  0x000063a0529414eb in ggml_cuda_pool_leg::alloc (this=0x63a09ae7a9b0, size=<optimized out>, actual_size=0x7ffc150bba08) at ggml/src/ggml-cuda.cu:291
291         CUDA_CHECK(ggml_cuda_device_malloc(&ptr, look_ahead_size, device));
#6  0x000063a05293fe02 in ggml_cuda_pool_alloc<__half>::alloc (this=0x7ffc150bb9f8, size=301989888) at ggml/src/ggml-cuda/common.cuh:534
534         ptr = (T *) pool->alloc(size * sizeof(T), &this->actual_size);
#7  ggml_cuda_mul_mat_batched_cublas (ctx=..., src0=0x63a057b4cd60, src1=0x63a057b4cbf0, dst=0x63a057b4ced0) at ggml/src/ggml-cuda.cu:1790
1790            dst_t = (char *) dst_f16.alloc(ne_dst);
#8  0x000063a0529306c1 in ggml_cuda_compute_forward (ctx=..., dst=0x63a057b4ced0) at ggml/src/ggml-cuda.cu:2241
2241                    ggml_cuda_mul_mat(ctx, dst->src[0], dst->src[1], dst);
#9  ggml_backend_cuda_graph_compute (backend=<optimized out>, cgraph=0x63a07998ae88) at ggml/src/ggml-cuda.cu:2602
2602                    bool ok = ggml_cuda_compute_forward(*cuda_ctx, node);
#10 0x000063a052af6585 in ggml_backend_sched_compute_splits (sched=0x63a079ce0200) at ggml/src/ggml-backend.c:1790
1790                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#11 ggml_backend_sched_graph_compute_async (sched=0x63a079ce0200, graph=<optimized out>) at ggml/src/ggml-backend.c:1977
1977        return ggml_backend_sched_compute_splits(sched);
#12 0x000063a052b4a7d0 in llama_graph_compute (n_threads=4, gf=0x63a057a6f4c0, lctx=...) at src/llama.cpp:14421
14421       ggml_backend_sched_graph_compute_async(lctx.sched, gf);
#13 llama_decode_internal (batch_all=..., batch_all=..., lctx=...) at src/llama.cpp:14634
14634           llama_graph_compute(lctx, gf, n_threads);
#14 llama_decode (ctx=0x63a0799b94c0, batch=...) at src/llama.cpp:18338
18338       const int ret = llama_decode_internal(*ctx, batch);
#15 0x000063a052cd93a0 in server_context::update_slots (this=0x7ffc150bd8e0) at examples/server/server.cpp:2339
2339                const int ret = llama_decode(ctx, batch_view);
#16 0x000063a052cc8b12 in std::function<void ()>::operator()() const (this=0x7ffc150be468) at /usr/include/c++/11/bits/std_function.h:590
590     return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#17 server_queue::start_loop (this=this@entry=0x7ffc150be380) at examples/server/server.cpp:500
500             callback_update_slots();
#18 0x000063a05291a961 in main (argc=<optimized out>, argv=<optimized out>) at examples/server/server.cpp:3440
3440        ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 4076) detached]
ngxson commented 3 months ago

Sorry, I didn't understand the issue correctly at first. Since you have 3 GPUs with 24 GB each, there should be 24*3 = 72 GB of memory in total to work with. There is probably a problem with HIPBLAS.

dspasyuk commented 3 months ago

@renbuarl have you tried using --flash-attn option?

renbuarl commented 3 months ago

> @renbuarl have you tried using --flash-attn option?

Thank you! Using the '--flash-attn' option was great advice.

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn

Maximum VRAM consumption is 68.88 GB with an actual context of 32k, and there is no 'CUDA error: out of memory'.

renbuarl commented 3 months ago

To summarize the 'CUDA error: out of memory' situation, what do we have?

When launching llama-server without the --flash-attn option:

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99

The average VRAM consumption is 68.40 GB, but it crashes with 'CUDA error: out of memory' at a relatively small actual context.

When launching llama-server with the --flash-attn option, it works perfectly:

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn

The average VRAM consumption is 58.56 GB.

ngxson commented 3 months ago

@renbuarl Forgot to mention, have you tried changing batch size and ubatch size?

> The average VRAM consumption is 68.40 GB

According to my calculation above, you have 72 GB in total, so it's quite reasonable that it crashes once 68.40 GB is filled (due to overhead).

renbuarl commented 3 months ago

> @renbuarl Forgot to mention, have you tried changing batch size and ubatch size?
>
> > The average VRAM consumption is 68.40 GB
>
> According to my calculation above, you have 72 GB in total, so it's quite reasonable that it crashes once 68.40 GB is filled (due to overhead).

No, I haven't tried that yet. Could you please provide an example?

ngxson commented 3 months ago

For example: -b 1024 -ub 64

Default values:

    int32_t n_batch               =  2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
    int32_t n_ubatch              =   512; // physical batch size for prompt processing (must be >=32 to use BLAS)

The memory usage when running an LLM (in general, not just with llama.cpp) consists of the model weights + KV cache + overhead for graph computation (which depends on batch size), so 68/72 GB may not be enough, since you're not counting the overhead.
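
To put rough numbers on this (a back-of-the-envelope sketch, not llama.cpp's exact accounting): assuming Qwen2-72B's published configuration of 80 layers, 8 KV heads and head dimension 128, the f16 KV cache at n_ctx = 32000 works out to exactly the 10000 MiB reported in the log, and adding the compute buffers from the log plus roughly 44 GiB of Q4_K_M weights (an assumption, not taken from the log) already brings the total close to the 72 GiB limit:

    // Rough VRAM estimate for Qwen2-72B at n_ctx = 32000 (sketch only).
    // Layer/head counts are Qwen2-72B's published config; the weight size is an
    // assumption; the compute-buffer sizes are copied from the log above.
    #include <cstdio>

    int main() {
        // f16 KV cache: K and V, per layer, per KV head, per head dimension, per position
        const double kv_mib = 2.0 /*K+V*/ * 80 /*layers*/ * 8 /*KV heads*/ * 128 /*head dim*/
                              * 2 /*bytes (f16)*/ * 32000 /*n_ctx*/ / (1024.0 * 1024.0);

        const double compute_mib = 4378.01 + 4378.01 + 4378.02; // ROCm0..2 compute buffers (log)
        const double weights_gib = 44.0;                         // Q4_K_M weights (rough assumption)

        printf("KV cache       : %.0f MiB\n", kv_mib);           // 10000 MiB, matches the log
        printf("compute buffers: %.2f MiB\n", compute_mib);      // ~12.8 GiB across 3 GPUs
        printf("approx. total  : %.1f GiB of 72 GiB\n",
               weights_gib + (kv_mib + compute_mib) / 1024.0);   // ~66.6 GiB before pool overhead
        return 0;
    }

The gap up to the observed 68.40 GB is plausibly the cuBLAS/hipBLAS workspace pool that is allocated on the fly (the ggml_cuda_pool_leg::alloc call in the backtrace), which is exactly the allocation that fails once the cards are nearly full; --flash-attn and a smaller -ub both shrink these per-batch buffers.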

renbuarl commented 3 months ago

> For example: -b 1024 -ub 64

@ngxson thanks, that helped too!

Here are the measurements for different options:

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99
VRAM: 68.40 GB

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn
VRAM: 58.56 GB

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 -b 1024 -ub 64
VRAM: 59.04 GB

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 -b 1024 -ub 64 --flash-attn
VRAM: 57.12 GB

The best result is with -b 1024 -ub 64 --flash-attn.

However, it seems that with -b 1024 -ub 64 it runs slower.
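
A plausible reason for the slowdown (a sketch under an assumed prompt length, not a benchmark): the prompt is processed in chunks of n_ubatch tokens, so lowering -ub from the default 512 to 64 means roughly eight times more graph evaluations during prompt processing, which usually costs throughput even though it saves compute-buffer memory.

    // Micro-batches needed to process a prompt of a given length.
    // The 10240-token prompt is hypothetical, chosen to match the n_past values in the log.
    #include <cstdio>

    int main() {
        const int prompt_tokens = 10240;
        const int ubatches[] = {512, 64};
        for (int n_ubatch : ubatches) {
            const int passes = (prompt_tokens + n_ubatch - 1) / n_ubatch; // ceiling division
            printf("n_ubatch = %3d -> %3d micro-batches\n", n_ubatch, passes);
        }
        return 0;
    }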

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.