ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Segmentation fault during inference on AMD gfx900 with codebooga-34b-v0.1.Q5_K_M.gguf #6031

Closed · jin-eld closed this issue 7 months ago

jin-eld commented 7 months ago

Hi,

I compiled llama.cpp from git, today's master HEAD commit 8030da7afea2d89f997aeadbd14183d399a017b9, on Fedora Rawhide (ROCm 6.0.x) like this:

CC=/usr/bin/clang CXX=/usr/bin/clang++ cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx900 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="--rocm-device-lib-path=/usr/lib/clang/17/amdgcn/bitcode"
make -j 16

Then I tried to run a prompt using the codebooga-34b-v0.1.Q5_K_M.gguf model which I got from here: https://huggingface.co/TheBloke/CodeBooga-34B-v0.1-GGUF

I kept the prompt simple and used the following command: ./main -t 10 -ngl 16 -m ~/models/codebooga-34b-v0.1.Q5_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: How do I get the length of a Vec in Rust?\n### Response:"

I have an AMD Instinct MI25 card with 16 GB of VRAM; according to nvtop, with -ngl 16 about half of it is used (8.219Gi/15.984Gi), so this does not seem to be an OOM issue.

The console output looks like this:

Log start                                                                       
main: build = 2408 (8030da7a)
main: built with clang version 18.1.0 (Fedora 18.1.0~rc4-2.fc41) for x86_64-redhat-linux-gnu
main: seed  = 1710292844
[New Thread 0x7fff074006c0 (LWP 11038)]
[New Thread 0x7ffe068006c0 (LWP 11039)]
[Thread 0x7ffe068006c0 (LWP 11039) exited]
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
llama_model_loader: loaded meta data with 21 key-value pairs and 435 tensors from /home/jin/Work/text-generation-webui/models/codebooga-34b-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = oobabooga_codebooga-34b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 22016
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 22016
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.74 B
llm_load_print_meta: model size       = 22.20 GiB (5.65 BPW) 
llm_load_print_meta: general.name     = oobabooga_codebooga-34b-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.33 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/49 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  7500.06 MiB
llm_load_tensors:        CPU buffer size = 22733.73 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   128.00 MiB
llama_kv_cache_init:  ROCm_Host KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =    21.02 MiB
ggml_gallocr_reserve_n: reallocating ROCm0 buffer from size 0.00 MiB to 324.00 MiB
ggml_gallocr_reserve_n: reallocating ROCm_Host buffer from size 0.00 MiB to 336.00 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   324.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =   336.00 MiB
llama_new_context_with_model: graph splits (measure): 3

Shortly after, I get a segfault, although sometimes it starts responding and then crashes a few seconds into the response:

(gdb) bt
#0  amd::KernelParameters::set (this=0x1d9cb10, index=11, size=4, 
    value=0x100000020, svmBound=false)
    at /usr/src/debug/rocclr-6.0.2-1.fc41.x86_64/rocclr/platform/kernel.cpp:127
#1  0x00007fffb9822b7c in ihipLaunchKernel_validate (f=f@entry=0x3281e20, 
    globalWorkSizeX=globalWorkSizeX@entry=4096, 
    globalWorkSizeY=globalWorkSizeY@entry=1, 
    globalWorkSizeZ=globalWorkSizeZ@entry=1, blockDimX=blockDimX@entry=32, 
    blockDimY=blockDimY@entry=1, blockDimZ=1, sharedMemBytes=256, 
    kernelParams=0x7fffffff7430, extra=0x0, deviceId=0, params=0)
    at /usr/src/debug/rocclr-6.0.2-1.fc41.x86_64/hipamd/src/hip_module.cpp:301
#2  0x00007fffb98273fd in ihipModuleLaunchKernel (f=0x3281e20, 
    globalWorkSizeX=4096, globalWorkSizeY=1, globalWorkSizeZ=1, blockDimX=32, 
    blockDimY=1, blockDimZ=1, sharedMemBytes=256, hStream=0x195d320, 
    kernelParams=0x7fffffff7430, extra=0x0, startEvent=0x0, stopEvent=0x0, 
    flags=0, params=0, gridId=0, numGrids=0, prevGridSum=0, allGridSum=0, 
    firstDevice=0)
    at /usr/src/debug/rocclr-6.0.2-1.fc41.x86_64/hipamd/src/hip_module.cpp:371
#3  0x00007fffb98492a2 in ihipLaunchKernel (
    hostFunction=0x679308 <void soft_max_f32<true, 32, 32>(float const*, float const*, float const*, float*, int, int, float, float, float, float, unsigned int)>, 
    gridDim=..., blockDim=..., args=0x7fffffff7430, sharedMemBytes=256, 
    stream=0x195d320, startEvent=0x0, stopEvent=0x0, flags=0)
    at /usr/src/debug/rocclr-6.0.2-1.fc41.x86_64/hipamd/src/hip_platform.cpp:584
#4  0x00007fffb9822519 in hipLaunchKernel_common (
    hostFunction=hostFunction@entry=0x679308 <void soft_max_f32<true, 32, 32>(float const*, float const*, float const*, float*, int, int, float, float, float, float, unsigned int)>, gridDim=..., blockDim=..., 
    args=args@entry=0x7fffffff7430, sharedMemBytes=256, stream=<optimized out>)
    at /usr/src/debug/rocclr-6.0.2-1.fc41.x86_64/hipamd/src/hip_module.cpp:662
#5  0x00007fffb9824b83 in hipLaunchKernel (hostFunction=<optimized out>, 
    gridDim=..., blockDim=..., args=0x7fffffff7430, 
    sharedMemBytes=<optimized out>, stream=<optimized out>)
    at /usr/src/debug/rocclr-6.0.2-1.fc41.x86_64/hipamd/src/hip_module.cpp:669
#6  0x000000000062ea50 in void __device_stub__soft_max_f32<true, 32, 32>(float const*, float const*, float const*, float*, int, int, float, float, float, float, unsigned int) ()
#7  0x000000000062e1f9 in soft_max_f32_cuda (x=0x7ff65e400800, 
    mask=0x7ff65c000800, pos=0x0, dst=0x7ff65e400800, ncols_x=32, nrows_x=128, 
    nrows_y=2, scale=0.0883883461, max_bias=0, stream=0x195d320)
    at /llama.cpp/ggml-cuda.cu:7505
#8  0x000000000062ded6 in ggml_cuda_op_soft_max (src0=0x7ff66eca9450, 
    src1=0x7ff66e80ee50, dst=0x7ff66eca95e0, src0_dd=0x7ff65e400800, 
    src1_dd=0x7ff65c000800, dst_dd=0x7ff65e400800, main_stream=0x195d320)
    at /llama.cpp/ggml-cuda.cu:9053
#9  0x00000000005f98f7 in ggml_cuda_op_flatten (src0=0x7ff66eca9450, 
    src1=0x7ff66e80ee50, dst=0x7ff66eca95e0, 
    op=0x62db50 <ggml_cuda_op_soft_max(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, ihipStream_t*)>)
    at /llama.cpp/ggml-cuda.cu:9145
#10 0x00000000005f856f in ggml_cuda_soft_max (src0=0x7ff66eca9450, 
    src1=0x7ff66e80ee50, dst=0x7ff66eca95e0)
    at /llama.cpp/ggml-cuda.cu:10393
#11 0x00000000005f5cb8 in ggml_cuda_compute_forward (params=0x7fffffff7b78, 
    tensor=0x7ff66eca95e0) at /llama.cpp/ggml-cuda.cu:10619
#12 0x0000000000635106 in ggml_backend_cuda_graph_compute (backend=0x19e1420, 
    cgraph=0x7ff66e8002d8) at /llama.cpp/ggml-cuda.cu:11310
#13 0x00000000005c1d42 in ggml_backend_graph_compute (backend=0x19e1420, 
    cgraph=0x7ff66e8002d8) at /llama.cpp/ggml-backend.c:270
#14 0x00000000005c55c3 in ggml_backend_sched_compute_splits (
    sched=0x7ff66e800010) at /llama.cpp/ggml-backend.c:1474
#15 0x00000000005c5237 in ggml_backend_sched_graph_compute (
    sched=0x7ff66e800010, graph=0x7ff66ec00030)
    at /llama.cpp/ggml-backend.c:1597
#16 0x00000000004f85e9 in llama_graph_compute (lctx=..., gf=0x7ff66ec00030, 
    n_threads=10) at /llama.cpp/llama.cpp:8733
#17 0x00000000004b7926 in llama_decode_internal (lctx=..., batch=...)
    at /llama.cpp/llama.cpp:8887
#18 0x00000000004b6fc3 in llama_decode (ctx=0x19f7b60, batch=...)
    at /llama.cpp/llama.cpp:13837
#19 0x0000000000452e95 in llama_init_from_gpt_params (params=...)
    at /llama.cpp/common/common.cpp:1380
#20 0x000000000042c0a5 in main (argc=18, argv=0x7fffffffdac8)
    at /llama.cpp/examples/main/main.cpp:199

I saw some issues about partial offloading, so I also tried a smaller model that should fit completely on my GPU, but the segfault was still there. The smaller model is this one:

llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = newhope.ggmlv3.q8_0.bin
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 13023.85 MiB
llm_load_tensors:        CPU buffer size =   166.02 MiB

Crashed as well with a very similar backtrace.

Since this is nicely reproducible, I can provide more info or add some debug logs as needed; please let me know what you need.

8XXD8 commented 7 months ago

ROCm 6 doesn't support gfx900; the latest version that still works is 5.7.3.

jin-eld commented 7 months ago

ROCm 6 doesn't support gfx900; the latest version that still works is 5.7.3.

Yes and no: gfx900 will not get any further updates, fixes, or features. That being said, support for it was not removed and it still works with ROCm 6.0.x. Stable Diffusion, Kohya_ss, audiocraft, and PyTorch still work as expected. I could run the ROCm test suite tonight to recheck, but the applications I have been using so far work fine.

8XXD8 commented 7 months ago

I tried too: anything compiled for 5.7 works with ROCm 6, but when compiled with 6 it breaks. PyTorch nightly doesn't work, since it is compiled for ROCm 6. Some parts of ROCm 6 are not built for gfx900, like rocSOLVER; I manually added it to the CMakeLists.txt and it builds fine, but building the entire ROCm suite from source is a gigantic hassle.

jin-eld commented 7 months ago

but building the entire ROCm suite from source is a gigantic hassle.

Absolutely true, been there :) However, Fedora 40 will ship ROCm 6.0 by default; I am already testing it on Rawhide and it works fine (some environment variable quirks are required).

PyTorch nightly for ROCm 6 works for me on Rawhide as well. I can check tonight whether the rocsolver rpm includes gfx900, but as far as I remember the Fedora packages build and support all available GPU architectures.

Newer ROCm libraries, like hipBLASLt, will not support gfx900 at all, but at least from what I can see, everything that worked on gfx900 in 5.7 is still there in 6.0. When switching from 5.7 to 6.0 you need to disable SDMA on gfx900 with export HSA_ENABLE_SDMA=0, see https://github.com/ROCm/ROCm/issues/2781
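
For anyone hitting the same thing, a minimal sketch of how to apply the workaround before launching anything ROCm-based, llama.cpp included (the -ngl value and model path are just examples from my setup):

export HSA_ENABLE_SDMA=0   # gfx900 workaround for ROCm 6.x, see the linked ROCm issue
./main -ngl 16 -m ~/models/codebooga-34b-v0.1.Q5_K_M.gguf -p "Hello"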

8XXD8 commented 7 months ago

Thanks for the tip, export HSA_ENABLE_SDMA=0 did it for me; now ROCm 6 works with my MI25. I compiled llama.cpp with

make clean && CC=/home/user/llvm/bin/clang CXX=/home/user/llvm/bin/clang++ make main LLAMA_HIPBLAS=1 AMDGPU_TARGETS="gfx900;gfx906" -j64

and can confirm that codebooga-34b-v0.1.Q5_K_M.gguf works with both multi-GPU and partial offload. I think your problem might be the clang version: ROCm 6 comes with clang 17, and my version is built from the latest https://github.com/ROCm/llvm-project.

jin-eld commented 7 months ago

now ROCm 6 works with my MI25

Ah, another poor soul on an MI25. How did you manage to cool these damn things? I am still struggling with that :)

I think your problem might be the clang version, Rocm 6 comes with 17, and my version is built from the latest https://github.com/ROCm/llvm-project.

Do I understand correctly: you compiled HEAD of https://github.com/ROCm/llvm-project and used it to build llama.cpp, or did you rebuild the whole ROCm suite with it?

By the way, I ran rvs (the ROCm Validation Suite) today and it finished without any errors, so ROCm-wise everything should work...

8XXD8 commented 7 months ago

how did you manage to cool these damn things

I use a Delta bfb1012hh blower

Do I understand correctly - you compiled HEAD of https://github.com/ROCm/llvm-project and used it to build llama.cpp

Yes, I compiled llama.cpp with it, but it also works with the clang 17 that comes with ROCm 6.

jin-eld commented 7 months ago

I use a Delta bfb1012hh blower

Thank you for the hint!

Yes, I compiled llama.cpp with it, but it also works with the clang 17 that comes with ROCm 6.

Well, it does not work for me... afaik ROCm 6 on Rawhide uses clang 17 as well, and I get the crash there. Sometimes it will start replying and crash in the middle of a sentence; when running in gdb it crashes right away after loading the model (backtrace above).

Which distro are you using?

8XXD8 commented 7 months ago

Which distro are you using?

I'm using Debian 12, but with kernel 6.8. ROCm is installed without the DKMS driver; I'm using the built-in amdgpu driver.

afaik ROCm 6 on Rawhide uses clang 17

Well, you had main: built with clang version 18.1.0 (Fedora 18.1.0~rc4-2.fc41) for x86_64-redhat-linux-gnu in your log. rocm-llvm is installed to /opt/rocm/llvm/bin.

jin-eld commented 7 months ago

Well, you had main: built with clang version 18.1.0 (Fedora 18.1.0~rc4-2.fc41) for x86_64-redhat-linux-gnu in your log.

Oops, thanks for pointing that out. I guess I kept updating Rawhide and "overshot" the upcoming F40 release; I totally missed that and will downgrade.

I am on kernel 6.8 as well, also using the built-in amdgpu driver.

8XXD8 commented 7 months ago

I think you need a ROCm-specific compiler, not the regular clang. For me, the log line reads main: built with AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4) for x86_64-unknown-linux-gnu; this compiler comes from the rocm-llvm package that ships with the ROCm install.
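
If you want to double-check which compiler you are actually picking up, comparing the two should make the difference obvious. This is just a rough sketch; the /opt/rocm path is where the AMD packages put it, and packaged distributions may place it elsewhere:

/opt/rocm/llvm/bin/clang --version   # ROCm's own clang, reports "AMD clang ..."
/usr/bin/clang --version             # the distro clang, which is what your build picked up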

jin-eld commented 7 months ago

@8XXD8 thank you for the hints once more; together with the help of the Fedora AI/ML folks, I think I finally figured it out. Your pointer to AMD's LLVM was one important piece of the puzzle. So, for everyone who is building llama.cpp on Fedora: you need to point both CC and CXX to hipcc, which is a clang wrapper that makes sure the AMD/ROCm LLVM pieces are used.

Set up the environment for gfx900 (this is Fedora-specific):

module load rocm/gfx9

So my cmake setup looks like this now:

CC=/usr/bin/hipcc CXX=/usr/bin/hipcc cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx900 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="--rocm-device-lib-path=/usr/lib/clang/17/amdgcn/bitcode"
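
For completeness, the rest of the build plus a quick sanity check, roughly as I run it (the -j value, model path and -n are just from my setup):

make -j 16
./bin/main -m ~/models/codebooga-34b-v0.1.Q5_K_M.gguf -p "Hello" -n 16
# the "main: built with ..." line at startup should now report the clang that hipcc wraps
# (clang 17 on current Fedora) instead of the plain Fedora clang 18 from my first attempt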

And now the crash is gone, so I'll close the issue - this was not a llama.cpp bug, but a user error on my side.

@8XXD8 I'd still like to compare with your results, though, if I may: this nvtop graph looks a bit strange to me. I would have expected the GPU to be at 100% all the time; does it look the same for you?

(screenshot: nvtop GPU utilization graph)

I was testing with the codebooga model:

llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.74 B
llm_load_print_meta: model size       = 22.20 GiB (5.65 BPW) 
llm_load_print_meta: general.name     = oobabooga_codebooga-34b-v0.1

and could fit 30 layers into VRAM:

llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/49 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 13949.06 MiB
llm_load_tensors:        CPU buffer size = 22733.73 MiB

My test line is now:

./bin/main -t 16 -ngl 30 -sm none -m ~/Work/text-generation-webui/models/codebooga-34b-v0.1.Q5_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: How do I get the length of a Vec in Rust?\n### Response:"

And I am getting these timings:

llama_print_timings:        load time =    4296.82 ms
llama_print_timings:      sample time =      93.82 ms /   264 runs   (    0.36 ms per token,  2813.87 tokens per second)
llama_print_timings: prompt eval time =    2442.01 ms /    24 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   58858.64 ms /   263 runs   (  223.80 ms per token,     4.47 tokens per second)
llama_print_timings:       total time =   61462.25 ms /   287 tokens

Is this comparable to your speeds and does your graph also look that "spikey"?

slaren commented 7 months ago

@jin-eld Can you test whether it is easier to build with HIP with the changes in https://github.com/ggerganov/llama.cpp/pull/5966?

slaren commented 7 months ago

Is this comparable to your speeds and does your graph also look that "spikey"?

The spikes are normal when offloading a model partially, because the GPU is idle while the CPU is processing its part of the model.

jin-eld commented 7 months ago

Closing, not a bug, solution in https://github.com/ggerganov/llama.cpp/issues/6031#issuecomment-1995958369