Rocm 6 doesn't support gfx900, the latest that still works is 5.7.3.
> Rocm 6 doesn't support gfx900, the latest that still works is 5.7.3.
Yes and no - gfx900 will not get any further updates, fixes, or features; that being said, support for it was not removed and it still works with ROCm 6.0.x. StableDiffusion, Kohya_ss, audiocraft, PyTorch - all still work as expected. I could run the ROCm test suite tonight to recheck, but the applications I have been using so far work fine.
I tried too: anything compiled for ROCm 5.7 works with ROCm 6, but anything compiled with 6 breaks. PyTorch nightly doesn't work, since it is compiled for ROCm 6. Some parts of ROCm 6 are not built for gfx900, like rocSOLVER; I manually added gfx900 to its CMakeLists.txt and it builds fine, but building the entire ROCm suite from source is a gigantic hassle.
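(For anyone attempting the same, a rough sketch of the idea - passing the target list at configure time instead of patching CMakeLists.txt. The `AMDGPU_TARGETS` variable is the usual ROCm library convention, not something verified against rocSOLVER here, so treat it as an assumption:)

```bash
# Sketch only: build rocSOLVER with gfx900 included in the GPU target list.
# Assumes rocBLAS and the ROCm compiler are already installed; verify the
# AMDGPU_TARGETS variable against the rocSOLVER CMakeLists.txt of your ROCm version.
git clone https://github.com/ROCm/rocSOLVER
cd rocSOLVER && mkdir build && cd build
CXX=/opt/rocm/bin/hipcc cmake .. -DAMDGPU_TARGETS="gfx900;gfx906" -DCMAKE_BUILD_TYPE=Release
cmake --build . -j"$(nproc)"
```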
> but building the entire Rocm suite from source is a gigantic hassle.

Absolutely true, been there :) However, Fedora 40 will ship ROCm 6.0 by default; I am already testing it on Rawhide and it works fine (some environment variable quirks are required).
PyTorch nightly for ROCm 6 works for me on Rawhide as well. I can check tonight whether the rocsolver rpm includes gfx900, but AFAIR the Fedora packages are built for all available GPU architectures.
We will see new ROCm libraries that do not support gfx900 at all, like hipBLASLt, but at least from what I can see, stuff that worked for gfx900 in 5.7 is still there in 6.0. When switching from 5.7 to 6.0 you need to disable SDMA on gfx900 with `export HSA_ENABLE_SDMA=0`, see https://github.com/ROCm/ROCm/issues/2781
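(If you want the workaround to apply to every session rather than exporting it in each shell, one option - my own suggestion, not something the linked ROCm issue prescribes - is to set it system-wide:)

```bash
# Persist the gfx900 SDMA workaround (assumes a distro that reads /etc/environment via pam_env)
echo 'HSA_ENABLE_SDMA=0' | sudo tee -a /etc/environment
# Or set it only for a single run:
HSA_ENABLE_SDMA=0 ./main -m ./models/some-model.gguf -ngl 30 -p "Hello"
```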
Thanks for the tip, `export HSA_ENABLE_SDMA=0` did it for me; now ROCm 6 works with my MI25.
I compiled llama.cpp with `make clean && CC=/home/user/llvm/bin/clang CXX=/home/user/llvm/bin/clang++ make main LLAMA_HIPBLAS=1 AMDGPU_TARGETS="gfx900;gfx906" -j64` and can confirm that codebooga-34b-v0.1.Q5_K_M.gguf works with both multi-GPU and partial offload.
I think your problem might be the clang version: ROCm 6 comes with 17, and my version is built from the latest https://github.com/ROCm/llvm-project.
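(On the multi-GPU and partial-offload point, a rough invocation for two gfx900 cards could look like the sketch below - the layer count and model path are placeholders, and `-sm layer` spreads the offloaded layers across the GPUs:)

```bash
# Sketch: split the offloaded layers across both gfx900 GPUs, keep the remaining layers on the CPU
HSA_ENABLE_SDMA=0 ./main \
    -m ./models/codebooga-34b-v0.1.Q5_K_M.gguf \
    -ngl 30 -sm layer \
    -p "### Instruction: How do I get the length of a Vec in Rust?\n### Response:"
```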
> now ROCm 6 works with my MI25

Ah, another poor soul on MI25 - how did you manage to cool these damn things? I am still struggling with that :)

> I think your problem might be the clang version: ROCm 6 comes with 17, and my version is built from the latest https://github.com/ROCm/llvm-project.

Do I understand correctly - you compiled HEAD of https://github.com/ROCm/llvm-project and used it to build llama.cpp, or did you rebuild the whole ROCm suite with it?
By the way, I ran `rvs` (the ROCm validation suite) today and it finished without any errors, so ROCm-wise everything should work...
> how did you manage to cool these damn things

I use a Delta BFB1012HH blower.

> Do I understand correctly - you compiled HEAD of https://github.com/ROCm/llvm-project and used it to build llama.cpp

Yes, I compiled llama.cpp with it, but it also works with the clang 17 that comes with ROCm 6.
> I use a Delta BFB1012HH blower

Thank you for the hint!

> Yes, I compiled llama.cpp with it, but it also works with the clang 17 that comes with ROCm 6

Well, it does not work for me... AFAIK ROCm 6 on Rawhide uses clang 17 as well and I get the crash there. Sometimes it will already start replying and crash in the middle of a sentence; when running in gdb it will crash right away after loading the model (backtrace above).
Which distro are you using?
> Which distro are you using?

I'm using Debian 12, but with kernel 6.8. ROCm is installed without the DKMS driver; I'm using the built-in amdgpu driver.

> AFAIK ROCm 6 on Rawhide uses clang 17

Well, you had `main: built with clang version 18.1.0 (Fedora 18.1.0~rc4-2.fc41) for x86_64-redhat-linux-gnu` in your log.
rocm-llvm is installed to /opt/rocm/llvm/bin.
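(As a sketch of what that means in practice - assuming the default /opt/rocm prefix - you can point a build at the ROCm clang and check that it reports itself as the AMD build:)

```bash
# Use the ROCm-bundled clang instead of the distro clang (assumes the default /opt/rocm prefix)
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/llvm/bin/clang++
"$CC" --version   # should report something like "AMD clang version 17.0.0 (... roc-6.0.x ...)"
```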
> Well, you had `main: built with clang version 18.1.0 (Fedora 18.1.0~rc4-2.fc41) for x86_64-redhat-linux-gnu` in your log.

Oops, thanks for pointing that out. I guess I kept updating Rawhide and "overshot" the upcoming F40 release; I totally missed that, will downgrade.
I am on kernel 6.8, also using the built-in amdgpu driver.
I think you need a ROCm-specific compiler, not the regular clang.
For me, the log looks like `main: built with AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4) for x86_64-unknown-linux-gnu`; this version comes from the rocm-llvm package that ships with the ROCm install.
@8XXD8 thank you for the hints once more; together with the help of the Fedora AI/ML folks I think I finally figured it out. Your pointer to AMD's LLVM was one important piece of the puzzle, so for everyone who is building llama.cpp on Fedora: you need to point both CC and CXX to hipcc, which is a clang wrapper that makes sure the AMD/ROCm LLVM pieces are used.

Set up the environment for gfx900 (this is Fedora specific): `module load rocm/gfx9`

So my cmake setup looks like this now:

`CC=/usr/bin/hipcc CXX=/usr/bin/hipcc cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx900 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="--rocm-device-lib-path=/usr/lib/clang/17/amdgcn/bitcode"`
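(For anyone landing here via search, the full sequence on Fedora looks roughly like the sketch below, assembled from the steps above; the clang 17 bitcode path is version-specific and may differ on your install:)

```bash
# Fedora-specific: load the ROCm environment for gfx9 cards (MI25 / gfx900)
module load rocm/gfx9
# gfx900 SDMA workaround when moving from ROCm 5.7 to 6.0
export HSA_ENABLE_SDMA=0

# Configure and build llama.cpp with hipcc so the AMD/ROCm LLVM toolchain is used
mkdir -p build && cd build
CC=/usr/bin/hipcc CXX=/usr/bin/hipcc cmake .. \
    -DLLAMA_HIPBLAS=ON \
    -DAMDGPU_TARGETS=gfx900 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_FLAGS="--rocm-device-lib-path=/usr/lib/clang/17/amdgcn/bitcode"
cmake --build . -j"$(nproc)"
```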
And now the crash is gone, so I'll close the issue - this was not a llama.cpp bug, but a user error on my side.
@8XXD8 I'd still like to compare with your results though, if I may: this nvtop graph looks a bit strange to me, I would have expected the GPU to be at 100% all the time - does it look the same for you?
I was testing with the codebooga model:
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 33.74 B
llm_load_print_meta: model size = 22.20 GiB (5.65 BPW)
llm_load_print_meta: general.name = oobabooga_codebooga-34b-v0.1
and could fit 30 layers into VRAM:
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/49 layers to GPU
llm_load_tensors: ROCm0 buffer size = 13949.06 MiB
llm_load_tensors: CPU buffer size = 22733.73 MiB
My test line is now:
./bin/main -t 16 -ngl 30 -sm none -m ~/Work/text-generation-webui/models/codebooga-34b-v0.1.Q5_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: How do I get the length of a Vec in Rust?\n### Response:"
And I am getting these timings:
llama_print_timings: load time = 4296.82 ms
llama_print_timings: sample time = 93.82 ms / 264 runs ( 0.36 ms per token, 2813.87 tokens per second)
llama_print_timings: prompt eval time = 2442.01 ms / 24 tokens ( 101.75 ms per token, 9.83 tokens per second)
llama_print_timings: eval time = 58858.64 ms / 263 runs ( 223.80 ms per token, 4.47 tokens per second)
llama_print_timings: total time = 61462.25 ms / 287 tokens
Is this comparable to your speeds and does your graph also look that "spikey"?
@jin-eld Can you test whether it is easier to build with HIP with the changes in https://github.com/ggerganov/llama.cpp/pull/5966?
> Is this comparable to your speeds and does your graph also look that "spikey"?
The spikes are normal when offloading a model partially, because the GPU is idle while the CPU is processing its part of the model.
Closing, not a bug, solution in https://github.com/ggerganov/llama.cpp/issues/6031#issuecomment-1995958369
Hi,

I compiled llama.cpp from git, today's master HEAD (commit 8030da7afea2d89f997aeadbd14183d399a017b9), on Fedora Rawhide (ROCm 6.0.x) like this:

Then I tried to run a prompt using the codebooga-34b-v0.1.Q5_K_M.gguf model, which I got from here: https://huggingface.co/TheBloke/CodeBooga-34B-v0.1-GGUF
I kept the prompt simple and used the following command: `./main -t 10 -ngl 16 -m ~/models/codebooga-34b-v0.1.Q5_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: How do I get the length of a Vec in Rust?\n### Response:"`
I have an AMD Instinct MI25 card with 16GB VRAM; according to nvtop, with `-ngl 16` about half of it is used (8.219Gi/15.984), so this does not seem to be an OOM issue.

The console output looks like this:
Shortly after I get a segfault, although sometimes it starts responding and crashes a few seconds into the response:
I saw some issues about partial offloading and also tried a smaller model which should completely fit on my GPU, but the segfault was still there. The smaller model is this one:
Crashed as well with a very similar backtrace.
Since this is nicely reproducible, I can provide more info or add some debug logs as needed - please let me know what you need.
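(If a deeper backtrace helps, a debug build run under gdb is one way to get it - a sketch reusing the CMake-based configuration used elsewhere in this thread, with paths and the prompt as placeholders:)

```bash
# Sketch: rebuild with debug symbols and capture a backtrace under gdb
CC=/usr/bin/hipcc CXX=/usr/bin/hipcc cmake .. \
    -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx900 -DCMAKE_BUILD_TYPE=Debug
cmake --build . -j"$(nproc)"
gdb --args ./bin/main -t 10 -ngl 16 -m ~/models/codebooga-34b-v0.1.Q5_K_M.gguf \
    -p "### Instruction: How do I get the length of a Vec in Rust?\n### Response:"
# inside gdb: "run", then "bt" after the segfault to print the backtrace
```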