ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama.cpp Windows/ROCm builds are broken? Using shared GPU memory instead of dedicated. #9964

Open SteelPh0enix opened 1 month ago

SteelPh0enix commented 1 month ago

Discussed in https://github.com/ggerganov/llama.cpp/discussions/9960

Originally posted by **SteelPh0enix** October 20, 2024

I've been using llama.cpp w/ ROCm 6.1.2 on the latest Windows 11 for quite a while. My hardware setup is an RX 7900XT (gfx1100, 20GB VRAM), paired with 32GB RAM and a Ryzen 9 5900X. Recently (this month), I've noticed that the latest builds perform extremely badly compared to previous ones - inference is an order of magnitude slower - and it happens only on Windows. I'm also using llama.cpp on Arch Linux w/ ROCm 6.0.2, on the same hardware, without any performance issues whatsoever, so I assume it's a Windows-specific bug. Note that I haven't changed anything in my hardware setup or OS. I did modify the way I build llama.cpp recently, but not in a major way, and I've re-checked with the "old" building method (attached below) that worked previously.

I have noticed that any model I try to load gets pushed to *shared* GPU memory (per Task Manager) instead of *dedicated* memory, as in the screenshot below. Since shared memory is reported as 16GB, the GPU has 20GB VRAM (which matches the reported dedicated memory), and RAM spikes to maximum, I guess the model ends up in regular RAM, which makes inference incredibly slow due to memory bandwidth constraints.

![wezterm-gui_ZCpOP8k6ft](https://github.com/user-attachments/assets/afd6a6bd-f066-490b-b066-d3f8b629dc3e)

I would like to present the results of `llama-benchmark` here, but I'm unable to finish it in *reasonable* time - it gives me a pp512 score of ~500 (it's well above 1000 on the Linux build) and then just hangs trying to do the warmup for generation. I've managed to test the inference speed with llama-server, but I got worse results than with raw CPU inference.

To check whether this is an issue with a recent build, I rolled back to a llama.cpp version from around a month ago - a build I knew for sure had been working fine - but nope, still the same thing! Inference happens on the GPU, but the model is loaded into shared memory! I have no idea what could cause that.

I'm attaching the build script for Windows that I've been using for a long time with success. I'm using the same CMake flags on Linux. Can anyone suggest how I could diagnose and fix this issue?

```bat
REM execute via VS native tools command line prompt
REM make sure to clone the repo first, put this script next to the repo dir
REM this script is configured for building llama.cpp w/ ROCm support
REM for a system with Ryzen 9 5900X and RX 7900XT.
REM Unless you have the exact same setup, you may need to change some flags
REM and/or strings here.

set AMDGPU_TARGETS="gfx1100"
set HSA_OVERRIDE_GFX_VERSION="11.0.0"
set ROCM_VERSION="6.1.2"
set USE_ROCM=1

set ROCM_PATH=%HIP_PATH%
set CMAKE_MODULE_PATH=%HIP_PATH%cmake;%CMAKE_MODULE_PATH%
set CMAKE_PREFIX_PATH=%ROCM_PATH%;%CMAKE_PREFIX_PATH%
set PATH=%ROCM_PATH%bin;C:\Strawberry\perl\bin;%PATH%

set LLAMA_CPP_PYTHON_VENV_PATH=%USERPROFILE%\.llama.cpp.venv
call %LLAMA_CPP_PYTHON_VENV_PATH%\Scripts\activate.bat

cd %HOMEPATH%\.llama.cpp
git fetch
git clean -xddf
git pull
git submodule update --recursive
git lfs pull

REM update Python dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install --upgrade sentencepiece transformers protobuf torch

cmake -S . -B build -G Ninja^
 -DCMAKE_BUILD_TYPE=Release^
 -DCMAKE_CXX_COMPILER=clang++^
 -DCMAKE_C_COMPILER=clang^
 -DLLAMA_BUILD_TESTS=OFF^
 -DLLAMA_BUILD_EXAMPLES=ON^
 -DLLAMA_BUILD_SERVER=ON^
 -DLLAMA_STANDALONE=ON^
 -DLLAMA_CURL=OFF^
 -DGGML_NATIVE=ON^
 -DGGML_LTO=ON^
 -DGGML_OPENMP=ON^
 -DAMDGPU_TARGETS=%AMDGPU_TARGETS%^
 -DGGML_HIPBLAS=ON^
 -DGGML_CUDA_FORCE_CUBLAS=ON

cmake --build build --config Release --parallel 24
```
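
For anyone who wants to reproduce the comparison, a minimal check looks roughly like the following sketch (the binary location, model path, and token counts are placeholders, not my exact values): run the benchmark with full offload and watch the dedicated vs. shared GPU memory graphs in Task Manager while it runs.

```bat
REM Sketch of a quick sanity check: offload all layers and watch Task Manager.
REM Adjust the binary path and model path to your own build and model;
REM any GGUF file that comfortably fits in 20GB VRAM works for the comparison.
set MODEL=%USERPROFILE%\models\some-model-q6_k.gguf

REM -ngl 99 offloads every layer; -p/-n set prompt/generation token counts; -r repeats runs.
build\bin\llama-bench.exe -m %MODEL% -ngl 99 -p 512 -n 128 -r 3
```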

After doing more testing, I've noticed two things:

First: I was quantizing models with `--leave-output-tensor`, which made my models run very slowly under Linux too. That was a side effect of my investigation, and I'm leaving it here in case somebody else runs into the same thing :)

Second, closely related to the issue: some models work just fine. In my first test I was checking a Qwen2.5 14B finetune quantized to Q6_K. In terms of memory allocation, this model behaves the same on Windows whether I leave the output tensor as-is or not. Not leaving the output tensor improves performance a little, but it's not very noticeable due to the memory constraints.

HOWEVER, LLaMA 3.2 3B quantized to Q8_0 works just fine and is loaded into dedicated GPU memory! What's going on here?
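
For context on the `--leave-output-tensor` side note: the two Q6_K variants I compared were produced roughly like this (paths and file names are placeholders; the flag simply keeps `output.weight` unquantized):

```bat
REM Default quantization: the output tensor is quantized along with the rest.
build\bin\llama-quantize.exe model-f16.gguf model-q6_k.gguf Q6_K

REM Variant that leaves output.weight in its original precision; this is the
REM one that ran noticeably slower for me, even under Linux.
build\bin\llama-quantize.exe --leave-output-tensor model-f16.gguf model-q6_k-lot.gguf Q6_K
```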

SteelPh0enix commented 1 month ago

Small update: I've confirmed that this bug does not happen when using Vulkan as the backend.
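
In case it helps anyone as a workaround, a minimal Vulkan-backend configure step (assuming the Vulkan SDK is installed; the extra flags are just examples, not a full copy of my build script) looks roughly like:

```bat
REM Sketch of a Vulkan-backend build, replacing the HIP/ROCm flags used above.
REM Requires the Vulkan SDK; adjust the generator, compilers, and paths to your setup.
cmake -S . -B build-vulkan -G Ninja^
 -DCMAKE_BUILD_TYPE=Release^
 -DGGML_VULKAN=ON^
 -DLLAMA_BUILD_SERVER=ON

cmake --build build-vulkan --config Release --parallel 24
```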

qh0814 commented 4 weeks ago

Which version of the driver are you using? I encountered the same issue, but everything worked smoothly after I downgraded to 24.5.1.

SteelPh0enix commented 4 weeks ago

> Which version of the driver are you using? I encountered the same issue, but everything worked smoothly after I downgraded to 24.5.1.

Currently on 24.9.1 - and yeah, that might be it! I have a pending update to 24.10.1; I'll see if that helps, and if not, I'll try downgrading.

EDIT: It's still loading into shared memory on 24.10.1. I'll downgrade the driver soon and verify whether that's the issue.

YellowRoseCx commented 1 week ago

Does the issue happen with the koboldcpp-rocm fork? https://github.com/YellowRoseCx/koboldcpp-rocm

tigert2173 commented 1 week ago

Same issue here. I got 24.8 to work, but performance degrades over time. When I tried 24.5.1 it gave ROCm errors. I'm trying 24.7 now and it seems to work; I'll have to keep an eye on it. 24.9 and 24.10 do not work either.

sorasoras commented 1 week ago

Yeah, maybe this is the reason I experienced an extreme slowdown on my 7900 XTX after b3666.

SteelPh0enix commented 1 week ago

Someone confirmed in the discussion thread that rolling back the runtime fixes the issue: https://github.com/ggerganov/llama.cpp/discussions/9960#discussioncomment-11141805