
HIPBLAS / ROCm low prompt eval performance #7533

Closed: etemiz closed this issue 1 month ago

etemiz commented 4 months ago

I have two MI60's that don't perform well during prompt evaluation. What could be the reason?

Model Llama3-70B Q6:

llama_print_timings: prompt eval time = 3722.63 ms / 18 tokens (206.81 ms per token, 4.84 tokens per second)
llama_print_timings:        eval time = 4274.60 ms / 35 runs (122.13 ms per token, 8.19 tokens per second)

compile:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16

ROCk module version 6.7.0

When using an 8B model Q8, it does this:

llama_print_timings: prompt eval time = 200.58 ms / 18 tokens (11.14 ms per token, 89.74 tokens per second)
llama_print_timings:        eval time = 1819.74 ms / 94 runs (19.36 ms per token, 51.66 tokens per second)

I also applied the hack from https://github.com/ggerganov/llama.cpp/issues/3772#issuecomment-2012606521, which fixed the garbled-output issue, but I don't know whether it is related.

Now I am wondering if it is a 6-bit quantization issue.

Thank you!

Engininja2 commented 3 months ago

The mul_mat_q kernels have tuning configurations for a few different GPU architectures. Right now, anything with an ISA version older than RDNA2 gets the settings for RDNA1. RDNA1/gfx1010 has more VGPRs than gfx906 does, so when compiled for gfx906 the kernels end up spilling a lot of VGPRs to scratch memory, which is slow.

For example, mul_mat_q6_K() has a vgpr_spill_count of 200 registers when compiled for gfx906, whereas for gfx1010 it is 0. When compiling with make, you can set HIPFLAGS="-save-temps" to get a *.s file for each HIP source file, containing the resulting assembly and statistics such as the spill count and the kernel's maximum occupancy on the GPU.
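
A concrete way to check that, sketched under the assumption that the Makefile's LLAMA_HIPBLAS build is used (variable names may differ between llama.cpp versions):

make LLAMA_HIPBLAS=1 HIPFLAGS="-save-temps" -j 16
grep vgpr_spill_count *.s

Each generated *.s file should contain an AMDGPU metadata block near the end; a nonzero .vgpr_spill_count there means the kernel's working set did not fit in registers and is spilling to scratch.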

Someone with a Vega GPU will need to try new values for x, y, and nwarps for each mmq kernel, and I think __launch_bounds__() will need to be set for Vega cards to get the compiler to limit occupancy to 2; the default looks like 4. The advantage of limiting occupancy is that more VGPRs become available, but the disadvantage is that with fewer waves in flight the GPU has more trouble hiding memory latency, so it needs testing to see which is better.
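
For illustration, a minimal HIP sketch of that mechanism; the tile constants, macro names, and kernel body below are hypothetical stand-ins, not the actual mmq identifiers. In HIP, the second argument of __launch_bounds__ is the minimum number of waves per execution unit, so requesting 2 instead of 4 lets the compiler budget up to half of a SIMD's register file per wave:

// sketch.hip.cpp: per-arch tuning plus __launch_bounds__, not real mmq code
#include <hip/hip_runtime.h>

#define WAVE_SIZE 64               // GCN wavefront size; RDNA parts use 32, simplified here

#if defined(__gfx906__) || defined(__gfx900__)   // Vega: hypothetical retuned values
#define MMQ_X            64
#define MMQ_Y            64
#define NWARPS           4
#define MIN_WAVES_PER_EU 2         // cap occupancy at 2 waves per SIMD
#else                                            // the RDNA1 values Vega currently inherits
#define MMQ_X            64
#define MMQ_Y            128
#define NWARPS           8
#define MIN_WAVES_PER_EU 4
#endif

// Lowering the second launch-bounds argument from 4 to 2 gives each wave
// up to twice the VGPR budget, at the cost of fewer waves in flight to
// hide memory latency, which is exactly the trade-off described above.
__global__ void __launch_bounds__(NWARPS * WAVE_SIZE, MIN_WAVES_PER_EU)
scale_rows_sketch(const float * x, float * dst, int n) {
    const int i = (int)(blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n) {
        dst[i] = 2.0f * x[i];      // placeholder body; the real mmq kernels tile into LDS
    }
}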

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.