First of all, it is not a good idea to use this on a dGPU (it works, but really slowly), so it should only be activated on iGPUs. It looks to work well on my Ryzen 7940HX on Linux.
We may need more benchmarks to decide what to do.
If you want to help with benchmarking, here is what I did on Linux:
# get the PR (until it is merged)
git clone https://github.com/ggerganov/llama.cpp.git llama.cpp_bench
cd llama.cpp_bench
git fetch origin pull/7414/head:benchmark
git checkout benchmark
# get the models (so the benchmarks can be compared fairly)
cd ..
mkdir models
cd models
wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.F16.llamafile
unzip mistral-7b-instruct-v0.2.F16.llamafile mistral-7b-instruct-v0.2.F16.gguf
rm mistral-7b-instruct-v0.2.F16.llamafile
wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q8_0.llamafile
unzip mistral-7b-instruct-v0.2.Q8_0.llamafile mistral-7b-instruct-v0.2.Q8_0.gguf
rm mistral-7b-instruct-v0.2.Q8_0.llamafile
wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.llamafile
unzip mistral-7b-instruct-v0.2.Q4_K_M.llamafile mistral-7b-instruct-v0.2.Q4_K_M.gguf
rm mistral-7b-instruct-v0.2.Q4_K_M.llamafile
wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.BF16.llamafile
unzip mistral-7b-instruct-v0.2.BF16.llamafile mistral-7b-instruct-v0.2.BF16.gguf
rm mistral-7b-instruct-v0.2.BF16.llamafile
# build for CPU [n°0]
cd llama.cpp_bench
make clean
make -j16
# build for GPU
# - for Ryzen 7040 (gfx1103) the GPU is not officially "supported"; use gfx1101 on Linux
export HSA_OVERRIDE_GFX_VERSION=11.0.1
export GFX_HARDWARE=gfx1101
# - for other hardware, set these to match your GPU
# - weights in VRAM [n°1]
make clean
make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=${GFX_HARDWARE}
# - weights in "UMA" [n°2]
make clean
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=${GFX_HARDWARE}
# benchmark:
# - for CPU:
./llama-bench --mmap 1 -p 256,512,1024 \
-m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
-m ../models/mistral-7b-instruct-v0.2.Q8_0.gguf \
-m ../models/mistral-7b-instruct-v0.2.F16.gguf \
-m ../models/mistral-7b-instruct-v0.2.BF16.gguf
# - for GPU:
./llama-bench --mmap 0 -p 256,512,1024 \
-m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
-m ../models/mistral-7b-instruct-v0.2.Q8_0.gguf \
-m ../models/mistral-7b-instruct-v0.2.F16.gguf
Hardware: Ryzen 7940HS / 64 GB, with `export HSA_OVERRIDE_GFX_VERSION=11.0.1` and `export GFX_HARDWARE=gfx1101`
model | size | params | backend | threads | test | t/s |
---|---|---|---|---|---|---|
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | pp256 | 46.49 ± 0.57 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | pp512 | 45.33 ± 0.06 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | pp1024 | 44.08 ± 0.21 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | tg128 | 12.91 ± 0.04 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | pp512+tg128 | 29.58 ± 0.08 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | CPU | 8 | pp256 | 57.58 ± 0.10 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | CPU | 8 | pp512 | 55.43 ± 0.07 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | CPU | 8 | pp1024 | 54.98 ± 0.04 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | CPU | 8 | tg128 | 7.39 ± 0.09 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | CPU | 8 | pp512+tg128 | 23.86 ± 0.25 |
llama 7B F16 | 13.49 GiB | 7.24 B | CPU | 8 | pp256 | 54.08 ± 0.24 |
llama 7B F16 | 13.49 GiB | 7.24 B | CPU | 8 | pp512 | 43.61 ± 0.05 |
llama 7B F16 | 13.49 GiB | 7.24 B | CPU | 8 | pp1024 | 43.33 ± 0.08 |
llama 7B F16 | 13.49 GiB | 7.24 B | CPU | 8 | tg128 | 3.96 ± 0.01 |
llama 7B F16 | 13.49 GiB | 7.24 B | CPU | 8 | pp512+tg128 | 14.50 ± 0.02 |
llama 7B BF16 | 13.49 GiB | 7.24 B | CPU | 8 | pp256 | 39.65 ± 0.20 |
llama 7B BF16 | 13.49 GiB | 7.24 B | CPU | 8 | pp512 | 39.11 ± 0.01 |
llama 7B BF16 | 13.49 GiB | 7.24 B | CPU | 8 | pp1024 | 38.44 ± 0.21 |
llama 7B BF16 | 13.49 GiB | 7.24 B | CPU | 8 | tg128 | 4.06 ± 0.01 |
llama 7B BF16 | 13.49 GiB | 7.24 B | CPU | 8 | pp512+tg128 | 14.33 ± 0.01 |
GPU-VRAM (n°1): not enough VRAM on my system for this run.
GPU-UMA (n°2):
model | size | params | backend | ngl | mmap | test | t/s |
---|---|---|---|---|---|---|---|
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | 0 | pp256 | 220.96 ± 0.94 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | 0 | pp512 | 196.54 ± 0.15 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | 0 | pp1024 | 192.39 ± 0.33 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | 0 | tg128 | 15.08 ± 0.09 |
llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | 0 | pp512+tg128 | 56.49 ± 0.04 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | 0 | pp256 | 213.08 ± 0.22 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | 0 | pp512 | 194.34 ± 0.57 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | 0 | pp1024 | 190.32 ± 0.45 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | 0 | tg128 | 10.37 ± 0.01 |
llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | 0 | pp512+tg128 | 41.89 ± 0.04 |
llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 99 | 0 | pp256 | 271.43 ± 1.39 |
llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 99 | 0 | pp512 | 215.60 ± 0.41 |
llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 99 | 0 | pp1024 | 209.43 ± 0.34 |
llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 99 | 0 | tg128 | 5.01 ± 0.36 |
llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 99 | 0 | pp512+tg128 | 21.40 ± 0.23 |
@jart https://github.com/Mozilla-Ocho/llamafile/issues/441#issuecomment-2131841506
> As for `LLAMA_HIP_UMA=1`, do you know what, if anything, it'll do to environments that don't have this? If you know how to detect it at runtime, I could change ggml-cuda to runtime dispatch to the right implementation.
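For context, this is roughly what the `LLAMA_HIP_UMA=1` path changes in ggml's HIP allocation; a minimal sketch built from HIP's public API (hipMallocManaged / hipMemAdvise), not a verbatim copy of the PR, and the function names below are mine:

```cpp
#include <hip/hip_runtime.h>

// Sketch only: with LLAMA_HIP_UMA the weight buffers come from managed
// (unified) memory instead of dedicated VRAM, so an APU can back them
// with system RAM (GTT). Function names are illustrative, not the PR's.
static hipError_t alloc_weights_uma(void ** ptr, size_t size, int device) {
    hipError_t err = hipMallocManaged(ptr, size);
    if (err == hipSuccess) {
        // Coarse-grain advice avoids fine-grained coherence traffic,
        // which matters for read-mostly weight tensors.
        err = hipMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
    }
    return err;
}

// Without the flag the allocation stays a plain hipMalloc, so builds
// for dGPUs keep their current behaviour.
static hipError_t alloc_weights_vram(void ** ptr, size_t size) {
    return hipMalloc(ptr, size);
}
```

On a dGPU the managed path still works, but pages get migrated over PCIe on demand, which is why it is so slow there.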
What I tested is adding an option, `./llamafile --recompile --use_hip_uma`, which changes the args used for the rebuild.
As for what to detect at runtime, I don't really know... all AMD APUs?
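If runtime dispatch is wanted, one possibility (my assumption, based on the `integrated` device property that HIP mirrors from CUDA) would be to check it and only take the UMA path on an APU:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Sketch of a runtime check: 'integrated' is non-zero for an iGPU/APU
// that shares system memory, zero for a discrete GPU. It could gate
// the managed allocation instead of a compile-time flag.
static bool device_is_apu(int device) {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, device) != hipSuccess) {
        return false; // be conservative and keep the plain VRAM path
    }
    return prop.integrated != 0;
}

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) {
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        std::printf("device %d: %s\n", d,
                    device_is_apu(d) ? "APU -> UMA path" : "dGPU -> VRAM path");
    }
    return 0;
}
```

If the query fails it just falls back to the normal VRAM path, so environments without an APU should be unaffected.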
But the main concern with this option is that it uses GTT instead of VRAM... There is, however, a new change in Linux kernel 6.10 (https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs) that removes that need (I don't know what happens on Windows), so maybe a simple rebuild option is good enough.
There is something even more interesting: after some POC work (https://github.com/ggerganov/llama.cpp/issues/7399#issuecomment-2128263043) it looks like we can leave the weights mmap'ed in place and still get good performance. But it is more complicated to do properly...
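I have not worked out the details, but the rough idea (my reading of that llama.cpp thread, not the POC's actual code; the helper below is hypothetical) is to register the already mmap'ed weight region with the HIP runtime so the iGPU reads it in place:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Assumption / rough idea only: llama.cpp already mmap's the GGUF file,
// so instead of copying weights into a device buffer we pin that region
// and hand its device pointer to the backend. On an APU both pointers
// refer to the same physical RAM, so no copy is needed.
static void * map_weights_for_gpu(void * mmap_addr, size_t size) {
    if (hipHostRegister(mmap_addr, size, hipHostRegisterMapped) != hipSuccess) {
        return nullptr;
    }
    void * dev_ptr = nullptr;
    if (hipHostGetDevicePointer(&dev_ptr, mmap_addr, 0) != hipSuccess) {
        hipHostUnregister(mmap_addr);
        return nullptr;
    }
    return dev_ptr;
}
```

Making that cooperate with the existing ggml buffer types is the complicated part.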
I made a POC of that here: https://github.com/Djip007/llamafile/tree/feature/hip_uma
It adds `--use_hip_uma`, to be used together with `--recompile`.
Do you want me to open a merge request?
OK, some more benchmarks...
# GPU: HSA_OVERRIDE_GFX_VERSION=11.0.1 llamafile -m mixtral-8x7b-instruct-v0.1.Q6_K.gguf -ngl 9999 --no-mmap --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 15434.50 ms / 1466 tokens ( 10.53 ms per token, 94.98 tokens per second)
llama_print_timings: eval time = 85566.43 ms / 535 runs ( 159.94 ms per token, 6.25 tokens per second)
# CPU: llamafile -m mixtral-8x7b-instruct-v0.1.Q6_K.gguf -ngl 0 --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 31892.26 ms / 1466 tokens ( 21.75 ms per token, 45.97 tokens per second)
llama_print_timings: eval time = 89044.50 ms / 449 runs ( 198.32 ms per token, 5.04 tokens per second)
I'm thinking of finding a way to allow it to be activated by default...
With an AMD APU (like my Ryzen 7940HX) it is possible to use "UMA" to extend the VRAM. In my case I can't allocate more than 4 GB of VRAM (BIOS config).
And with this (https://github.com/ggerganov/llama.cpp/issues/7399) it may be as fast as with VRAM (I can't do a full test because I can't allocate more than 4 GB of VRAM with my config).
I can (:crossed_fingers:) make a PR here, but I need to know the best way to make it available.