Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

AMD - tinyBLAS windows prebuilt support stopped working with 0.8.5 #441

Closed · jeromew closed this 1 day ago

jeromew commented 1 month ago

Hello,

On my computer, with an "AMD 6700 XT" graphics card, tinyBLAS works with 0.8.4.

Now, with 0.8.5, it says:

import_cuda_impl: initializing gpu module...
extracting /zip/llama.cpp/ggml.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml.h
extracting /zip/llamafile/compcap.cu to /C/Users/ordib/.llamafile/v/0.8.5/compcap.cu
extracting /zip/llamafile/llamafile.h to /C/Users/ordib/.llamafile/v/0.8.5/llamafile.h
extracting /zip/llamafile/tinyblas.h to /C/Users/ordib/.llamafile/v/0.8.5/tinyblas.h
extracting /zip/llamafile/tinyblas.cu to /C/Users/ordib/.llamafile/v/0.8.5/tinyblas.cu
extracting /zip/llama.cpp/ggml-impl.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml-impl.h
extracting /zip/llama.cpp/ggml-cuda.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.h
extracting /zip/llama.cpp/ggml-alloc.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml-alloc.h
extracting /zip/llama.cpp/ggml-common.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml-common.h
extracting /zip/llama.cpp/ggml-backend.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml-backend.h
extracting /zip/llama.cpp/ggml-backend-impl.h to /C/Users/ordib/.llamafile/v/0.8.5/ggml-backend-impl.h
extracting /zip/llama.cpp/ggml-cuda.cu to /C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.cu
get_rocm_bin_path: note: amdclang++.exe not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++.exe does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++.exe does not exist
get_rocm_bin_path: note: clang++.exe not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/clang++.exe does not exist
get_rocm_bin_path: note: /opt/rocm/bin/clang++.exe does not exist
import_cuda_impl: won't compile AMD GPU support because $HIP_PATH/bin/clang++ is missing
extract_cuda_dso: note: prebuilt binary /zip/ggml-rocm.dll not found
get_nvcc_path: note: nvcc.exe not found on $PATH
get_nvcc_path: note: $CUDA_PATH/bin/nvcc.exe does not exist
get_nvcc_path: note: /opt/cuda/bin/nvcc.exe does not exist
get_nvcc_path: note: /usr/local/cuda/bin/nvcc.exe does not exist
extracting /zip/ggml-cuda.dll to /C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.dll
link_cuda_dso: note: dynamically linking /C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.dll
link_cuda_dso: warning: library not found: failed to load library

The file is present at /C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.dll when I look in the directory. In the 0.8.4 version the file is /C/Users/ordib/.llamafile/ggml-cuda.dll, and it loads correctly, logging that tinyBLAS was set up.

Note: I tried both -ngl 35 and -ngl 9999; I am not sure what the correct way is now for AMD/tinyBLAS support.
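For reference, this is roughly what I ran (the llamafile and model file names here are just placeholders):

> llamafile-0.8.5.exe -m model.gguf -ngl 35
> llamafile-0.8.5.exe -m model.gguf -ngl 9999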

Tell me if you need more information to understand the difference between 0.8.4 and 0.8.5 on this issue.

jart commented 1 month ago

Was llamafile v0.8.4's prebuilt AMD GPU support working for you on Windows?

I wasn't able to include prebuilt AMD GPU support for Windows users in the recent release for a couple of reasons, one of which is https://github.com/ggerganov/llama.cpp/issues/7156.

There's a workaround you should be able to use. You need to install the AMD ROCm "HIP SDK" on your computer. Once that's installed, llamafile will automatically compile a highly optimized GPU module just for your machine, which will give you a better experience.
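For example, once the HIP SDK (and its HIP_PATH variable) is present, something along these lines should trigger that on-device build; the model name is just a placeholder, and --recompile merely forces the rebuild:

> llamafile-0.8.5.exe -m model.gguf -ngl 9999 --recompile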

jeromew commented 1 month ago

Yes, it was the prebuilt AMD GPU support that was working with 0.8.4.

I understand that this is all moving very fast; thank you for your help.

According to https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html:

[screenshot: ROCm on Windows support matrix showing HIP SDK vs. HIP runtime support per GPU]

so it seems I can only have the HIP runtime and not the SDK

Does that mean I am out of luck with my GPU?

I will try installing the HIP SDK anyway and report here what happens

jart commented 1 month ago

You're not out of luck. The llama.cpp developers and I are working on finding a way to reduce the code size, so we can include the prebuilt ggml-rocm.dll for you in a future release very soon. I recommend just using 0.8.4 for a few weeks until that happens. Sound good?

jeromew commented 1 month ago

yes I will continue using 0.8.4 for now.

So you think it won't work on my setup with the "RX 6700 XT" even if I install the HIP SDK? It is true that for this card the AMD spec page only mentions "runtime" compatibility, so I guess that excludes the just-in-time compilation you described.

[screenshot: AMD support matrix entry for the RX 6700 XT showing "runtime" support only]

libraries are described on https://rocm.docs.amd.com/en/latest/reference/api-libraries.html

I am not sure I understand which ROCm component is needed as a dependency for the just-in-time GPU support compilation. Is it the C++ libraries that are mentioned there, i.e. hipCUB, hipTensor, rocPRIM and rocThrust?

As another solution that would not involve a GGML_MINIMIZE_CODE_SIZE flag, would it perhaps be possible to have the prebuilt AMD GPU support compiled into the standalone llamafile (I mean the llamafile.exe that needs the -m option and a gguf file)? I have started using this because some models I tested don't fit under the 4 GB limit, so it would be fine for me if the tinyBLAS AMD Windows prebuilt support were only available in the standalone llamafile runtime.

jart commented 1 month ago

Here's the Windows AMD GPU DSO I built for the last release but didn't include. You can use zipalign to add it yourself if you can make it fit: ggml-rocm.dll.zip
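If you want to try it, a rough sketch with the zipalign tool from this repo (assuming your llamafile still has room under the 4 GB Windows executable limit; the llamafile name is a placeholder):

> zipalign -j0 mymodel.llamafile ggml-rocm.dll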

I don't know what specific component is needed from ROCm. If you're proposing we bundle AMD's DSOs in our llamafile releases, I'd be reluctant to do that. I'm already unhappy about how the address space has to be tainted in order to talk to GPUs. I don't know how we'd call this project open source if our release artifacts were tainted too.

Djip007 commented 1 month ago

> So you think it won't work on my setup with the "RX 6700 XT" even if I install the HIP SDK?

I don't have Windows, but on Linux, to rebuild you need:
1. with --recompile --tinyblas => only the HIP SDK, no libs at all
2. with --recompile => the HIP SDK + hipBLAS + rocBLAS (SDK...)

(more element here: https://github.com/Mozilla-Ocho/llamafile/issues/188)

Note: I need to find time to test with the latest llamafile; it may have changed.

Djip007 commented 1 month ago

Quick test (on Linux / AMD Ryzen 9 5950X + AMD Radeon RX 6900 XT):

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile                                  --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time =     556.46 ms /    33 tokens (   16.86 ms per token,    59.30 tokens per second)
llama_print_timings:        eval time =   37776.07 ms /   133 runs   (  284.03 ms per token,     3.52 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --nocompile            --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time =     229.94 ms /    33 tokens (    6.97 ms per token,   143.52 tokens per second)
llama_print_timings:        eval time =   73144.58 ms /  1411 runs   (   51.84 ms per token,    19.29 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time =     233.25 ms /    33 tokens (    7.07 ms per token,   141.48 tokens per second)
llama_print_timings:        eval time =   38342.75 ms /   811 runs   (   47.28 ms per token,    21.15 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile            --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time =     119.48 ms /    33 tokens (    3.62 ms per token,   276.20 tokens per second)
llama_print_timings:        eval time =   26408.86 ms /   583 runs   (   45.30 ms per token,    22.08 tokens per second)

For this quick test I used the new release with the old "weights". Remove ./llamafile-0.8.6 -m if you have a new complete llamafile (a sketch of that is below).
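For illustration, running a complete llamafile directly looks like this (flags copied from the runs above, prompt elided):

> ./Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --temp 0.7 -p '[INST]...[/INST]'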

with longer prompt:

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time =    1029.84 ms /  1466 tokens (    0.70 ms per token,  1423.53 tokens per second)
llama_print_timings:        eval time =   21118.46 ms /   432 runs   (   48.89 ms per token,    20.46 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas  -p "..."
llama_print_timings: prompt eval time =    1852.72 ms /  1466 tokens (    1.26 ms per token,   791.27 tokens per second)
llama_print_timings:        eval time =   28902.66 ms /   518 runs   (   55.80 ms per token,    17.92 tokens per second)

jart commented 1 month ago

Thanks for posting your numbers!

Djip007 commented 1 month ago

V0.8.6 is really impressive for BF16 and Q6_K...

llamafile-bench-0.8.6 -p "256,512,1024" -m "mistral-7b-instruct-v0.2.BF16.gguf,mistral-7b-instruct-v0.2.F16.gguf,mistral-7b-instruct-v0.2.Q4_K_M.gguf,mistral-7b-instruct-v0.2.Q5_K_S.gguf,mistral-7b-instruct-v0.2.Q6_K.gguf,mistral-7b-instruct-v0.2.Q8_0.gguf"
cpu_info model_filename size test t/s
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.BF16 13.49 GiB pp256 102.51
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.BF16 13.49 GiB pp512 95.03
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.BF16 13.49 GiB pp1024 94.20
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.BF16 13.49 GiB tg16 4.00
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.F16 13.49 GiB pp256 63.04
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.F16 13.49 GiB pp512 61.92
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.F16 13.49 GiB pp1024 61.84
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.F16 13.49 GiB tg16 4.03
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q8_0 7.17 GiB pp256 53.83
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q8_0 7.17 GiB pp512 53.19
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q8_0 7.17 GiB pp1024 52.33
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q8_0 7.17 GiB tg16 7.26
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q6_K 5.53 GiB pp256 88.61
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q6_K 5.53 GiB pp512 85.63
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q6_K 5.53 GiB pp1024 82.87
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q6_K 5.53 GiB tg16 9.11
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q5_K_S 4.65 GiB pp256 64.35
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q5_K_S 4.65 GiB pp512 82.24
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q5_K_S 4.65 GiB pp1024 80.29
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q5_K_S 4.65 GiB tg16 11.33
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q4_K_M 4.07 GiB pp256 89.18
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q4_K_M 4.07 GiB pp512 82.77
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q4_K_M 4.07 GiB pp1024 83.03
AMD Ryzen 9 7940HS (znver4) mistral-7b-instruct-v0.2.Q4_K_M 4.07 GiB tg16 11.19

(you can compare with https://github.com/Mozilla-Ocho/llamafile/issues/439#issuecomment-2130079810 for current llama.cpp)

Djip007 commented 1 month ago

Is it possible to use llamafile-bench with a GPU?

jart commented 1 month ago

Try passing the -fa flag for flash attention, which makes it go even faster. I don't like the GPU implementation, but the CPU impl is great. I'm able to get 961 tok/sec at prompt processing with Mistral on a Threadripper Pro. That's a 20% speed boost for me. It's one of the most excellent performance optimizations I've seen from @ggerganov recently. Why not enable it by default?

[screenshot: prompt processing benchmark showing the flash attention speedup]

llamafile-bench will support GPU soon. It's a bit trickier because llama-bench was designed in a way that assumes GPU support was figured out at compile-time. So it'll likely take some overhauling.

Djip007 commented 1 month ago

In my case it is slower... did I make a mistake?

#> ryzen 7940HS:
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -fa -ngl 0 --temp 0 -c 2048 
llama_print_timings: prompt eval time =   18860.70 ms /  1466 tokens (   12.87 ms per token,    77.73 tokens per second)
llama_print_timings:        eval time =  120744.94 ms /   437 runs   (  276.30 ms per token,     3.62 tokens per second)

> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf     -ngl 0 --temp 0 -c 2048 
llama_print_timings: prompt eval time =   17340.75 ms /  1466 tokens (   11.83 ms per token,    84.54 tokens per second)
llama_print_timings:        eval time =  103088.90 ms /   384 runs   (  268.46 ms per token,     3.72 tokens per second)

(for GPU I have to rebuild with LLAMA_HIP_UMA=1 after making some modifications to llamafile)

jart commented 1 month ago

Interesting, so in some environments it can make things slower. I wonder why that is. Maybe that's why it isn't enabled by default. Thanks for sharing this. As for LLAMA_HIP_UMA=1, do you know what, if anything, it'll do to environments that don't have this? If you know how to detect it at runtime, I could change ggml-cuda to runtime dispatch to the right implementation.
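One hedged idea, not verified: hipInfo reports an isIntegrated field per device, so an APU/UMA system might be detectable that way at runtime, assuming the hipInfo sample tool is available and assuming an APU reports 1 while a discrete card reports 0:

> hipInfo | grep isIntegrated
isIntegrated:                     0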

Djip007 commented 1 month ago

for GPU

#> AMD Radeon RX 6900 XT:
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time =    1029.84 ms /  1466 tokens (    0.70 ms per token,  1423.53 tokens per second)
llama_print_timings:        eval time =   21118.46 ms /   432 runs   (   48.89 ms per token,    20.46 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -fa -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time =    1298.41 ms /  1466 tokens (    0.89 ms per token,  1129.07 tokens per second)
llama_print_timings:        eval time =   25759.27 ms /   494 runs   (   52.14 ms per token,    19.18 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas  -p "..."
llama_print_timings: prompt eval time =    1855.15 ms /  1466 tokens (    1.27 ms per token,   790.23 tokens per second)
llama_print_timings:        eval time =   21282.14 ms /   384 runs   (   55.42 ms per token,    18.04 tokens per second)

> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -fa -ngl 9999 --recompile --tinyblas  -p "..."
llama_print_timings: prompt eval time =    1923.64 ms /  1466 tokens (    1.31 ms per token,   762.10 tokens per second)
llama_print_timings:        eval time =   19991.50 ms /   384 runs   (   52.06 ms per token,    19.21 tokens per second)

jart commented 1 month ago

It looks like flash attention is still a work in progress for AMD GPUs. It's probably due to it being 6 MB of code. AMD GPUs usually have smaller instruction caches and are more sensitive than NVIDIA to code size issues.

Djip007 commented 1 month ago

> It looks like flash attention is still a work in progress for AMD GPUs. It's probably due to it being 6 MB of code. AMD GPUs usually have smaller instruction caches and are more sensitive than NVIDIA to code size issues.

I need to go to bed... but I will add HIP_UMA (and the optimisation) and test with that on the Ryzen 7940HS tomorrow.

Djip007 commented 1 month ago

OK, I made a patch (https://github.com/Djip007/llamafile/tree/feature/hip_uma).

Some results (BF16 on CPU, FP16 on GPU):

#> ryzen 7940HS:
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -fa -ngl 0 --temp 0 -c 2048 
llama_print_timings: prompt eval time =   18860.70 ms /  1466 tokens (   12.87 ms per token,    77.73 tokens per second)
llama_print_timings:        eval time =  120744.94 ms /   437 runs   (  276.30 ms per token,     3.62 tokens per second)

> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf     -ngl 0 --temp 0 -c 2048 
llama_print_timings: prompt eval time =   17340.75 ms /  1466 tokens (   11.83 ms per token,    84.54 tokens per second)
llama_print_timings:        eval time =  103088.90 ms /   384 runs   (  268.46 ms per token,     3.72 tokens per second)

>>- with HIP_UMA
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf     -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =   31051.46 ms /  1466 tokens (   21.18 ms per token,    47.21 tokens per second)
llama_print_timings:        eval time =  138180.55 ms /   384 runs   (  359.85 ms per token,     2.78 tokens per second)

> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf     -ngl 9999 --recompile            --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =   14817.91 ms /  1466 tokens (   10.11 ms per token,    98.93 tokens per second)
llama_print_timings:        eval time =  157568.49 ms /   635 runs   (  248.14 ms per token,     4.03 tokens per second)

>>- with HIP_UMA+"CoarseGrain patch"
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -fa -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =   12391.85 ms /  1466 tokens (    8.45 ms per token,   118.30 tokens per second)
llama_print_timings:        eval time =  102629.02 ms /   384 runs   (  267.26 ms per token,     3.74 tokens per second)

> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf     -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =   11119.20 ms /  1466 tokens (    7.58 ms per token,   131.84 tokens per second)
llama_print_timings:        eval time =   83272.67 ms /   384 runs   (  216.86 ms per token,     4.61 tokens per second)

> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -fa -ngl 9999 --recompile            --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =    9719.47 ms /  1466 tokens (    6.63 ms per token,   150.83 tokens per second)
llama_print_timings:        eval time =  114512.12 ms /   437 runs   (  262.04 ms per token,     3.82 tokens per second)

> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf     -ngl 9999 --recompile            --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =    7208.44 ms /  1466 tokens (    4.92 ms per token,   203.37 tokens per second)
llama_print_timings:        eval time =  101313.12 ms /   507 runs   (  199.83 ms per token,     5.00 tokens per second)

as you see:

jeromew commented 1 month ago

> Here's the Windows AMD GPU DSO I built for the last release but didn't include. You can use zipalign to add it yourself if you can make it fit: ggml-rocm.dll.zip
>
> I don't know what specific component is needed from ROCm. If you're proposing we bundle AMD's DSOs in our llamafile releases, I'd be reluctant to do that. I'm already unhappy about how the address space has to be tainted in order to talk to GPUs. I don't know how we'd call this project open source if our release artifacts were tainted too.

I tested the ggml-rocm.dll you provided by simply putting it in the .llamafile/v/0.8.5/ directory and it worked.

I am not totally familiar yet with the way the releases are built. I thought that 0.8.4 came bundled with the ggml-rocm.dll, so my idea was that:

Also, I am not sure whether this could help here (because of the needed alignments), but long ago, when I had high compression requirements on Windows, I used https://upx.github.io/ with good success.
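Just to show what I mean, plain UPX usage looks like this (only an illustration of the tool itself, not something llamafile supports today):

> upx --best ggml-rocm.dll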

jeromew commented 1 month ago

I tried to install HIP SDK (version 5.7)

It added a HIP_PATH environment variable set to C:\Program Files\AMD\ROCm\5.7\bin.

The available .exe files are:

[screenshot: list of the .exe files in the ROCm 5.7 bin directory]

so when llamafile looks for amdclang++.exe

get_rocm_bin_path: note: amdclang++.exe not found on $PATH
get_rocm_bin_path: note: /C/Program Files/AMD/ROCm/5.7//bin/amdclang++.exe does not exist

then it looks for clang++.exe but cannot find it.

get_rocm_bin_path: note: clang++.exe not found on $PATH

Should it look for the one in HIP_PATH?

then it looks for hipInfo.exe in $PATH but cannot find it

get_rocm_bin_path: note: hipInfo.exe not found on $PATH

then it seems to find it in $HIP_PATH

llamafile_log_command: "/C/Program Files/AMD/ROCm/5.7//bin/hipInfo.exe"

But it doesn't seem to find a graphics card, even though there is no doubt I have an AMD Radeon RX 6700 XT (arch gfx1031):


--------------------------------------------------------------------------------
device#                           0
Name:                             AMD Radeon RX 6700 XT
pciBusID:                         45
pciDeviceID:                      0
pciDomainID:                      0
multiProcessorCount:              20
maxThreadsPerMultiProcessor:      2048
isMultiGpuBoard:                  0
clockRate:                        2424 Mhz
memoryClockRate:                  1000 Mhz
memoryBusWidth:                   0
totalGlobalMem:                   11.98 GB
totalConstMem:                    2147483647
sharedMemPerBlock:                64.00 KB
canMapHostMemory:                 1
regsPerBlock:                     0
warpSize:                         32
l2CacheSize:                      4194304
computeMode:                      0
maxThreadsPerBlock:               1024
maxThreadsDim.x:                  1024
maxThreadsDim.y:                  1024
maxThreadsDim.z:                  1024
maxGridSize.x:                    2147483647
maxGridSize.y:                    65536
maxGridSize.z:                    65536
major:                            10
minor:                            3
concurrentKernels:                1
cooperativeLaunch:                0
cooperativeMultiDeviceLaunch:     0
isIntegrated:                     0
maxTexture1D:                     16384
maxTexture2D.width:               16384
maxTexture2D.height:              16384
maxTexture3D.width:               2048
maxTexture3D.height:              2048
maxTexture3D.depth:               2048
isLargeBar:                       0
asicRevision:                     0
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate:             1000.00 Mhz
arch.hasGlobalInt32Atomics:       1
arch.hasGlobalFloatAtomicExch:    1
arch.hasSharedInt32Atomics:       1
arch.hasSharedFloatAtomicExch:    1
arch.hasFloatAtomicAdd:           1
arch.hasGlobalInt64Atomics:       1
arch.hasSharedInt64Atomics:       1
arch.hasDoubles:                  1
arch.hasWarpVote:                 1
arch.hasWarpBallot:               1
arch.hasWarpShuffle:              1
arch.hasFunnelShift:              0
arch.hasThreadFenceSystem:        1
arch.hasSyncThreadsExt:           0
arch.hasSurfaceFuncs:             0
arch.has3dGrid:                   1
arch.hasDynamicParallelism:       0
gcnArchName:                      gfx1031
peers:
non-peers:                        device#0

memInfo.total:                    11.98 GB
memInfo.free:                     11.86 GB (99%)

get_amd_offload_arch_flag: warning: hipInfo output didn't list any graphics cards

And then it tries to fall back on the prebuilt AMD GPU support for Windows but does not find it, which is expected for 0.8.5 and 0.8.6.

Note that the "missing graphics card" problem is also mentioned in https://github.com/Mozilla-Ocho/llamafile/issues/446

I tried adding $HIP_PATH to $PATH to force it to find clang++.exe, which it then does, but the "missing graphics card" issue still stops the compilation from happening.
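For reference, adding that directory to PATH in PowerShell looked roughly like this (the directory is the bin folder from the HIP SDK install above):

> $env:PATH = "C:\Program Files\AMD\ROCm\5.7\bin;" + $env:PATH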

jeromew commented 1 month ago

I checked the get_amd_offload_arch_flag in https://github.com/Mozilla-Ocho/llamafile/blob/397175e673c4334962f446d9470e3bceefc88fb0/llamafile/cuda.c#L286

The parsing algorithm seems correct and correctly finds gfx1031 in a small C copy I made of just the algorithm, run against the saved hipInfo.exe output.

Could the problem come from the execution stream pipefds[0] not receiving the correct output of the executed hipInfo.exe?

for now I cannot test that in my compilation setup.
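As a basic sanity check, one can at least dump the hipInfo.exe output and confirm the arch line really is there (PowerShell; this just reproduces the output already pasted above):

> & "C:\Program Files\AMD\ROCm\5.7\bin\hipInfo.exe" | findstr gcnArchName
gcnArchName:                      gfx1031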

jeromew commented 2 weeks ago

There was a hipInfo.exe issue in 0.8.6 that stopped the ggml-rocm.dll from being compiled (the hipInfo.exe output was not captured). This was fixed in https://github.com/Mozilla-Ocho/llamafile/commit/7d8dd1b33fd54e9e54d4ad8074f8df64e547b75d but no new release has been published since then.

@jart, I compiled a version of llamafile with Cosmopolitan v3.3.10 to see whether this version could now build its own ggml-rocm.dll via its embedded compilation mechanism.

For this I installed

Removed

and launched llamafile.exe -m ./llava-v1.5-7b-Q4_K.gguf -ngl 999

After compilation, the PowerShell terminal started shaking and repeating:

/D/jerome/dev/llamafile.exe -m ./llava-v1.5-7b-Q4_K.gguf -ngl 999

error: Uncaught SIGSEGV (SEGV_ACCERR) at 0x7ffba90a4f60 on ordib pid 19964 tid 22268
  /D/jerome/dev/llamafile.exe
  No such file or directory
  Windows Cosmopolitan 3.3.10 MODE=x86_64 ordib 0.0-0

RAX ebebfca9cd370000 RBX 00007000007d85c8 RDI 0000000000000000
RCX 00007000007d8588 RDX 00007000007d85d0 RSI 00007000007d8588
RBP 00007000007d8780 RSP 00007000007d8538 RIP 00007ffba90a4f60
 R8 00007ffbc7bf471a  R9 0000000000210150 R10 0000000000210150
R11 000000000295fbaa R12 0000000000000046 R13 000000000000004f
R14 00007000007d85d0 R15 00007ffaf6266cd0
TLS 00000000008efe40

XMM0  00000000000000000000000000000000 XMM8  00000000000000000000000000000000
XMM1  000000000000000d0000000000000000 XMM9  00000000000000000000000000000000
XMM2  000000000000000000000000000000f8 XMM10 00000000000000000000000000000000
XMM3  000000000295fba000007000007d8160 XMM11 00000000000000000000000000000000
XMM4  00007ffbc7c41d910000000243000142 XMM12 00000000000000000000000000000000
XMM5  000004000000040000000000000000d7 XMM13 00000000000000000000000000000000
XMM6  000000000000004d0000000000000060 XMM14 00000000000000000000000000000000
XMM7  00730072006500730055005c003a0043 XMM15 00000000000000000000000000000000

cosmoaddr2line /D/jerome/dev/llamafile.exe 7ffba90a4f60 7000007d8968

000000b38ef0 7ffba90a4f60 NULL+0
7000007d8780 7000007d8968 std::__1::basic_string<wchar_t, std::__1::char_traits<wchar_t>, std::__1::allocator<wchar_t>>::push_back(wchar_t)+72
<dangerous frame>

Now, if I remove the ggml-rocm.dll and ggml-rocm.dll.lib, download https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/rocm.bat, and then execute rocm.bat, it compiles ggml-rocm.dll and ggml-rocm.lib, and then llamafile.exe -m ./llava-v1.5-7b-Q4_K.gguf -ngl 999 works correctly using ggml-rocm.dll!

Note that this works both with the original rocm.bat and when I edit it to replace --offload-arch=gfx1010,gfx1012,gfx906,gfx1030,gfx1031,gfx1032,gfx1100,gfx1101,gfx1102,gfx1103 with --offload-arch=gfx1031 which is my arch.

In the rocm.bat version, the command is

"C:\Program Files\AMD\ROCm\5.7\\bin\clang++.exe"   -fuse-ld=lld   -shared   -nostartfiles   -nostdlib   -DGGML_BUILD=1   -DGGML_SHARED=1   -Wno-ignored-attributes   -DGGML_CUDA_DMMV_X=32   -DGGML_CUDA_MMV_Y=1   -DGGML_USE_HIPBLAS   -DGGML_USE_TINYBLAS   -DGGML_MINIMIZE_CODE_SIZE   -DK_QUANTS_PER_ITERATION=2   -D_CRT_SECURE_NO_WARNINGS   -D_XOPEN_SOURCE=600   -D__HIP_PLATFORM_AMD__=1   -D__HIP_PLATFORM_HCC__=1   -isystem "C:\Program Files\AMD\ROCm\5.7\\include"   -O3   -DNDEBUG   -D_DLL   -D_MT   -Xclang --dependent-lib=msvcrt   -std=gnu++14   -mllvm -amdgpu-early-inline-all=true   -mllvm -amdgpu-function-calls=false   -x hip   --hip-link   --offload-arch=gfx1031   -o ggml-rocm.dll   ggml-cuda.cu   "-lC:\Program Files\AMD\ROCm\5.7\\lib\amdhip64.lib"   -lkernel32

while in the auto-compilation procedure the log shows

"/C/Program Files/AMD/ROCm/5.7//bin/clang++.exe" -O3 -shared -x hip --hip-link -std=gnu++14 -fuse-ld=lld -DGGML_USE_HIPBLAS -Wno-return-type -Wno-unused-result -Wno-unused-function -Wno-expansion-to-defined --offload-arch=gfx1031 -Wno-ignored-attributes -D_CRT_SECURE_NO_WARNINGS -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -o /C/Users/ordib/.llamafile/v/0.8.6/ggml-rocm.dll.r2q27d /C/Users/ordib/.llamafile/v/0.8.6/ggml-cuda.cu -Xclang --dependent-lib=msvcrt -mllvm -amdgpu-function-calls=false -mllvm -amdgpu-early-inline-all=true -isystem "/C/Program Files/AMD/ROCm/5.7//include" -l "/C/Program Files/AMD/ROCm/5.7//lib/hipblas.lib" -l "/C/Program Files/AMD/ROCm/5.7//lib/rocblas.lib" -l "/C/Program Files/AMD/ROCm/5.7//lib/amdhip64.lib" -lkernel32

The main difference seems to be that the auto-compilation procedure compiles with -DGGML_USE_CUBLAS while the rocm.bat procedure compiles with -DGGML_USE_TINYBLAS.

And indeed, if I run .\llamafile.exe --cli -m .\llava-v1.5-7b-Q4_K.gguf -ngl 9999 --tinyblas, it auto-compiles the tinyBLAS version and it works.

Note that the --tinyblas option does not work in the server case (it needs the --cli option, as it does not seem to be applied in the server_cli code path).

Now, I don't know why GGML_USE_CUBLAS causes an issue while GGML_USE_TINYBLAS works.

Could it be because of my graphics card? Does the SIGSEGV (SEGV_ACCERR) message give you a clue about the root cause of this issue?

What is the difference between the tinyBLAS and CUBLAS support, and do you think this can be solved, or is it a problem inside the proprietary AMD SDK?

Djip007 commented 2 weeks ago

If I am correct in your case:

so it may be some "bug" with rocBLAS... can you try with ROCm v6.1 (if it exists on Windows; I only use Linux...)?

jeromew commented 2 weeks ago

Indeed, there was a log message in the console that I initially did not see:

rocBLAS error: Cannot read C:\Program Files\AMD\ROCm\5.7\bin\/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1031

So it is indeed a problem with the rocBLAS support for gfx1031 on Windows.

It does not seem to be officially supported on Linux either, according to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html

But I saw that there may be a way to make it work by recompiling rocBLAS from source, or simply by adding Tensile files and kernels pre-compiled for gfx1031 inside ../rocblas/library/ (cf. the mentions in https://github.com/LostRuins/koboldcpp/issues/441).
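To illustrate the second option, the idea is just to drop the pre-compiled gfx1031 files next to the existing ones (a hedged sketch only: the exact file names depend on where the gfx1031 Tensile files come from, since they don't ship with the HIP SDK; the destination is the directory from the rocBLAS error above):

> copy *gfx1031* "C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library\"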

I tried that. It makes the CUBLAS support work on my gfx1031, but there do not seem to be any performance gains compared to tinyBLAS (~50 tokens/sec in both cases). I expected CUBLAS to bring a significant performance boost, but that does not seem to be the case in my setup. I will have to dig further to understand whether this is to be expected or not.

jeromew commented 1 day ago

Considered fixed after the release of 0.8.7.