Closed jeromew closed 4 months ago
Was the llamafile v0.8.4 prebuilt AMD GPU support working for you on Windows?
I wasn't able to include prebuilt AMD GPU support for Windows users in the recent release for a couple of reasons, one of which being https://github.com/ggerganov/llama.cpp/issues/7156.
There's a workaround you should be able to use. You need to install the AMD ROCm "HIP SDK" on your computer. Once that's installed, llamafile will automatically compile a highly optimized GPU module just for your machine, which will give you a better experience.
Yes, it was the prebuilt AMD GPU support that was working with 0.8.4.
I understand that this is all moving very fast; thank you for your help.
According to https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html, it seems I can only have the HIP runtime and not the SDK.
Does that mean I am out of luck with my GPU?
I will try installing the HIP SDK anyway and report here what happens.
You're not out of luck. The llama.cpp developers and I are working on finding a way to reduce the code size, so we can include the prebuilt ggml-rocm.dll for you in a future release very soon. I recommend just using 0.8.4 for a few weeks until that happens. Sound good?
Yes, I will continue using 0.8.4 for now.
So you think it won't work on my setup with an "RX 6700 XT" even if I install the HIP SDK? It is true that for this card the AMD spec page only talks about "runtime" compatibility, so I guess that excludes the just-in-time compilation you described.
The libraries are described at https://rocm.docs.amd.com/en/latest/reference/api-libraries.html.
I am not sure I understand which ROCm component is needed as a dependency for the just-in-time GPU support compilation. Is it the C++ libraries mentioned there?
As another solution that would not involve a GGML_MINIMIZE_CODE_SIZE flag, would it be possible to have the prebuilt AMD GPU support compiled into the standalone llamafile (I mean the llamafile.exe that needs the -m option and a GGUF file)?
I have started using it because some models I tested don't fit under the 4 GB limit, so it would be fine for me if the tinyBLAS AMD Windows prebuilt support were only available in the standalone llamafile runtime.
Here's the Windows AMD GPU DSO I built for the last release that wasn't included. You can use zipalign to include it yourself if you can make it fit. ggml-rocm.dll.zip
I don't know what specific component is needed from ROCm. If you're proposing we bundle AMD's DSOs in our llamafile releases, I'd be reluctant to do that. I'm already unhappy about how the address space has to be tainted in order to talk to GPUs. I don't know how we'd call this project open source if our release artifacts were tainted too.
So you think it won't work on my setup with an "RX 6700 XT" even if I install the HIP SDK?
I don't have Windows, but on Linux, to rebuild you need:
1. With --recompile --tinyblas: only the HIP SDK, no libraries at all.
2. With --recompile: the HIP SDK plus hipBLAS and rocBLAS (the SDK...).
(More details here: https://github.com/Mozilla-Ocho/llamafile/issues/188)
Note: I need to find time to test with the latest llamafile; it may have changed.
quick test (on linux / AMD Ryzen 9 5950X + AMD Radeon RX 6900 XT)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 556.46 ms / 33 tokens ( 16.86 ms per token, 59.30 tokens per second)
llama_print_timings: eval time = 37776.07 ms / 133 runs ( 284.03 ms per token, 3.52 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --nocompile --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 229.94 ms / 33 tokens ( 6.97 ms per token, 143.52 tokens per second)
llama_print_timings: eval time = 73144.58 ms / 1411 runs ( 51.84 ms per token, 19.29 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 233.25 ms / 33 tokens ( 7.07 ms per token, 141.48 tokens per second)
llama_print_timings: eval time = 38342.75 ms / 811 runs ( 47.28 ms per token, 21.15 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 119.48 ms / 33 tokens ( 3.62 ms per token, 276.20 tokens per second)
llama_print_timings: eval time = 26408.86 ms / 583 runs ( 45.30 ms per token, 22.08 tokens per second)
For a quick test, use the new release with the old "weights"; drop the ./llamafile-0.8.6 -m part if you have a new complete llamafile.
with longer prompt:
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time = 1029.84 ms / 1466 tokens ( 0.70 ms per token, 1423.53 tokens per second)
llama_print_timings: eval time = 21118.46 ms / 432 runs ( 48.89 ms per token, 20.46 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas -p "..."
llama_print_timings: prompt eval time = 1852.72 ms / 1466 tokens ( 1.26 ms per token, 791.27 tokens per second)
llama_print_timings: eval time = 28902.66 ms / 518 runs ( 55.80 ms per token, 17.92 tokens per second)
Thanks for posting your numbers!
V0.8.6 is really impressive for BF16 and Q6_K...
llamafile-bench-0.8.6 -p "256,512,1024" -m "mistral-7b-instruct-v0.2.BF16.gguf,mistral-7b-instruct-v0.2.F16.gguf,mistral-7b-instruct-v0.2.Q4_K_M.gguf,mistral-7b-instruct-v0.2.Q5_K_S.gguf,mistral-7b-instruct-v0.2.Q6_K.gguf,mistral-7b-instruct-v0.2.Q8_0.gguf"
cpu_info | model_filename | size | test | t/s |
---|---|---|---|---|
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.BF16 | 13.49 GiB | pp256 | 102.51 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.BF16 | 13.49 GiB | pp512 | 95.03 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.BF16 | 13.49 GiB | pp1024 | 94.20 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.BF16 | 13.49 GiB | tg16 | 4.00 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.F16 | 13.49 GiB | pp256 | 63.04 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.F16 | 13.49 GiB | pp512 | 61.92 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.F16 | 13.49 GiB | pp1024 | 61.84 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.F16 | 13.49 GiB | tg16 | 4.03 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q8_0 | 7.17 GiB | pp256 | 53.83 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q8_0 | 7.17 GiB | pp512 | 53.19 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q8_0 | 7.17 GiB | pp1024 | 52.33 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q8_0 | 7.17 GiB | tg16 | 7.26 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q6_K | 5.53 GiB | pp256 | 88.61 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q6_K | 5.53 GiB | pp512 | 85.63 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q6_K | 5.53 GiB | pp1024 | 82.87 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q6_K | 5.53 GiB | tg16 | 9.11 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q5_K_S | 4.65 GiB | pp256 | 64.35 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q5_K_S | 4.65 GiB | pp512 | 82.24 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q5_K_S | 4.65 GiB | pp1024 | 80.29 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q5_K_S | 4.65 GiB | tg16 | 11.33 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q4_K_M | 4.07 GiB | pp256 | 89.18 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q4_K_M | 4.07 GiB | pp512 | 82.77 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q4_K_M | 4.07 GiB | pp1024 | 83.03 |
AMD Ryzen 9 7940HS (znver4) | mistral-7b-instruct-v0.2.Q4_K_M | 4.07 GiB | tg16 | 11.19 |
(You can compare with https://github.com/Mozilla-Ocho/llamafile/issues/439#issuecomment-2130079810 for upstream llama.cpp.)
is it possible to use llamafile-bench with GPU?
Try passing the -fa flag for flash attention, which makes it go even faster. I don't like the GPU implementation but the CPU impl is great. I'm able to get 961 tok/sec at prompt processing with Mistral on a Threadripper Pro. That's a 20% speed boost for me. It's one of the most excellent performance optimizations I've seen from @ggerganov recently. Why not enable it by default?
llamafile-bench will support GPU soon. It's a bit trickier because llama-bench was designed in a way that assumes GPU support was figured out at compile-time. So it'll likely take some overhauling.
In my case it is slower... did I make a mistake?
#> ryzen 7940HS:
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -fa -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 18860.70 ms / 1466 tokens ( 12.87 ms per token, 77.73 tokens per second)
llama_print_timings: eval time = 120744.94 ms / 437 runs ( 276.30 ms per token, 3.62 tokens per second)
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 17340.75 ms / 1466 tokens ( 11.83 ms per token, 84.54 tokens per second)
llama_print_timings: eval time = 103088.90 ms / 384 runs ( 268.46 ms per token, 3.72 tokens per second)
(For GPU I have to rebuild with LLAMA_HIP_UMA=1 after making some modifications to llamafile.)
Interesting, so in some environments it can make things slower. I wonder why that is. Maybe that's why it isn't enabled by default. Thanks for sharing this. As for LLAMA_HIP_UMA=1, do you know what, if anything, it'll do in environments that don't have this? If you know how to detect it at runtime, I could change ggml-cuda to runtime dispatch to the right implementation.
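For what it's worth, a rough sketch of how such a runtime check could look with the HIP API (an illustration only, not the actual ggml-cuda code; the helper name is made up):

```cpp
// Sketch: detect whether the active HIP device is an integrated APU
// (shared/unified memory), i.e. where HIP_UMA-style allocation makes sense.
#include <hip/hip_runtime.h>

static bool device_is_integrated_apu(int device) {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, device) != hipSuccess)
        return false;
    // `integrated` is nonzero for APUs that share memory with the host.
    return prop.integrated != 0;
}
```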
For the GPU:
#> AMD Radeon RX 6900 XT:
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time = 1029.84 ms / 1466 tokens ( 0.70 ms per token, 1423.53 tokens per second)
llama_print_timings: eval time = 21118.46 ms / 432 runs ( 48.89 ms per token, 20.46 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -fa -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time = 1298.41 ms / 1466 tokens ( 0.89 ms per token, 1129.07 tokens per second)
llama_print_timings: eval time = 25759.27 ms / 494 runs ( 52.14 ms per token, 19.18 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas -p "..."
llama_print_timings: prompt eval time = 1855.15 ms / 1466 tokens ( 1.27 ms per token, 790.23 tokens per second)
llama_print_timings: eval time = 21282.14 ms / 384 runs ( 55.42 ms per token, 18.04 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -fa -ngl 9999 --recompile --tinyblas -p "..."
llama_print_timings: prompt eval time = 1923.64 ms / 1466 tokens ( 1.31 ms per token, 762.10 tokens per second)
llama_print_timings: eval time = 19991.50 ms / 384 runs ( 52.06 ms per token, 19.21 tokens per second)
It looks like flash attention is still a work in progress for AMD GPUs. It's probably due to it being 6mb of code. AMD GPUs usually have smaller instruction caches and are more sensitive than NVIDIA to code size issues.
I need to go to bed... but I will add HIP_UMA (and the optimization) and test with that on the Ryzen 7940HS tomorrow.
OK, I made a patch (https://github.com/Djip007/llamafile/tree/feature/hip_uma) that adds a --use-hip-uma flag: when rebuilding, it uses HIP_UMA for GPU memory allocation, together with hipMemAdviseSetCoarseGrain for best performance.
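Roughly, the idea behind the allocation path is something like this (a simplified sketch of the approach, not the exact patch code):

```cpp
// Sketch: allocate GPU buffers through HIP managed (UMA) memory and advise
// the driver to treat them as coarse-grained, which is cheaper on APUs.
#include <hip/hip_runtime.h>

static hipError_t alloc_uma(void **ptr, size_t size, int device) {
    hipError_t err = hipMallocManaged(ptr, size);
    if (err != hipSuccess)
        return err;
    // Coarse-grained memory skips fine-grained host/device coherence.
    return hipMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
}
```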
Some results (BF16 on CPU, FP16 on GPU):
#> ryzen 7940HS:
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -fa -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 18860.70 ms / 1466 tokens ( 12.87 ms per token, 77.73 tokens per second)
llama_print_timings: eval time = 120744.94 ms / 437 runs ( 276.30 ms per token, 3.62 tokens per second)
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 17340.75 ms / 1466 tokens ( 11.83 ms per token, 84.54 tokens per second)
llama_print_timings: eval time = 103088.90 ms / 384 runs ( 268.46 ms per token, 3.72 tokens per second)
>>- with HIP_UMA
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 31051.46 ms / 1466 tokens ( 21.18 ms per token, 47.21 tokens per second)
llama_print_timings: eval time = 138180.55 ms / 384 runs ( 359.85 ms per token, 2.78 tokens per second)
> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 14817.91 ms / 1466 tokens ( 10.11 ms per token, 98.93 tokens per second)
llama_print_timings: eval time = 157568.49 ms / 635 runs ( 248.14 ms per token, 4.03 tokens per second)
>>- with HIP_UMA+"CoarseGrain patch"
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -fa -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 12391.85 ms / 1466 tokens ( 8.45 ms per token, 118.30 tokens per second)
llama_print_timings: eval time = 102629.02 ms / 384 runs ( 267.26 ms per token, 3.74 tokens per second)
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 11119.20 ms / 1466 tokens ( 7.58 ms per token, 131.84 tokens per second)
llama_print_timings: eval time = 83272.67 ms / 384 runs ( 216.86 ms per token, 4.61 tokens per second)
> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -fa -ngl 9999 --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 9719.47 ms / 1466 tokens ( 6.63 ms per token, 150.83 tokens per second)
llama_print_timings: eval time = 114512.12 ms / 437 runs ( 262.04 ms per token, 3.82 tokens per second)
> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 7208.44 ms / 1466 tokens ( 4.92 ms per token, 203.37 tokens per second)
llama_print_timings: eval time = 101313.12 ms / 507 runs ( 199.83 ms per token, 5.00 tokens per second)
As you can see, -fa is bad on AMD GPUs (RDNA2 and the RDNA3 APU).
I tested the ggml-rocm.dll you provided by simply putting it in the .llamafile/v/0.8.5/ directory and it worked.
I am not totally familiar yet with the way the releases are built. I thought that 0.8.4 came bundled with the ggml-rocm.dll, so my idea was that:
Also, I am not sure if this could help here (because of the needed alignments), but long ago, when I had high compression requirements on Windows, I used https://upx.github.io/ with good success.
I tried to install the HIP SDK (version 5.7).
It added a HIP_PATH environment variable with C:\Program Files\AMD\ROCm\5.7\bin.
The available .exe files are:
So when llamafile looks for amdclang++.exe:
get_rocm_bin_path: note: amdclang++.exe not found on $PATH
get_rocm_bin_path: note: /C/Program Files/AMD/ROCm/5.7//bin/amdclang++.exe does not exist
Then it looks for clang++.exe but cannot find it:
get_rocm_bin_path: note: clang++.exe not found on $PATH
Should it look for the one in HIP_PATH?
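For illustration, the fallback I have in mind would look roughly like this (a hypothetical sketch, not the actual get_rocm_bin_path code):

```cpp
// Sketch: after searching $PATH, fall back to %HIP_PATH%\bin for the tool.
// Hypothetical helper for illustration only.
#include <stdio.h>
#include <stdlib.h>

static int find_in_hip_path(const char *exe, char *out, size_t out_size) {
    const char *hip_path = getenv("HIP_PATH");
    if (!hip_path)
        return 0;
    snprintf(out, out_size, "%s\\bin\\%s", hip_path, exe);
    return 1;  // the caller should still check that the file actually exists
}
```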
Then it looks for hipInfo.exe in $PATH but cannot find it:
get_rocm_bin_path: note: hipInfo.exe not found on $PATH
Then it seems to find it in $HIP_PATH:
llamafile_log_command: "/C/Program Files/AMD/ROCm/5.7//bin/hipInfo.exe"
But it doesn't seem to find a graphics card, even though there is no doubt I have an AMD Radeon RX 6700 XT (arch gfx1031):
--------------------------------------------------------------------------------
device# 0
Name: AMD Radeon RX 6700 XT
pciBusID: 45
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 20
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 2424 Mhz
memoryClockRate: 1000 Mhz
memoryBusWidth: 0
totalGlobalMem: 11.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 0
warpSize: 32
l2CacheSize: 4194304
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 65536
maxGridSize.z: 65536
major: 10
minor: 3
concurrentKernels: 1
cooperativeLaunch: 0
cooperativeMultiDeviceLaunch: 0
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 2048
maxTexture3D.height: 2048
maxTexture3D.depth: 2048
isLargeBar: 0
asicRevision: 0
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx1031
peers:
non-peers: device#0
memInfo.total: 11.98 GB
memInfo.free: 11.86 GB (99%)
get_amd_offload_arch_flag: warning: hipInfo output didn't list any graphics cards
And then it tries to fall back on the prebuilt AMD GPU support for Windows but does not find it, which is expected for 0.8.5 and 0.8.6.
Note that the "missing graphics card" problem is also mentioned in https://github.com/Mozilla-Ocho/llamafile/issues/446
I tried adding $HIP_PATH to $PATH to force it into finding clang++.exe, which it indeed finds, but the "missing graphics card" issue seems to stop the compilation from happening.
I checked get_amd_offload_arch_flag in https://github.com/Mozilla-Ocho/llamafile/blob/397175e673c4334962f446d9470e3bceefc88fb0/llamafile/cuda.c#L286. The parsing algorithm seems correct and correctly finds gfx1031 in a small C copy I made of just the algorithm, run on the saved hipInfo.exe output.
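Roughly, the small test I did looks like this (a simplified sketch of only the parsing step, not the actual cuda.c code):

```cpp
// Sketch: scan saved hipInfo.exe output line by line and extract the
// "gcnArchName:" value (e.g. gfx1031) to build an --offload-arch flag.
#include <stdio.h>

int main(void) {
    char line[512], flag[64] = "";
    FILE *f = fopen("hipinfo.txt", "r");  // saved hipInfo.exe output
    if (!f) return 1;
    while (fgets(line, sizeof(line), f)) {
        char arch[32];
        if (sscanf(line, " gcnArchName: %31s", arch) == 1) {
            snprintf(flag, sizeof(flag), "--offload-arch=%s", arch);
            break;
        }
    }
    fclose(f);
    printf("%s\n", flag);  // prints "--offload-arch=gfx1031" on my saved output
    return 0;
}
```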
Could the problem come from the execution stream pipefds[0] not getting the correct output of the executed hipInfo.exe? For now I cannot test that in my compilation setup.
There was a hipInfo.exe issue in 0.8.6 that stopped the ggml-rocm.dll from being compiled (the hipInfo.exe output was not captured). This was fixed in https://github.com/Mozilla-Ocho/llamafile/commit/7d8dd1b33fd54e9e54d4ad8074f8df64e547b75d but no new release has been published since then.
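For reference, the general pattern for capturing the tool's output through a pipe looks roughly like this (a generic popen-style sketch, not the llamafile implementation):

```cpp
// Sketch: run hipInfo and read its stdout through a pipe, so the GPU arch
// can be parsed from the captured text instead of being lost.
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *p = popen("hipInfo", "r");  // read end of a pipe to the child's stdout
    if (!p) return 1;
    char line[512];
    while (fgets(line, sizeof(line), p)) {
        if (strstr(line, "gcnArchName:"))
            fputs(line, stdout);  // e.g. "gcnArchName: gfx1031"
    }
    pclose(p);
    return 0;
}
```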
@jart, I compiled a version of llamafile with Cosmopolitan v3.3.10 to see whether this version could now build its own ggml-rocm.dll via its embedded compilation mechanism.
For this I installed:
Removed:
and launched llamafile.exe -m ./llava-v1.5-7b-Q4_K.gguf -ngl 999.
After compilation, the PowerShell terminal started shaking and repeating:
/D/jerome/dev/llamafile.exe -m ./llava-v1.5-7b-Q4_K.gguf -ngl 999
error: Uncaught SIGSEGV (SEGV_ACCERR) at 0x7ffba90a4f60 on ordib pid 19964 tid 22268
/D/jerome/dev/llamafile.exe
No such file or directory
Windows Cosmopolitan 3.3.10 MODE=x86_64 ordib 0.0-0
RAX ebebfca9cd370000 RBX 00007000007d85c8 RDI 0000000000000000
RCX 00007000007d8588 RDX 00007000007d85d0 RSI 00007000007d8588
RBP 00007000007d8780 RSP 00007000007d8538 RIP 00007ffba90a4f60
R8 00007ffbc7bf471a R9 0000000000210150 R10 0000000000210150
R11 000000000295fbaa R12 0000000000000046 R13 000000000000004f
R14 00007000007d85d0 R15 00007ffaf6266cd0
TLS 00000000008efe40
XMM0 00000000000000000000000000000000 XMM8 00000000000000000000000000000000
XMM1 000000000000000d0000000000000000 XMM9 00000000000000000000000000000000
XMM2 000000000000000000000000000000f8 XMM10 00000000000000000000000000000000
XMM3 000000000295fba000007000007d8160 XMM11 00000000000000000000000000000000
XMM4 00007ffbc7c41d910000000243000142 XMM12 00000000000000000000000000000000
XMM5 000004000000040000000000000000d7 XMM13 00000000000000000000000000000000
XMM6 000000000000004d0000000000000060 XMM14 00000000000000000000000000000000
XMM7 00730072006500730055005c003a0043 XMM15 00000000000000000000000000000000
cosmoaddr2line /D/jerome/dev/llamafile.exe 7ffba90a4f60 7000007d8968
000000b38ef0 7ffba90a4f60 NULL+0
7000007d8780 7000007d8968 std::__1::basic_string<wchar_t, std::__1::char_traits<wchar_t>, std::__1::allocator<wchar_t>>::push_back(wchar_t)+72
<dangerous frame>
Now, if I remove the ggml-rocm.dll and ggml-rocm.dll.lib, download https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/rocm.bat, and then execute rocm.bat, it compiles ggml-rocm.dll and ggml-rocm.lib, and then llamafile.exe -m ./llava-v1.5-7b-Q4_K.gguf -ngl 999 works correctly using ggml-rocm.dll!
Note that this works both with the original rocm.bat and when I edit it to replace --offload-arch=gfx1010,gfx1012,gfx906,gfx1030,gfx1031,gfx1032,gfx1100,gfx1101,gfx1102,gfx1103 with --offload-arch=gfx1031, which is my arch.
In the rocm.bat version, the command is
"C:\Program Files\AMD\ROCm\5.7\\bin\clang++.exe" -fuse-ld=lld -shared -nostartfiles -nostdlib -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-ignored-attributes -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_HIPBLAS -DGGML_USE_TINYBLAS -DGGML_MINIMIZE_CODE_SIZE -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -isystem "C:\Program Files\AMD\ROCm\5.7\\include" -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=gnu++14 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -x hip --hip-link --offload-arch=gfx1031 -o ggml-rocm.dll ggml-cuda.cu "-lC:\Program Files\AMD\ROCm\5.7\\lib\amdhip64.lib" -lkernel32
while in the auto-compilation procedure the log shows
"/C/Program Files/AMD/ROCm/5.7//bin/clang++.exe" -O3 -shared -x hip --hip-link -std=gnu++14 -fuse-ld=lld -DGGML_USE_HIPBLAS -Wno-return-type -Wno-unused-result -Wno-unused-function -Wno-expansion-to-defined --offload-arch=gfx1031 -Wno-ignored-attributes -D_CRT_SECURE_NO_WARNINGS -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -o /C/Users/ordib/.llamafile/v/0.8.6/ggml-rocm.dll.r2q27d /C/Users/ordib/.llamafile/v/0.8.6/ggml-cuda.cu -Xclang --dependent-lib=msvcrt -mllvm -amdgpu-function-calls=false -mllvm -amdgpu-early-inline-all=true -isystem "/C/Program Files/AMD/ROCm/5.7//include" -l "/C/Program Files/AMD/ROCm/5.7//lib/hipblas.lib" -l "/C/Program Files/AMD/ROCm/5.7//lib/rocblas.lib" -l "/C/Program Files/AMD/ROCm/5.7//lib/amdhip64.lib" -lkernel32
The main difference seems to be that the auto-compilation procedure compiles with -DGGML_USE_CUBLAS while the rocm.bat procedure compiles with -DGGML_USE_TINYBLAS.
And indeed, if I run .\llamafile.exe --cli -m .\llava-v1.5-7b-Q4_K.gguf -ngl 9999 --tinyblas, it auto-compiles with tinyBLAS and it works.
Note that the --tinyblas option does not work in the server case (it needs the --cli option, as it does not seem to be applied in the server_cli code path).
Now, I don't know why GGML_USE_CUBLAS causes an issue while GGML_USE_TINYBLAS works.
Could it be because of my graphics card? Does the SIGSEGV (SEGV_ACCERR) message give you a clue about the root cause of this issue?
What is the difference between tinyBLAS and cuBLAS support, and do you think it can be solved, or is it a problem inside the proprietary AMD SDK?
If I am correct, in your case it may be a "bug" with rocBLAS... Can you try with ROCm v6.1 (if it exists on Windows; I only use Linux...)?
Indeed, there was a log message in the console that I initially did not see:
rocBLAS error: Cannot read C:\Program Files\AMD\ROCm\5.7\bin\/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1031
So it is indeed a problem with the rocBLAS support for gfx1031 on Windows.
It does not seem to be officially supported on Linux either, according to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html.
But I saw that there may be a way to make it work by recompiling rocBLAS from source, or by simply adding Tensile files and kernels pre-compiled for gfx1031 inside ../rocblas/library/ (cf. mentions in https://github.com/LostRuins/koboldcpp/issues/441).
I tried that. It makes the CUBLAS support work on my gfx1031, but there do not seem to be any performance gains compared to tinyBLAS (~50 tokens/sec in both cases). I was expecting CUBLAS to bring a significant performance boost, but that does not seem to be the case in my setup. I will have to dig further to understand whether this is to be expected or not.
Considered fixed after the release of 0.8.7
Hello,
On my computer, with an "AMD 6700 XT" graphics card, tinyBLAS is working with 0.8.4.
Now with 0.8.5 it says:
The file is present in /C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.dll when I look in the directory. In the 0.8.4 version the file is /C/Users/ordib/.llamafile/ggml-cuda.dll and it loads correctly, logging that tinyBLAS was set up.
Note: I tried both with -ngl 35 and -ngl 9999; I am not sure what the correct way is now for AMD/tinyBLAS support.
Tell me if you need more information to understand what the difference is between 0.8.4 and 0.8.5 on this issue.