Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

AMD GPU support via llama.cpp HIPBLAS #92

Closed jeromew closed 9 months ago

jeromew commented 9 months ago

Hello,

First of all, thank you for your work on llamafile; it seems like a great idea to simplify model usage.

It seems from the README that, at this stage, llamafile does not support AMD GPUs. The cuda.c in the llamafile backend seems dedicated to CUDA, while ggml-cuda.h in llama.cpp has a GGML_USE_HIPBLAS option for ROCm support. ROCm is now officially supported by llama.cpp, according to their README section on hipBLAS.

I understand that ROCm support was maybe not priority #1 for llamafile, but I was wondering if you had already tried the llama.cpp HIPBLAS option and have some insight into the work that would need to be done in llamafile in order to add this GPU family as a target.

From what I understand, llama.cpp would take care of the GPU side of things, and llamafile would need to be modified to JIT-compile llama.cpp with the correct flags, and it might need a specific toolchain for the compilation (at least the ROCm SDK).
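For reference, this is roughly how the llama.cpp README describes the hipBLAS build; I haven't verified the exact flags myself, and the gfx1030 target below is just an example value:

make LLAMA_HIPBLAS=1

or, with CMake and an explicit offload target:

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build --config Release

Presumably llamafile's JIT step would have to reproduce something equivalent with hipcc at runtime.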

Thanks for sharing your experience on this.

jart commented 9 months ago

I've never used an AMD GPU before, so I can't answer questions about them. However, we're happy to consider your request that llamafile support them. I'd encourage anyone else who wants this to leave a comment saying so. That'll help us gauge the interest level and determine what to focus on.

franalbani commented 9 months ago

I would also like this!

Thanks! Your work is mind-blowing.

mildwood commented 9 months ago

I'm also interested in using ROCm as I don't want to pay double for Nvidia!

jesserizzo commented 9 months ago

Agreed, I'd also love AMD support. In my opinion it also fits with the general mission of this project. Don't let big tech companies get a monopoly on LLMs, and also don't let Nvidia get a monopoly on AI computing. I dunno. Thanks for all the hard work, this is great.

stlhood commented 9 months ago

Thanks for the suggestions, folks. We're going to look into this.

Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically an RTX 4090 equivalent), but I want to make sure the model we pick is broadly representative in terms of hardware support.

lovenemesis commented 9 months ago

AMD officially supports ROCm on only one or two consumer-level GPUs, the RX 7900 XTX being one of them, and only on a limited set of Linux distributions.

However, by following the guide here on Fedora, I managed to get both an RX 7800 XT and the integrated GPU in a Ryzen 7840U running ROCm perfectly fine. Those are the mid-range and lower models of their RDNA3 lineup, so I think it's fair to say all RDNA3 cards would work.

Judging from other ROCm-related topics on PyTorch, it seems that people with RDNA2 (RX 6xxx) series cards are in the majority, probably due to the competitive pricing after the crypto-mining boom.

jammm commented 9 months ago

> Thanks for the suggestions, folks. We're going to look into this.
>
> Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically an RTX 4090 equivalent), but I want to make sure the model we pick is broadly representative in terms of hardware support.

Any RDNA3 card should work fine; it's just that "official" support has been limited to Navi 31-based GPUs. You can also compile and run llama.cpp just fine on Windows using the HIP SDK. My primary OS is Windows and I could get this port running myself, but I'm having difficulty setting up cosmos bash (getting cosmo++ permission-denied errors). If someone can help me set up cosmos bash properly, I should be able to get llamafile up and running on RDNA3 GPUs within a couple of hours or so.

Ideally a CMake-based pipeline would be best for Windows support, but I'd understand if the Makefile paradigm is what Cosmopolitan is built on.

jammm commented 9 months ago

So I managed to get llamafile compiling ggml-cuda.so using HIP but it fails at runtime:

building ggml-cuda with nvcc -arch=native...
/usr/bin/hipcc -march=native --shared -use_fast_math -fPIC -O3 -march=native -mtune=native -DNDEBUG -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_USE_HIPBLAS -o /home/rpr/.llamafile/ggml-cuda.so.zmifde /home/rpr/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/rpr/.llamafile/ggml-cuda.so
libamdhip64.so.5: cannot enable executable stack as shared object requires: Invalid argument: failed to load library
warning: GPU offload not supported on this platform; GPU related options will be ignored
warning: you might need to install xcode (macos) or cuda (windows, linux, etc.) check the output above to see why support wasn't linked

A normal dlopen() works fine with ggml-cuda.so. I'm not sure why cosmo_dlopen() fails here. CC @jart, any ideas?

Attaching strace log: strace_llamafile_hip.txt

Using Ubuntu 22.04.2 LTS.

EDIT: Looks like this error goes away after clearing the executable-stack requirement on the amdhip64 runtime library: sudo execstack -c /opt/rocm/lib/libamdhip64.so.5
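If anyone wants to check whether their copy of the library is affected before patching it, something like this should show the executable-stack flag (execstack comes from the prelink tools; readelf is part of binutils):

execstack -q /opt/rocm/lib/libamdhip64.so.5
readelf -lW /opt/rocm/lib/libamdhip64.so.5 | grep GNU_STACK

An "X" from execstack, or RWE permissions on the GNU_STACK segment, means the loader will be asked for an executable stack.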

EDIT 2: PR opened at https://github.com/Mozilla-Ocho/llamafile/pull/122. It works fine on my Navi 31 machine running Ubuntu 22.04. I expect it to work fine on Windows as well, though I haven't tested that yet.

franalbani commented 9 months ago

> Thanks for the suggestions, folks. We're going to look into this.
>
> Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically an RTX 4090 equivalent), but I want to make sure the model we pick is broadly representative in terms of hardware support.

I don't have enough experience to give advice, but I can contribute testing on a ThinkPad T495 with AMD Radeon Vega 10 graphics.

batfasturd commented 9 months ago

I would also like this feature added, since it is technically possible. I've used my 6750 XT successfully with llama.cpp on Linux.

github12101 commented 9 months ago

User with a Radeon 6800 XT here; I'd be more than happy to test things out in order to get Radeon GPU support. I am using Debian GNU/Linux.

jammm commented 9 months ago

@github12101 @batfasturd try this PR for linux? https://github.com/Mozilla-Ocho/llamafile/pull/122

I've compiled the binary. Let me attach it llamafile_hip_linux.zip

github12101 commented 9 months ago

> @github12101 @batfasturd try this PR for linux? #122
>
> I've compiled the binary. Let me attach it llamafile_hip_linux.zip

My apologies but I don't know how to install and launch this. I tried to run it, but it throws an error. Any guideline/tutorial would be great to have.

jammm commented 9 months ago

> > @github12101 @batfasturd try this PR for linux? #122 I've compiled the binary. Let me attach it llamafile_hip_linux.zip
>
> My apologies but I don't know how to install and launch this. I tried to run it, but it throws an error. Any guideline/tutorial would be great to have.

Can you share the error here? In order for this to work, you need to have ROCm installed on your Linux system. Is it installed?

jesserizzo commented 9 months ago

> @github12101 @batfasturd try this PR for linux? #122
>
> I've compiled the binary. Let me attach it llamafile_hip_linux.zip

Should this work on AMD 6000-series GPUs? Edit: I tried it on my 6600 and there are no errors, but it doesn't seem to be doing anything. I thought I read somewhere that it only works on 7000-series GPUs, but one of the project maintainers commented on PR #122 that they were going to test on a 6800, so now I'm confused.

stlhood commented 9 months ago

Just a quick update that we now have an RDNA2 card, and an RDNA3 card is on the way. @jart is actively working on adding this support!

jart commented 9 months ago

I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you about how cool it is. If I download and run the LLaVA llamafile, being certain to pass the -ngl 35 flag to turn GPU support on:

curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
.\llava-v1.5-7b-q4.llamafile -ngl 35 --nocompile

Then it takes ~2 seconds to start up (on a cold start!) before a tab with the web UI pops up in my browser. I can then upload a picture and ask LLaVA a question, and I get a response back in a few seconds. That's a good deal, since I'm doing it with a $300 AMD Radeon RX 6800 graphics card.


Here's the best part: support only depends on the graphics card driver. The HIP SDK is for developers, so I think it's nice that we won't need to ask users to install it in order to use llamafile. You can if you want to, in which case llamafile will compile a better GPU module that links hipBLAS (instead of tinyBLAS) the first time you run your llamafile. That will make inference go faster, although it takes about 30 seconds to run the clang++ command that comes with the ROCm SDK.
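For example, on a machine where the HIP SDK is installed, running the same llamafile without the --nocompile flag should trigger that one-time hipBLAS build (same model file as above; treat this as a sketch rather than official docs):

.\llava-v1.5-7b-q4.llamafile -ngl 35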

I'm going to have a Linux computer with the RDNA3 card @stlhood mentioned soon, probably by mid-month. We should be able to have excellent AMD support there too, although installing the AMD HIP tools will need to be a requirement, since the Linux and BSD platforms don't have the same kind of binary friendliness.

bennmann commented 9 months ago

> I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you about how cool it is. If I download and run the LLaVA llamafile, being certain to pass the -ngl 35 flag to turn GPU support on:
>
> curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
> .\llava-v1.5-7b-q4.llamafile -ngl 35 --nocompile

My 6900 XT on Windows "just worked", pulling 143 watts and producing a nice LLaVA 7B output. Thank you!!


{"timestamp":1704831422,"level":"INFO","function":"log_server_request","line":2741,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (153 tokens in cache)
slot 0 is processing [task id: 2]
slot 0 : in cache: 47 tokens | to process: 22 tokens
slot 0 : kv cache rm - [47, end)

print_timings: prompt eval time =      88.55 ms /    22 tokens (    4.02 ms per token,   248.46 tokens per second)
print_timings:        eval time =    7817.18 ms /   400 runs   (   19.54 ms per token,    51.17 tokens per second)
print_timings:       total time =    7905.73 ms
slot 0 released (470 tokens in cache)

jart commented 9 months ago

Happy to hear it @bennmann!

Also, there's more good news. I've just shipped llamafile v0.6, which adds support for AMD GPUs on Linux too. Unlike Windows, Linux users need to install the AMD ROCm SDK; there's no prebuilt binary. llamafile will build your GPU support the first time you run your llamafile, using the hipcc compiler. I've tested it with a Radeon RX 7900 XTX. The v0.6 release also adds support for multiple GPUs. I know for certain it works with NVIDIA. I have a second Radeon coming in the mail, so I'll be able to test that it works with AMD too.

With that said, I think this issue is satisfactorily solved. Please report any issues or suboptimal experiences you have. The only real area of improvement I know we need at the moment is that our tinyBLAS kernels don't go as fast on AMD as they do on NVIDIA, where we developed them. We'll be looking into that soon. Note that this only impacts Windows users who haven't installed the HIP ROCm SDK on their computers. Installing the SDK is what you want if your goal is maximum performance, since the hipBLAS library doesn't come with the video drivers that Windows installs.

Enjoy!

lovenemesis commented 9 months ago

I'm having trouble getting GPU support to work on Fedora 39 with the 0.6 release.

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llamafile-0.6 -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf 
initializing gpu module...
note: won't compile AMD GPU support because $HIP_PATH/bin/clang++ is missing
prebuilt binary /zip/ggml-rocm.so not found
prebuilt binary /zip/ggml-cuda.so not found
fatal error: --n-gpu-layers 35 was passed but no gpus were found

Meanwhile, I do have clang++ and hipcc available in $PATH.

sudo rpm -ql hipcc clang
/usr/bin/hipcc
/usr/bin/hipcc.pl
/usr/bin/hipconfig
/usr/bin/hipconfig.pl
/usr/share/perl5/vendor_perl/hipvars.pm
/usr/bin/clang
/usr/bin/clang++
/usr/bin/clang++-17
/usr/bin/clang-17
/usr/bin/clang-cl
/usr/bin/clang-cpp
/usr/lib/.build-id
/usr/lib/.build-id/32
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.1
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.2
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.3
/usr/share/licenses/clang
/usr/share/licenses/clang/LICENSE.TXT
/usr/share/man/man1/clang++-17.1.gz
/usr/share/man/man1/clang++.1.gz
/usr/share/man/man1/clang-17.1.gz
/usr/share/man/man1/clang.1.gz

Is there anything else I need?
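One thing I might try next, based on the "$HIP_PATH/bin/clang++ is missing" note above: point HIP_PATH at the distro prefix so that $HIP_PATH/bin/clang++ resolves to /usr/bin/clang++. This is just an untested guess on my part, not something from the docs:

HIP_PATH=/usr HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llamafile-0.6 -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf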

AwesomeApple12 commented 9 months ago


> I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you about how cool it is. If I download and run the LLaVA llamafile, being certain to pass the -ngl 35 flag to turn GPU support on:
>
> curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
> .\llava-v1.5-7b-q4.llamafile -ngl 35 --nocompile

Can confirm it's working really well with a 5700 XT on Windows 11.

jeromew commented 9 months ago

Can confirm it works really well with an AMD Radeon RX 6700 XT on Windows 11 (~36 tokens/sec versus ~5.6 tokens/sec on CPU only)! Thank you for landing AMD support!

Note that Windows complained about it containing a trojan:

Detected: Trojan:Win32/Sabsik.FL.A!ml File: D:\user\dev\llava-v1.5-7b-q4.llamafile

I checked that the SHA-256 was equal to the one declared on Hugging Face (9c37a9a8e3f067dea8c028db9525b399fc53b267667ed9c2a60155b1aa75) and went through with it, but that was a bit surprising. Am I the only one getting this warning?
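For anyone who wants to do the same check on Windows, PowerShell's built-in Get-FileHash cmdlet is enough to compute the digest to compare against the value published on Hugging Face:

Get-FileHash -Algorithm SHA256 .\llava-v1.5-7b-q4.llamafile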

The parameters are "-ngl 35 --nocompile" for the tinyBLAS solution, but what are the parameters if I install ROCm?

Amine-Smahi commented 8 months ago

Is there documentation somewhere to guide us through running llamafile on Ubuntu with an AMD GPU?

Dark-Thoughts commented 7 months ago

I tried to get my 6650 XT to work under Nobara (Fedora-based) by installing rocm-hip-sdk, and got this error after what I think was a failed build on first launch:

./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1032 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/*****/.llamafile/ggml-rocm.so.ikigfn /home/*****/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/*****/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q4_K(
    ^
/home/*****/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q5_K(
^
/home/*****/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q6_K(
    ^
/home/*****/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
                       ^
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
14 warnings generated when compiling for gfx1032.
/home/*****/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/*****/.llamafile/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with hipBLAS
link_cuda_dso: GPU support linked

rocBLAS error: Cannot read /opt/rocm-5.6.1/lib/rocblas/library/TensileLibrary.dat: No such file or directory
Aborted (core dumped)

Launching with the GPU again just gives me the last error now. No idea what else I'm missing or what I did wrong here, but it's certainly not an easy experience under Linux with AMD GPUs (previously it would just default to CPU mode, which is way too slow to be usable).
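For what it's worth, I've since seen the HSA_OVERRIDE_GFX_VERSION trick suggested in other ROCm projects for cards like this (gfx1032), since the distro rocBLAS packages often only ship kernels for gfx1030. I haven't verified it with llamafile, so treat it as a guess:

HSA_OVERRIDE_GFX_VERSION=10.3.0 ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 999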

nameiwillforget commented 7 months ago

I'm getting a very similar bug:

[alex@Arch wizard]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -ngl 9999 
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /opt/rocm/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1032 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/alex/.llamafile/ggml-rocm.so.r2whcc /home/alex/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/alex/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/alex/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/alex/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q4_K(
    ^
/home/alex/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q5_K(
^
/home/alex/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q6_K(
    ^
/home/alex/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
                       ^
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
14 warnings generated when compiling for gfx1032.
/home/alex/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/alex/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/alex/.llamafile/ggml-rocm.so
wizardcoder-python-34b-v1.0.Q5_K_M.llamafile: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
cosmoaddr2line /home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile 7fe973ea932c 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0

0x00007fe973ea932c: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
(the same unresolved frame repeated for the remainder of the backtrace)

10008004-10008006 rw-pa-       3x automap 192kB w/ 64kB hole
10008008-10008011 rw-pa-      10x automap 640kB w/ 14gB hole
10040060-10098eec r--s-- 364'173x automap 22gB w/ 96tB hole
6fd00004-6fd0000c rw-paF       9x zipos 576kB w/ 64gB hole
6fe00004-6fe00004 rw-paF       1x g_fds 64kB
# 22gB total mapped memory
/home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -m wizardcoder-python-34b-v1.0.Q5_K_M.gguf -c 0 -ngl 9999 
Aborted (core dumped)