GPU offloading doesn't seem to be working

v4u6h4n commented 6 months ago

Hey everyone, awesome project :-) I'm having fun playing around with it, but I think my GPU isn't being utilised. I can see my CPU maxing out while my GPU usage barely changes, so I'm wondering what the issue is. Here's the terminal output:

/media/storage/Software/AI/Meta-Llama-3-70B-Instruct.Q4_0.llamafile -ngl 9999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
get_rocm_bin_path: note: rocminfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/rocminfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/rocminfo does not exist
get_amd_offload_arch_flag: warning: can't find hipInfo/rocminfo commands for AMD GPU detection
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=native -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/v4u6h4n/.llamafile/ggml-rocm.so.dhsn3g /home/v4u6h4n/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
hipcc: Permission denied
extract_cuda_dso: note: prebuilt binary /zip/ggml-rocm.so not found
get_nvcc_path: note: nvcc not found on $PATH
get_nvcc_path: note: $CUDA_PATH/bin/nvcc does not exist
get_nvcc_path: note: /opt/cuda/bin/nvcc does not exist
get_nvcc_path: note: /usr/local/cuda/bin/nvcc does not exist
extract_cuda_dso: note: prebuilt binary /zip/ggml-cuda.so not found
{"function":"server_params_parse","level":"WARN","line":2384,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"8545344","timestamp":1714335027}
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2839,"msg":"build info","tid":"8545344","timestamp":1714335027}
{"function":"server_cli","level":"INFO","line":2842,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"8545344","timestamp":1714335027,"total_threads":32}
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from Meta-Llama-3-70B-Instruct.Q4_0.gguf (version GGUF V3 (latest))

...and my system specs:

OS: Arch Linux x86_64
Kernel: 6.8.7-arch1-2
CPU: AMD Ryzen 9 7950X3D (32) @ 5.759GHz
GPU: AMD ATI Radeon RX 7900 XT/7900 XTX/7900M
GPU: AMD ATI 13:00.0 Raphael
Memory: 14430MiB / 63427MiB
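
For anyone hitting the same log, here is a minimal sketch for checking the locations the notes above mention. The function name check_tool is hypothetical and not part of llamafile, and the search order ($PATH, then $HIP_PATH/bin, then /opt/rocm/bin) is an assumption read off the log messages, not taken from llamafile's source. A file that exists but is not executable is only one possible cause of "hipcc: Permission denied".

```shell
#!/bin/bash
# Hypothetical helper, not part of llamafile: report whether a GPU toolchain
# binary exists in the directories the log mentions and whether it is executable.

check_tool() {
  local tool="$1" dir
  local -a dirs
  # Split $PATH on ":" into an array, then append the fallback directories.
  IFS=: read -r -a dirs <<< "$PATH"
  if [ -n "${HIP_PATH:-}" ]; then
    dirs+=("$HIP_PATH/bin")
  fi
  dirs+=(/opt/rocm/bin)

  for dir in "${dirs[@]}"; do
    [ -n "$dir" ] || continue          # skip empty $PATH entries
    if [ -f "$dir/$tool" ]; then
      if [ -x "$dir/$tool" ]; then
        echo "ok: $dir/$tool"
        return 0
      fi
      echo "not executable: $dir/$tool"
      return 2
    fi
  done
  echo "missing: $tool"
  return 1
}
```

After sourcing the file, running check_tool hipcc, check_tool amdclang++, check_tool hipInfo and check_tool rocminfo covers the tools named in the log. It only reports the first match per tool, so a broken copy early in $PATH can hide a working one later on.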

ahonnecke commented 6 months ago

Same here, Radeon Pro W5700

llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.0

ahonnecke commented 6 months ago

relevant perhaps: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

v4u6h4n commented 6 months ago

Hey :-)

Did it fix anything for you?

ahonnecke commented 6 months ago

Doesn't seem to have, but I'm not sure that it installed properly.

fcrisciani commented 6 months ago

I was able to make it work by changing the base image of my container to FROM nvcr.io/nvidia/pytorch:24.03-py3

That base image is gigantic (~14.6 GB), so the best option would probably be a Docker multi-stage build that extracts only nvcc and its dependencies.

v4u6h4n commented 6 months ago

@fcrisciani Unfortunately I'm enough of an amateur Linux user that I don't know what that means lol, but happy you got it working ;-)

fcrisciani commented 6 months ago

I was referring to creating a docker image (https://docs.docker.com/engine/install/)

My Dockerfile looks like:

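# NVIDIA's PyTorch image, used here for the CUDA toolchain (nvcc) it brings along
FROM nvcr.io/nvidia/pytorch:24.03-py3

# wget is needed by start.sh to download the llamafile
RUN apt-get update && apt-get install -y wget

COPY start.sh /
RUN chmod +x /start.sh

CMD ["/start.sh"]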

The start.sh file looks like:

#!/bin/bash
set -e  # stop here if the download fails

echo "Download llamafile..."
# Quote the URL so the shell does not treat "?" as a glob character
wget "https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true" -O /tmp/llava-v1.5-7b-q4.llamafile

echo "Start serving the llamafile"
chmod +x /tmp/llava-v1.5-7b-q4.llamafile
/tmp/llava-v1.5-7b-q4.llamafile -ngl 999 --gpu nvidia --nobrowser --host 0.0.0.0

You can:
1) install Docker
2) create a folder with the 2 files above: Dockerfile and start.sh
3) build the container image: docker build -t my_gpu_test .
4) run it: docker run --rm -it --gpus=all my_gpu_test
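
The build and run steps above can be wrapped in one small script. This is only a sketch: the image tag my_gpu_test comes from the commands above, while the IMAGE_TAG and DRY_RUN variables and the function names run and build_and_run are hypothetical conveniences. Note that --gpus=all only works if the NVIDIA Container Toolkit is installed on the host.

```shell
#!/bin/bash
# Sketch of the build and run steps as one script.
# IMAGE_TAG, DRY_RUN, run and build_and_run are hypothetical names.

IMAGE_TAG="${IMAGE_TAG:-my_gpu_test}"

run() {
  # Print the command, and execute it unless DRY_RUN=1 is set.
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

build_and_run() {
  # $1 is the folder holding Dockerfile and start.sh (default: current folder).
  local context="${1:-.}"
  # Only start the container if the build succeeded.
  run docker build -t "$IMAGE_TAG" "$context" &&
    run docker run --rm -it --gpus=all "$IMAGE_TAG"
}
```

Calling DRY_RUN=1 build_and_run . prints the two docker commands without running them, which is a cheap way to check the script before committing to a multi-gigabyte image pull.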

s38b35M5 commented 5 months ago

@fcrisciani it looks like you may be suggesting a fix that works in your case with an NVIDIA GPU, but the OP's issue is with an AMD GPU. Considering that llamafile's use case is a single-file LLM that utilizes your GPU, wouldn't a Docker install be overkill for this problem, and would your fix even address the AMD side of things?

fcrisciani commented 5 months ago

Conceptually the solution is the same. My understanding is that for an NVIDIA GPU the dependency is nvcc, while for AMD it is hipcc. If you properly install all the dependencies on your machine, it should work without Docker. I used Docker just to create an image with all the dependencies baked in, so that I can move it between machines without installing them manually each time, but that's a user preference.

nPHYN1T3 commented 1 month ago

I'm also seeing no GPU offload. When launching I saw it mention you need to pass -ngl 9999 then in the docs/git page it says -ngl 999 (Which is it? I've tried both, with no difference.)

I went digging and tried --gpu nvidia -ngl 999 which now gives me

import_cuda_impl: initializing gpu module...
get_nvcc_path: note: nvcc.exe not found on $PATH
get_nvcc_path: note: /opt/cuda/bin/nvcc.exe does not exist
link_cuda_dso: note: dynamically linking /C/users/user/.llamafile/v/0.8.5/ggml-cuda.dll
link_cuda_dso: warning: library not found: failed to load library
fatal error: support for --gpu nvidia was explicitly requested, but it wasn't available

WAY up in the output

{"function":"server_params_parse","level":"WARN","line":2424,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"11820704","timestamp":1727726156}
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
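
Since the relevant warning is easy to miss that far up, here is a small sketch for pulling the GPU-related lines out of a saved log. The function name gpu_log_summary is hypothetical; the patterns are copied from the messages quoted in this thread and are not an exhaustive list of what llamafile can print.

```shell
#!/bin/bash
# Hypothetical helper: print (with line numbers) the log lines from this thread
# that indicate GPU offload was not set up. Patterns come from the quoted logs.

gpu_log_summary() {
  local log="$1"
  grep -n -E \
    -e 'Not compiled with GPU offload support' \
    -e 'explicitly requested, but it wasn.t available' \
    -e 'library not found' \
    -e ': Permission denied' \
    -e 'prebuilt binary .* not found' \
    "$log"
}
```

To use it, save the output first (for example by appending 2>&1 | tee llamafile.log to the launch command) and then run gpu_log_summary llamafile.log. The function returns grep's exit status, so it is non-zero when nothing matched; no match only means none of these particular messages appeared, not that offloading is confirmed to work.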

v4u6h4n commented 1 month ago

@nPHYN1T3

I had to focus attention on another project, so I didn't get to follow this up myself, but if you do, you could look into dependencies, which may not be readily documented by this project, that might fix the issue. If I get time this month and give it a try before you do I'll post here again.

nPHYN1T3 commented 1 month ago

I didn't see anything about dependencies, but CUDA and anything else it might need should already be installed, as I run ollama on bare metal. The docs are rather disjointed. Several times I saw messages telling me to consult the README.md, which didn't make sense since there isn't one when you grab a single "containerized" file.

Mozilla-Ocho / llamafile

GPU offloading doesn't seem to be working #384