Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

AMD gfx1103 laptop GPU returning `HIPBLAS_STATUS_UNKNOWN` #188

Closed: lovenemesis closed this issue 4 months ago

lovenemesis commented 8 months ago

First, great work on getting AMD GPU support on Windows into such good shape within such a short period. Really appreciate your work!

However, once I switched to Fedora 39 on the same Ryzen 7840U laptop with a Radeon 780M, things became a bit puzzling.

At first, it complained about not finding clang++ and hipcc:

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llamafile-0.6 -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf 
initializing gpu module...
note: won't compile AMD GPU support because $HIP_PATH/bin/clang++ is missing
prebuilt binary /zip/ggml-rocm.so not found
prebuilt binary /zip/ggml-cuda.so not found
fatal error: --n-gpu-layers 35 was passed but no gpus were found

However, I do have clang++ and hipcc available on $PATH:

sudo rpm -ql hipcc clang
/usr/bin/hipcc
/usr/bin/hipcc.pl
/usr/bin/hipconfig
/usr/bin/hipconfig.pl
/usr/share/perl5/vendor_perl/hipvars.pm
/usr/bin/clang
/usr/bin/clang++
/usr/bin/clang++-17
/usr/bin/clang-17
/usr/bin/clang-cl
/usr/bin/clang-cpp
/usr/lib/.build-id
/usr/lib/.build-id/32
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.1
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.2
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.3
/usr/share/licenses/clang
/usr/share/licenses/clang/LICENSE.TXT
/usr/share/man/man1/clang++-17.1.gz
/usr/share/man/man1/clang++.1.gz
/usr/share/man/man1/clang-17.1.gz
/usr/share/man/man1/clang.1.gz

Then I figured it might need a bit of manual help, so I added an environment variable:

HIP_PATH=/usr ./llamafile-0.6 -ngl 35 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf 
initializing gpu module...
extracting /zip/llama.cpp/ggml.h to /home/tommy/.llamafile/ggml.h
extracting /zip/llamafile/compcap.cu to /home/tommy/.llamafile/compcap.cu
extracting /zip/llamafile/llamafile.h to /home/tommy/.llamafile/llamafile.h
extracting /zip/llamafile/tinyblas.h to /home/tommy/.llamafile/tinyblas.h
extracting /zip/llamafile/tinyblas.cu to /home/tommy/.llamafile/tinyblas.cu
extracting /zip/llama.cpp/ggml-impl.h to /home/tommy/.llamafile/ggml-impl.h
extracting /zip/llama.cpp/ggml-cuda.h to /home/tommy/.llamafile/ggml-cuda.h
extracting /zip/llama.cpp/ggml-alloc.h to /home/tommy/.llamafile/ggml-alloc.h
extracting /zip/llama.cpp/ggml-backend.h to /home/tommy/.llamafile/ggml-backend.h
extracting /zip/llama.cpp/ggml-backend-impl.h to /home/tommy/.llamafile/ggml-backend-impl.h
extracting /zip/llama.cpp/ggml-cuda.cu to /home/tommy/.llamafile/ggml-cuda.cu
/usr/bin/rocminfo
hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1103 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/tommy/.llamafile/ggml-rocm.so.rzs98e /home/tommy/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/tommy/.llamafile/ggml-cuda.cu:4595:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4595 |     mul_mat_q4_K(
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:4595:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:4662:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4662 | mul_mat_q5_K(
      | ^
/home/tommy/.llamafile/ggml-cuda.cu:4662:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:4731:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4731 |     mul_mat_q6_K(
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:4731:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
6 warnings generated when compiling for gfx1103.
dynamically linking /home/tommy/.llamafile/ggml-rocm.so
GPU support successfully linked and loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm/CUDA devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
{"timestamp":1704891476,"level":"INFO","function":"server_cli","line":2812,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1704891476,"level":"INFO","function":"server_cli","line":2815,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm/CUDA for GPU acceleration
llm_load_tensors: system memory used  =   70.42 MiB
llm_load_tensors: VRAM used           = 4095.05 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 4232.06 MiB (model: 4095.05 MiB, context: 137.00 MiB)
CUDA error: unknown error
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at /home/tommy/.llamafile/ggml-cuda.cu:7784
  hipblasGemmEx(g_cublas_handles[id], HIPBLAS_OP_T, HIPBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, HIPBLAS_R_16F, ne00, src1_ptr, HIPBLAS_R_16F, ne10, &beta_f16, dst_f16.get(), HIPBLAS_R_16F, ldc, HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: /home/tommy/.llamafile/ggml-cuda.cu:386: !"CUDA error"

This time it compiled, but eventually failed with a CUDA error.

May I know what additional steps I should take to get it working?

Thanks,

Originally posted by @lovenemesis in https://github.com/Mozilla-Ocho/llamafile/issues/92#issuecomment-1884106689

jart commented 8 months ago

Thanks for filing a new issue.

I noticed two things that are orthogonal to your issue but, thanks to your report, could be improved. I've done so in 15e23397ec07daade647125cabb32f40bef02bf9 and 67d97b520278232f88b852e8e6d61847ce0728b8.

Could you build llamafile at HEAD and try again? You should get the same error, but it'll give us something better to go on than "unknown error".

git clone https://github.com/Mozilla-Ocho/llamafile
cd llamafile
make -j16
o//llama.cpp/main/main -ngl 35 --gpu amd -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -p hello

Thanks!

lovenemesis commented 8 months ago

Here's the output. I chose to use the Mistral model instead, considering only 16 GB of VRAM is allocated to the GPU on this 7840U laptop:

❯ o//llama.cpp/main/main -ngl 35 --gpu amd -m ../llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf -p hello
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
link_cuda_dso: note: dynamically linking /home/tommy/.llamafile/ggml-rocm.so
import_cuda: GPU support successfully linked and loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm/CUDA devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
Log start
main: llamafile version 0.6.0
main: seed  = 1704949319
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm/CUDA for GPU acceleration
llm_load_tensors: system memory used  =   70.42 MiB
llm_load_tensors: VRAM used           = 4095.05 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 4232.06 MiB (model: 4095.05 MiB, context: 137.00 MiB)
CUDA error: HIPBLAS_STATUS_UNKNOWN
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at /home/tommy/.llamafile/ggml-cuda.cu:8266
  hipblasGemmEx(g_cublas_handles[id], HIPBLAS_OP_T, HIPBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, HIPBLAS_R_16F, ne00, src1_ptr, HIPBLAS_R_16F, ne10, &beta_f16, dst_f16.get(), HIPBLAS_R_16F, ldc, HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: /home/tommy/.llamafile/ggml-cuda.cu:390: !"CUDA error"

Since it seems to be related to hipBLAS, below are my installed packages:

sudo rpm -ql hipblas hipblas-devel
/usr/lib/.build-id
/usr/lib/.build-id/19
/usr/lib/.build-id/19/aead64ad45551a081d5ece41628341839d0f22
/usr/lib/.build-id/19/aead64ad45551a081d5ece41628341839d0f22.1
/usr/lib/.build-id/90
/usr/lib/.build-id/90/5c38f38c531a0a65299c003f29c7706f5da7d9
/usr/lib/.build-id/cc
/usr/lib/.build-id/cc/621395ca064e1036285a414b67b19e908bebde
/usr/lib/.build-id/cc/621395ca064e1036285a414b67b19e908bebde.1
/usr/lib64/cmake/hipblas
/usr/lib64/libhipblas.so.1
/usr/lib64/libhipblas.so.1.1.0
/usr/lib64/rocm/gfx10/lib/libhipblas.so.1
/usr/lib64/rocm/gfx10/lib/libhipblas.so.1.1.0
/usr/lib64/rocm/gfx11/lib/libhipblas.so.1
/usr/lib64/rocm/gfx11/lib/libhipblas.so.1.1.0
/usr/lib64/rocm/gfx8/lib/libhipblas.so.1
/usr/lib64/rocm/gfx8/lib/libhipblas.so.1.1.0
/usr/lib64/rocm/gfx9/lib/libhipblas.so.1
/usr/lib64/rocm/gfx9/lib/libhipblas.so.1.1.0
/usr/share/licenses/hipblas
/usr/share/licenses/hipblas/LICENSE.md
/usr/include/hipblas
/usr/include/hipblas/hipblas-export.h
/usr/include/hipblas/hipblas-version.h
/usr/include/hipblas/hipblas.h
/usr/include/hipblas/hipblas_module.f90
/usr/lib64/cmake/hipblas
/usr/lib64/cmake/hipblas/hipblas-config-version.cmake
/usr/lib64/cmake/hipblas/hipblas-config.cmake
/usr/lib64/cmake/hipblas/hipblas-targets-release.cmake
/usr/lib64/cmake/hipblas/hipblas-targets.cmake
/usr/lib64/libhipblas.so
/usr/lib64/rocm/gfx10/lib/cmake/hipblas
/usr/lib64/rocm/gfx10/lib/cmake/hipblas/hipblas-config-version.cmake
/usr/lib64/rocm/gfx10/lib/cmake/hipblas/hipblas-config.cmake
/usr/lib64/rocm/gfx10/lib/cmake/hipblas/hipblas-targets-release.cmake
/usr/lib64/rocm/gfx10/lib/cmake/hipblas/hipblas-targets.cmake
/usr/lib64/rocm/gfx10/lib/libhipblas.so
/usr/lib64/rocm/gfx11/lib/cmake/hipblas
/usr/lib64/rocm/gfx11/lib/cmake/hipblas/hipblas-config-version.cmake
/usr/lib64/rocm/gfx11/lib/cmake/hipblas/hipblas-config.cmake
/usr/lib64/rocm/gfx11/lib/cmake/hipblas/hipblas-targets-release.cmake
/usr/lib64/rocm/gfx11/lib/cmake/hipblas/hipblas-targets.cmake
/usr/lib64/rocm/gfx11/lib/libhipblas.so
/usr/lib64/rocm/gfx8/lib/cmake/hipblas
/usr/lib64/rocm/gfx8/lib/cmake/hipblas/hipblas-config-version.cmake
/usr/lib64/rocm/gfx8/lib/cmake/hipblas/hipblas-config.cmake
/usr/lib64/rocm/gfx8/lib/cmake/hipblas/hipblas-targets-release.cmake
/usr/lib64/rocm/gfx8/lib/cmake/hipblas/hipblas-targets.cmake
/usr/lib64/rocm/gfx8/lib/libhipblas.so
/usr/lib64/rocm/gfx9/lib/cmake/hipblas
/usr/lib64/rocm/gfx9/lib/cmake/hipblas/hipblas-config-version.cmake
/usr/lib64/rocm/gfx9/lib/cmake/hipblas/hipblas-config.cmake
/usr/lib64/rocm/gfx9/lib/cmake/hipblas/hipblas-targets-release.cmake
/usr/lib64/rocm/gfx9/lib/cmake/hipblas/hipblas-targets.cmake
/usr/lib64/rocm/gfx9/lib/libhipblas.so
/usr/share/doc/hipblas-devel
/usr/share/doc/hipblas-devel/README.md

Let me know if there's anything else I can provide to help with the diagnosis.

Thanks,

jart commented 8 months ago

Earlier, @cgmb said in https://github.com/Mozilla-Ocho/llamafile/pull/122#issuecomment-1875855445 that your gfx1103 laptop GPU "will only work for programs that limit themselves to a compatible subset of the ISA". In that case, I'd ask @cgmb whether hipBLAS returning HIPBLAS_STATUS_UNKNOWN is indicative of us not using a compatible subset.

Also, @lovenemesis, I noticed you passed HSA_OVERRIDE_GFX_VERSION=11.0.0. Did you find the earlier comment I referenced to figure that out? Should we be specifying that flag automatically? Or should we simply ignore gfx1103 GPUs from now on?
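
For the sake of discussion, a rough sketch of what specifying it automatically could look like; the rocminfo parsing and the gfx1103-to-11.0.0 mapping here are assumptions of mine, not anything llamafile actually does today:

# hypothetical: detect an RDNA3 iGPU target and map it to a supported override
gfx="$(/usr/bin/rocminfo | grep -o 'gfx110[0-9]' | head -n1)"
if [ "$gfx" = "gfx1103" ]; then
  export HSA_OVERRIDE_GFX_VERSION=11.0.0
fi
./llamafile-0.6 -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf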

lovenemesis commented 8 months ago

Whether I set HSA_OVERRIDE_GFX_VERSION=11.0.0 or not, I got the same error:

CUDA error: HIPBLAS_STATUS_UNKNOWN
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at /home/tommy/.llamafile/ggml-cuda.cu:8266
  hipblasGemmEx(g_cublas_handles[id], HIPBLAS_OP_T, HIPBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, HIPBLAS_R_16F, ne00, src1_ptr, HIPBLAS_R_16F, ne10, &beta_f16, dst_f16.get(), HIPBLAS_R_16F, ldc, HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: /home/tommy/.llamafile/ggml-cuda.cu:390: !"CUDA error"

If I reboot into Win11 x64 on the same laptop, I can offload to the very same gfx1103 GPU with the llamafile 0.6 release without any tweaking. Hence, I assume the GPU itself is capable of handling the workload. It works well with PyTorch + ROCm, too.
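
For reference, the PyTorch + ROCm check amounts to something like the following one-liner (a sketch, not copied from my logs; it assumes a ROCm build of PyTorch is installed):

HSA_OVERRIDE_GFX_VERSION=11.0.0 python3 -c 'import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))'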

cgmb commented 8 months ago

Earlier, @cgmb said in #122 (comment) that your gfx1103 laptop GPU "will only work for programs that limit themselves to a compatible subset of the ISA". In that case, I'd ask @cgmb whether hipBLAS returning HIPBLAS_STATUS_UNKNOWN is indicative of us not using a compatible subset.

I don't know. When you run machine code that contains instructions that don't exist (or don't mean the same thing as the compiler expected) on the processor you're running it on... you've entered the land of undefined behavior. It's plausible that could be your problem, but it could be something else entirely. As far as I know, AMD does not test ROCm on gfx1103 hardware.

Also, @lovenemesis, I noticed you passed HSA_OVERRIDE_GFX_VERSION=11.0.0. Did you find the earlier comment I referenced to figure that out? Should we be specifying that flag automatically? Or should we simply ignore gfx1103 GPUs from now on?

rocBLAS/hipBLAS was not designed to support gfx1103, regardless of whether HSA_OVERRIDE_GFX_VERSION is specified or not. That's true of many architectures and there are often simple ways to make things work regardless, but gfx1103 doesn't seem to be one of them.

In the future, I'm hoping that the 'generic' ISAs proposed for introduction into LLVM will avoid the need for ugly hacks like HSA_OVERRIDE_GFX_VERSION. With that said, there might still be issues besides the ISA compatibility that limit the usability of ROCm on APUs like the Radeon 780M.

lovenemesis commented 8 months ago

If it helps, I could run the program again on a desktop equipped with an RX 7800 XT (gfx1101) to see if things are different.

@jart

Correct me if I'm wrong: the reason llamafile 0.6 can offload to gfx1103 on Win11 x64 is that it uses tinyBLAS rather than rocBLAS/hipBLAS. Right?

Thanks,

lovenemesis commented 8 months ago

Below is the result of running on the RX 7800 XT (gfx1101) under Fedora 39. A different model is used, but that shouldn't matter in this case. I had to specify HIP_VISIBLE_DEVICES=0 since this machine has an iGPU (Ryzen 5 5700G), though it is disabled in the BIOS. The error message remained much the same; setting HSA_OVERRIDE_GFX_VERSION or not made no difference.

HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=11.0.1 ./main -ngl 35 --gpu amd -m /mnt/LINDATA/LLM/llamafile/chinese-alpaca-2-7b.gguf -p hello
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
link_cuda_dso: note: dynamically linking /home/tommy/.llamafile/ggml-rocm.so
import_cuda: GPU support successfully linked and loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm/CUDA devices:
  Device 0: AMD Radeon RX 7800 XT, compute capability 11.0, VMM: no
Log start
main: llamafile version 0.6.0
main: seed  = 1705320567
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /mnt/LINDATA/LLM/llamafile/chinese-alpaca-2-7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,55296]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,55296]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,55296]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 55296
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 6.93 B
llm_load_print_meta: model size       = 12.91 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm/CUDA for GPU acceleration
llm_load_tensors: system memory used  =  432.11 MiB
llm_load_tensors: VRAM used           = 12785.02 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 119.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 116.00 MiB
llama_new_context_with_model: total VRAM used: 13157.02 MiB (model: 12785.02 MiB, context: 372.00 MiB)
CUDA error: HIPBLAS_STATUS_UNKNOWN
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at /home/tommy/.llamafile/ggml-cuda.cu:8266
  hipblasGemmEx(g_cublas_handles[id], HIPBLAS_OP_T, HIPBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, HIPBLAS_R_16F, ne00, src1_ptr, HIPBLAS_R_16F, ne10, &beta_f16, dst_f16.get(), HIPBLAS_R_16F, ldc, HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: /home/tommy/.llamafile/ggml-cuda.cu:390: !"CUDA error"

Though the 7800 XT isn't officially listed, it should share more similarity with the 7900 XT/XTX than with the 780M.

Note that I'm using the Fedora-packaged ROCm 5.7.1 as described here. Perhaps I jumped in too early, since it's only officially ready in Fedora 40, but if this turns out to be a packaging-related issue, I'm happy to bring it to the attention of the package maintainer.
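
For the record, here is roughly how I would gather the packaging details for the maintainer (the package-name globs are guesses on my part):

# list the Fedora ROCm/HIP packages in play and the GPU target rocminfo reports
rpm -qa 'rocm*' 'hip*' 'rocblas*' | sort
/usr/bin/rocminfo | grep -i gfx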

Please let me know if there's additional information I can provide to help narrow down the cause.

jart commented 8 months ago

Correct me if I'm wrong: the reason llamafile 0.6 can offload to gfx1103 on Win11 x64 is that it uses tinyBLAS rather than rocBLAS/hipBLAS. Right?

That would be interesting. Could you try passing the --recompile --tinyblas flags on Linux with your gfx1103 and tell me if that works? If so, tell me which model you're using and how many tokens per second you're getting.

lovenemesis commented 8 months ago

That would be interesting. Could you try passing the --recompile --tinyblas flags on Linux with your gfx1103 and tell me if that works? If so, tell me which model you're using and how many tokens per second you're getting.

This one works on Linux!

./o/llama.cpp/main/main -ngl 35 --recompile --tinyblas -m ../llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf -p hello
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1103 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_USE_TINYBLAS -o /home/tommy/.llamafile/ggml-rocm.so.ximg77 /home/tommy/.llamafile/ggml-cuda.cu -DIGNORE5 -DIGNORE6
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 5712 | static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
      |                        ^
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
6 warnings generated when compiling for gfx1103.
link_cuda_dso: note: dynamically linking /home/tommy/.llamafile/ggml-rocm.so
import_cuda: GPU support successfully linked and loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm/CUDA devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
Log start
main: llamafile version 0.6.0
main: seed  = 1705365565
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm/CUDA for GPU acceleration
llm_load_tensors: system memory used  =   70.42 MiB
llm_load_tensors: VRAM used           = 4095.05 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 4232.06 MiB (model: 4095.05 MiB, context: 137.00 MiB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 hello,
I am new to this forum and I want to start a discussion on the topic of what do you think is the most beautiful animal in nature. I am not looking for an answer but I would like to hear your opinions.
I personally believe that the peacock is one of the most beautiful animals in nature. Not only does it have stunning feathers, but its tail feathers are also very vibrant and colorful. They are also incredibly soft and delicate to touch. Additionally, the peacock's eyes are so large and deep that they give off an almost hypnotic effect. Overall, I think that the peacock is a truly magnificent animal that deserves all the admiration it gets.
What about you? What animal do you think is the most beautiful in nature? [end of text]

llama_print_timings:        load time =    4869.80 ms
llama_print_timings:      sample time =      30.13 ms /   165 runs   (    0.18 ms per token,  5477.00 tokens per second)
llama_print_timings: prompt eval time =    1844.29 ms /     3 tokens (  614.76 ms per token,     1.63 tokens per second)
llama_print_timings:        eval time =   12033.66 ms /   164 runs   (   73.38 ms per token,    13.63 tokens per second)
llama_print_timings:       total time =   13942.88 ms
Log end

Eval time is much faster, about 2.3x that of CPU-only mode.
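
For comparison, the CPU-only baseline comes from running the same model with offloading disabled, roughly like this (that run isn't shown above):

./o/llama.cpp/main/main -ngl 0 -m ../llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf -p hello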

lovenemesis commented 8 months ago

Following a similar method, it works on Fedora 39 with the RX 7800 XT, too:

HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=11.0.1 ./main -ngl 35 --recompile --tinyblas -m /mnt/LINDATA/LLM/llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf -p hello
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1101 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_USE_TINYBLAS -o /home/tommy/.llamafile/ggml-rocm.so.pht5sx /home/tommy/.llamafile/ggml-cuda.cu -DIGNORE5 -DIGNORE6
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 5712 | static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
      |                        ^
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:5712:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
6 warnings generated when compiling for gfx1101.
link_cuda_dso: note: dynamically linking /home/tommy/.llamafile/ggml-rocm.so
import_cuda: GPU support successfully linked and loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm/CUDA devices:
  Device 0: AMD Radeon RX 7800 XT, compute capability 11.0, VMM: no
Log start
main: llamafile version 0.6.0
main: seed  = 1705367652
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /mnt/LINDATA/LLM/llamafile/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm/CUDA for GPU acceleration
llm_load_tensors: system memory used  =   70.42 MiB
llm_load_tensors: VRAM used           = 4095.05 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 4232.06 MiB (model: 4095.05 MiB, context: 137.00 MiB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 hello there

my problem is that I am trying to create a custom action for the "Solve" button in the "Easy" tab of the "New Problem" screen.  I want the user to be able to select which type of equation they are working on, and then for the Solver to automatically solve the equations based on the selected type.  This is a simple concept, but I can't find any resources online that explain how to do this.

Can anyone provide me with some guidance on how to accomplish this?  Thank you! [end of text]

llama_print_timings:        load time =    1581.11 ms
llama_print_timings:      sample time =      14.44 ms /   116 runs   (    0.12 ms per token,  8034.91 tokens per second)
llama_print_timings: prompt eval time =     584.05 ms /     3 tokens (  194.68 ms per token,     5.14 tokens per second)
llama_print_timings:        eval time =    1940.87 ms /   115 runs   (   16.88 ms per token,    59.25 tokens per second)
llama_print_timings:       total time =    2550.87 ms
Log end

Eval time is about 8x that of CPU-only mode (the 5700G is weaker than the 7840U in terms of CPU performance).

hiepxanh commented 8 months ago

I get the same issue on AMD: the first run reports the error, but subsequent runs don't.

The log files are below.

[screenshot]

Attachments: bug-short.md, not-bug.short.md

Anyway, you saved my life. Great job, @jart and your team. What you did is so amazing. I prefer this one over koboldcpp, ollama, llama.cpp, exllama, PowerInfer, candle, nitro... Oh my god, so much.

❤️ ❤️ ❤️ ❤️ ❤️ I give you 1000 ❤️ for this project

jart commented 8 months ago

Anyway, you saved my life. Great job, @jart and your team. What you did is so amazing. I prefer this one over koboldcpp, ollama, llama.cpp, exllama, PowerInfer, candle, nitro... Oh my god, so much. ❤️ ❤️ ❤️ ❤️ ❤️ I give you 1000 ❤️ for this project

Happy to hear it! cc team: @stlhood @mrdomino @ahgamut @jammm

I get the same issue on AMD: the first run reports the error, but subsequent runs don't.

It looks like everything is working as intended on the left side of your screenshot. tinyBLAS isn't as fast as rocBLAS yet, so we still try to build you a native rocBLAS library if you have ROCm installed, and then fall back to the builtin tinyBLAS DSO when ROCm can't be found. I'm not sure why it would fall back to CPU inference for your first run, though. I'll try uninstalling ROCm on my machine later to recreate that failure path.
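
To make the two code paths explicit, here's a rough sketch using the flags discussed in this thread (the model path is just a placeholder):

# force a fresh build of the native module against the system ROCm (rocBLAS/hipBLAS path)
./llamafile -ngl 35 --recompile -m model.gguf -p hello
# or force the tinyBLAS path, whose compile line earlier in this thread drops -lhipblas -lrocblas entirely
./llamafile -ngl 35 --recompile --tinyblas -m model.gguf -p hello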

lovenemesis commented 7 months ago

@jart

In #92, you mentioned a positive result while using an RX 6800 on Windows. May I know if this card also works in your setup under Linux, assuming ROCm was installed via AMD packages on the officially supported distro (Ubuntu 22.04)?

If your setup works while mine (ROCm SDK repackaged for Fedora) and the one in #214 (ROCm SDK repackaged for Arch) don't, this could be a packaging issue specific to those distros. I'm happy to make the distro packagers aware.

Thanks a lot for all the work!

hiepxanh commented 7 months ago

@lovenemesis, you should read about this HIP issue. The notes are for Windows, but you can take a look; they have an AMD device that deployed successfully on Fedora:

https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1665757362 https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1687254961 https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1700502221

jart commented 7 months ago

In #92, you mentioned a positive result while using an RX 6800 on Windows. May I know if this card also works in your setup under Linux, assuming ROCm was installed via AMD packages on the officially supported distro (Ubuntu 22.04)?

My latest Linux computer has AMD Radeon RX 7900 XTX and runs Debian 12. Note that while Debian isn't listed in AMD's list of supported OSes, Ubuntu 22.04 is listed and Debian 12 is what it's based off, so I'd assume Debian is supported by the transitive property. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html I haven't tried putting my AMD Radeon RX 6800 in my Linux computer. I would assume it would work fine if I did. I used the official ROCm installer.

Could you confirm that 67d97b520278232f88b852e8e6d61847ce0728b8 earlier was successful in automating the compilation for your Fedora environment? I'd assume that's all you needed. It hasn't been rolled out into a release yet.

lovenemesis commented 7 months ago

@jart

Yes, https://github.com/Mozilla-Ocho/llamafile/commit/67d97b520278232f88b852e8e6d61847ce0728b8 (#188) helps.

Besides this, it would be great to allow a fallback path to the built-in tinyBLAS kernels if hipBLAS/rocBLAS doesn't work under Linux, similar to Windows. Hope I'm not asking for too much.

I haven't tried putting my AMD Radeon RX 6800 in my Linux computer. I would assume it would work fine if I did. I used the official ROCm installer.

My thinking behind this ask is that if the unlisted RX 6800 works with the official ROCm installer on Linux, then the issue with my unlisted RX 7800 XT may lie on the Fedora packaging side. In that case, I could use a hint on which parts need a closer look.

At the beginning of my output, a few messages like these are printed:

get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist

Was amdclang++, an AMD-optimized variant(?), used to compile the kernel in your setup? In mine, I'm pretty sure the standard clang 17.0.6 release in $PATH was used. Not sure if this is causing the trouble...

Meanwhile, I will try to install Debian 12 or Ubuntu 22.04 on an external hard drive to test the official ROCm installer.

jart commented 7 months ago

I don't think we should color too far outside the lines AMD has drawn.

AMD Radeon RX 6800 is the lowest card in AMD's support vector for Windows for HIP SDK development https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html and llamafile is able to support a lot more cards than that on Windows because we ship prebuilt binaries that only depend on the graphics driver. AMD makes computers like this for Linux users:

[screenshot]

But they'll let you squeak in with a Radeon RX 7900 XTX: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html llamafile won't ship prebuilt GPU dynamic shared objects to Linux users, so we depend on the HIP SDK being available. Therefore, any hardware beyond that which we somehow manage to support is both a hack and a gift.

hiepxanh commented 7 months ago

@jart I see some weird behavior with tinyBLAS, but I want to make sure it's not a bug. Can you take a look at this issue: https://github.com/ggerganov/llama.cpp/issues/3969#issuecomment-1897731810 I did a detailed debug in that comment, but feel free to guide me if you want more detail.

jart commented 7 months ago

@hiepxanh Have you tried setting the max_tokens parameter?

hiepxanh commented 7 months ago

@jart it doesn't seem to work, but I also see that the issue only affects a few models; phi-2 works perfectly. I believe it may be a model quantization issue, since NVIDIA cards have the same problem. Anyway, your tinyBLAS is amazing; it's even faster than CLBlast on the two AMD devices I tested. I would like to open a feature request on llama.cpp to officially support tinyBLAS. Do you mind?

cgmb commented 7 months ago

My thinking behind this ask is that if the unlisted RX 6800 works with the official ROCm installer on Linux, then the issue with my unlisted RX 7800 XT may lie on the Fedora packaging side. In that case, I could use a hint on which parts need a closer look.

The Fedora packagers might be interested to know if you're having trouble with rocBLAS on the RX 7800 XT. IIRC, that was one of the GPUs they were looking into using for their local workstations.

My latest Linux computer has AMD Radeon RX 7900 XTX and runs Debian 12. Note that while Debian isn't listed in AMD's list of supported OSes, Ubuntu 22.04 is listed and Debian 12 is what it's based off, so I'd assume Debian is supported by the transitive property. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html I haven't tried putting my AMD Radeon RX 6800 in my Linux computer. I would assume it would work fine if I did. I used the official ROCm installer.

A little over a year ago, AMD donated four RX 6800 GPUs to the Debian AI team to help them with packaging. I didn't get rocBLAS packaged for Debian in time for Bookworm, but the RX 6800 is perhaps the best-tested AMD GPU for the ROCm packages on Trixie.

I am currently trying to migrate Debian's ROCm packages from LLVM 15 to LLVM 17 so we can enable RDNA 3 support on Trixie. Once that's done, I think we'll start looking at backports. I expect that you'll be able to install librocblas-dev enabled for all discrete RDNA 3 GPUs from bookworm-backports later this year.

hiepxanh commented 7 months ago

@cgmb I see you're working at AMD, which is really cool. I have a small thought; I know it's a bit silly. As I understand it, AMD currently lacks resources, so they've decided to focus on enterprise, which is fine. But for AMD users and fans, it would be great if they could test models more easily. I'm thinking tinyBLAS could be some kind of officially documented fallback when ROCm isn't available, since ROCm only shows up on the flagship cards. You can't invent anything on a cheap card, but it would be a lot of inspiration for kids, or for people like me who feel like they can do some kind of AI thing on their beloved card at home. Maybe I can show my kids or my parents or something? It's not about performance; it's about the feeling of being able to do AI on an AMD card. I believe tinyBLAS can also be a lot faster than the CPU. Finally, it would make it easier for AMD to market it as "we have a new feature that lets AI run on any AMD card instead of just the flagships" (but we don't guarantee it runs fast or correctly LOL 😄), which would bring in a huge fan base. I also see a lot of ordinary users with weak cards doing amazing things... This would help a lot.

jart commented 7 months ago

Even if it were possible to distribute prebuilt tinyBLAS .so files that worked on multiple Linux distros, we simply don't have room. There's 4.5 MB of space left in the LLaVA llamafile, which is 3.995557952672243 gibibytes. We'd need at least another 23 MB to distribute .so files for AMD and NVIDIA.

ajbouh commented 7 months ago

Now may not be the time for it (and this likely isn't the thread for it), but on-GPU decompression is a thing that games have cared about for a while.

It's possible we can benefit from those same routines. We might be able to compress the weights during preprocessing and then decompress them at load time.

This would of course complicate a number of existing code paths, but there may come a time when it's worth the trade.

jart commented 7 months ago

@ajbouh tinyBLAS carves weight tensors up into 2D blocks. @ahgamut, what is the bottleneck on GPU right now: computation or memory bandwidth? If it's the latter, would it be possible to read variable-length 2D blocks, Huffman-decode them, run-length-decode them, apply a zigzag transform, then an inverse discrete cosine transform, and finally q8-dequantize?

ahgamut commented 7 months ago

We'd have to profile to be sure. GemmEx is most likely bottlenecked on compute, because of some large matrices -- I'm not sure how we can better parallelize when k is very large.

ajbouh commented 7 months ago

Loading times are almost certainly bottlenecked by bandwidth, aren't they? There's also the challenge of fitting below the 4 GB limit that Windows users are facing.

It would be a cool thing to reuse the same texture compression techniques that game developers have refined to push the boundaries of what's possible in consumer AI hardware.

jart commented 7 months ago

If we could just gzip the byte stream I'd do it, but the entropy is too high for weights. OTOH using texture compression or other forms of image compression would require changing the gguf file format I think. That's not worth it for optimizing loading time IMHO.

What I'd like to see happen is for GPUs to support mmap() where the kernel can track read-only pages that map to a file on disk. That would turn loading into a one time cost when memory is free. Right now the closest thing I've seen to being able to do that is devices like Jetson that have unified memory. Then I can just say:

  // map the weights file read-only, sharing pages with the kernel page cache
  void *map = mmap(0, N*sizeof(int), PROT_READ, MAP_SHARED, fd, 0);
  // register the mapping with the CUDA runtime so the GPU can access it in place
  CUDA_OR_DIE(cudaHostRegister(map, N*sizeof(int), CU_MEMHOSTREGISTER_READ_ONLY));
  // obtain a device pointer aliasing the same memory (unified-memory parts like Jetson)
  CUDA_OR_DIE(cudaHostGetDevicePointer((void **)&x, map, 0));

And it loads almost as fast as mmap(). I want it because I'd rather have ephemeral processes with a smart kernel, than needing to run my own daemon that I talk to over HTTP/JSON.

ajbouh commented 7 months ago

At the risk of crossing the streams, it seems that Nvidia claims to have a variety of standard compression algorithms that run on GPU including snappy, zstd, and lz4: https://developer.nvidia.com/nvcomp

lovenemesis commented 7 months ago

Even if it were possible to distribute prebuilt tinyBLAS .so files that worked on multiple Linux distros, we simply don't have room. There's 4.5 MB of space left in the LLaVA llamafile, which is 3.995557952672243 gibibytes. We'd need at least another 23 MB to distribute .so files for AMD and NVIDIA.

Understood the space limitation here. I guess that's also why the following fails with the release file:

./llamafile-0.6.1 -ngl 35 --recompile --tinyblas -m mistral-7b-instruct-v0.1.Q4_K_M.gguf
error: unknown argument: --tinyblas
usage: ./llamafile-0.6.1 [options]

I honestly didn't know AMD's supported GPU list differs so much between Windows and Linux. Since the 6800 is listed while the 7800 XT isn't, I won't ask for too much.

As always, this is THE most accessible way to try out LLMs, and I already recommend it to all my friends who want to learn. Thanks a lot!

hiepxanh commented 7 months ago

As I remember, the 7800 should work with ROCm 6.0; 5.7 doesn't support it yet.

lovenemesis commented 7 months ago

As I remember, the 7800 should work with ROCm 6.0; 5.7 doesn't support it yet.

Hmm, that's a potential reason, though it works just fine with PyTorch + ROCm 5.7. Nevertheless, I will try again on Fedora 40, where ROCm 6.0 is packaged.
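
As a quick sanity check (a sketch of my own, assuming the HIP runtime headers from the ROCm packages are installed), a few lines of HIP built with hipcc will print which gfx target the runtime actually reports for each device:

  #include <hip/hip_runtime.h>
  #include <cstdio>

  int main() {
      int n = 0;
      if (hipGetDeviceCount(&n) != hipSuccess || n == 0) {
          std::puts("no HIP devices visible");
          return 1;
      }
      for (int i = 0; i < n; ++i) {
          hipDeviceProp_t prop{};
          hipGetDeviceProperties(&prop, i);
          // gcnArchName is the gfx target, e.g. gfx1103 for the 780M
          std::printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
      }
      return 0;
  }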

lovenemesis commented 4 months ago

After upgrading to the latest ROCm on Fedora 40 as well as llamafile 0.8.4, things finally appear to be working on this 7840U with its 780M (gfx1103):

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./Phi-3-mini-4k-instruct.F16.llamafile -ngl 999 --nocompile
import_cuda_impl: initializing gpu module...
extracting /zip/llama.cpp/ggml.h to /home/tommy/.llamafile/ggml.h
extracting /zip/llamafile/compcap.cu to /home/tommy/.llamafile/compcap.cu
extracting /zip/llamafile/llamafile.h to /home/tommy/.llamafile/llamafile.h
extracting /zip/llamafile/tinyblas.h to /home/tommy/.llamafile/tinyblas.h
extracting /zip/llamafile/tinyblas.cu to /home/tommy/.llamafile/tinyblas.cu
extracting /zip/llama.cpp/ggml-impl.h to /home/tommy/.llamafile/ggml-impl.h
extracting /zip/llama.cpp/ggml-cuda.h to /home/tommy/.llamafile/ggml-cuda.h
extracting /zip/llama.cpp/ggml-alloc.h to /home/tommy/.llamafile/ggml-alloc.h
extracting /zip/llama.cpp/ggml-common.h to /home/tommy/.llamafile/ggml-common.h
extracting /zip/llama.cpp/ggml-backend.h to /home/tommy/.llamafile/ggml-backend.h
extracting /zip/llama.cpp/ggml-backend-impl.h to /home/tommy/.llamafile/ggml-backend-impl.h
extracting /zip/llama.cpp/ggml-cuda.cu to /home/tommy/.llamafile/ggml-cuda.cu
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1100 -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/tommy/.llamafile/ggml-rocm.so.t79wnb /home/tommy/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/tommy/.llamafile/ggml-cuda.cu:418:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  418 | }
      | ^
/home/tommy/.llamafile/ggml-cuda.cu:501:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
  501 | #if FP16_MMA_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:500:31: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                               ^
/home/tommy/.llamafile/ggml-cuda.cu:501:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:500:60: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                                                            ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
  505 | #if defined(GGML_MINIMIZE_CODE_SIZE) && FP16_AVAILABLE // [jart]
      |                                         ^
/home/tommy/.llamafile/ggml-cuda.cu:498:25: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:498:54: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                                                      ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:25: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:43: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                           ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:61: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                                             ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
 5081 | #if FP16_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:498:25: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:498:54: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                                                      ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:25: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:43: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                           ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:61: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                                             ^
/home/tommy/.llamafile/ggml-cuda.cu:5260:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
 5260 | #if FP16_MMA_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:500:31: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                               ^
/home/tommy/.llamafile/ggml-cuda.cu:5260:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:500:60: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                                                            ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
 5618 | #if FP16_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:498:25: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:498:54: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                                                      ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:25: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:43: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                           ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:61: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                                             ^
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 9679 | static __global__ void soft_max_f32(const float * x, const T * mask, const T * pos, float * dst, const int ncols_par, const int nrows_y, const float scale, const float max_bias, const float m0, const float m1, uint32_t n_head_log2) {
      |                        ^
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/tommy/.llamafile/ggml-cuda.cu:9679:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
32 warnings generated when compiling for gfx1100.
/home/tommy/.llamafile/ggml-cuda.cu:418:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  418 | }
      | ^
/home/tommy/.llamafile/ggml-cuda.cu:501:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
  501 | #if FP16_MMA_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:500:31: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                               ^
/home/tommy/.llamafile/ggml-cuda.cu:501:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:500:60: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                                                            ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
  505 | #if defined(GGML_MINIMIZE_CODE_SIZE) && FP16_AVAILABLE // [jart]
      |                                         ^
/home/tommy/.llamafile/ggml-cuda.cu:498:25: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:498:54: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                                                      ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:25: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:43: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                           ^
/home/tommy/.llamafile/ggml-cuda.cu:505:41: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:61: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                                             ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
 5081 | #if FP16_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:498:25: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:498:54: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                                                      ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:25: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:43: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                           ^
/home/tommy/.llamafile/ggml-cuda.cu:5081:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:61: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                                             ^
/home/tommy/.llamafile/ggml-cuda.cu:5260:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
 5260 | #if FP16_MMA_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:500:31: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                               ^
/home/tommy/.llamafile/ggml-cuda.cu:5260:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:500:60: note: expanded from macro 'FP16_MMA_AVAILABLE'
  500 | #define FP16_MMA_AVAILABLE (!(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA)
      |                                                            ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
 5618 | #if FP16_AVAILABLE
      |     ^
/home/tommy/.llamafile/ggml-cuda.cu:498:25: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:498:54: note: expanded from macro 'FP16_AVAILABLE'
  498 | #define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) ? \
      |                                                      ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:25: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                         ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:43: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                           ^
/home/tommy/.llamafile/ggml-cuda.cu:5618:5: warning: macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]
/home/tommy/.llamafile/ggml-cuda.cu:499:61: note: expanded from macro 'FP16_AVAILABLE'
  499 |                         defined(RDNA1) || defined(RDNA2) || defined(RDNA3) : __CUDA_ARCH__ >= CC_PASCAL)
      |                                                             ^
20 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/tommy/.llamafile/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with hipBLAS
link_cuda_dso: GPU support loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715576522}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715576522,"total_threads":16}
llama_model_loader: loaded meta data with 23 key-value pairs and 195 tensors from Phi-3-mini-4k-instruct.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32064]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  130 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 7.12 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  7100.64 MiB
llm_load_tensors:        CPU buffer size =   187.88 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.13 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =    83.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     7.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":489,"msg":"initializing slots","n_slots":1,"tid":"11165056","timestamp":1715576523}
{"function":"initialize","level":"INFO","line":498,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"11165056","timestamp":1715576523}
{"function":"server_cli","level":"INFO","line":3077,"msg":"model loaded","tid":"11165056","timestamp":1715576523}

llama server listening at http://127.0.0.1:8080

opening browser tab... (pass --nobrowser to disable)
{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3200,"msg":"HTTP server listening","port":"8080","tid":"11165056","timestamp":1715576523}
{"function":"update_slots","level":"INFO","line":1659,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"11165056","timestamp":1715576523}
{"function":"log_server_request","level":"INFO","line":2781,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"127.0.0.1","remote_port":34342,"status":200,"tid":"17594337061360","timestamp":1715576523}
{"function":"log_server_request","level":"INFO","line":2781,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"127.0.0.1","remote_port":34356,"status":200,"tid":"17594337070528","timestamp":1715576523}
{"function":"log_server_request","level":"INFO","line":2781,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"127.0.0.1","remote_port":34342,"status":200,"tid":"17594337061360","timestamp":1715576523}
{"function":"log_server_request","level":"INFO","line":2781,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"127.0.0.1","remote_port":34342,"status":200,"tid":"17594337061360","timestamp":1715576523}
{"function":"log_server_request","level":"INFO","line":2781,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"127.0.0.1","remote_port":34374,"status":200,"tid":"17594337081936","timestamp":1715576523}

Thanks to @jart and the team for their effort to bundle AMD GPU support on Linux via tinyBLAS.

Djip007 commented 3 months ago

For me, using gfx1101 on gfx1103 (Ryzen 7940HX) is faster than gfx1100 (with rocBLAS/hipBLAS...). I made a PR, https://github.com/ggerganov/llama.cpp/pull/7414, to speed things up when using UMA memory. I now need to make a PR here to allow using it with llamafile (adding an arg like --use-hip-uma?).
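
For context, the gist of that UMA path is to allocate the GPU buffers from HIP managed memory on APUs, so the iGPU can work directly out of system RAM instead of the small dedicated carve-out. Below is a rough sketch of the allocation switch, assuming hipMallocManaged (my own illustration, not the PR's actual code; the use_uma flag plays the role of the proposed --use-hip-uma argument):

  #include <hip/hip_runtime.h>
  #include <cstdio>

  // Allocate a device-visible buffer. With use_uma, managed memory lets an APU
  // use system RAM directly; otherwise fall back to a plain device allocation.
  static void *alloc_gpu_buffer(size_t size, bool use_uma) {
      void *ptr = nullptr;
      hipError_t err = use_uma
          ? hipMallocManaged(&ptr, size, hipMemAttachGlobal)
          : hipMalloc(&ptr, size);
      if (err != hipSuccess) {
          std::fprintf(stderr, "allocation failed: %s\n", hipGetErrorString(err));
          return nullptr;
      }
      return ptr;
  }

  int main() {
      void *buf = alloc_gpu_buffer(64 << 20, /*use_uma=*/true);  // 64 MiB
      if (buf) {
          std::puts("allocated 64 MiB of managed memory");
          hipFree(buf);
      }
      return 0;
  }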