Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: GPU acceleration works for one user but not another on the same Linux machine #609

Open lovenemesis opened 2 weeks ago

lovenemesis commented 2 weeks ago

What happened?

On the same Fedora 41 machine with an AMD 7800 XT, llamafile is able to leverage GPU acceleration for one user, but falls back to CPU inference when switching to another user. The llamafile binary, launch script, and model weights are identical between the two users.

The script I use to launch llamafile is fairly straightforward:

#!/bin/bash

GGUF_FOLDER=/mnt/LINDATA/LLM/GGUF

# Bail out if the weights folder is missing.
if ! [ -d "$GGUF_FOLDER" ]; then
    echo "GGUF folder does not exist"
    exit 1
else
    cd "$GGUF_FOLDER" || exit 1
    # Have ROCm treat the RX 7800 XT as gfx1100 and offload all layers.
    HSA_OVERRIDE_GFX_VERSION=11.0.0 /usr/local/bin/llamafile -m qwen2.5-14b-instruct-q5_k_m.gguf --server --nobrowser --log-disable -ngl 999 --nocompile
fi

However, calling the same script as user1 activates the GPU, while user2 always falls back to CPU. I tried deleting the .llamafile directory from user2's home directory, but it doesn't appear to fix anything.
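
For reference, this is the cleanup I ran as user2 (a minimal sketch; the path matches the per-user cache location shown in the log output below):

# Remove user2's extracted GPU modules so llamafile re-extracts them
# on the next launch.
rm -rf /home/user2/.llamafile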

I always fully log out of user1 before logging in as user2 to avoid any potential device lock, and both users are members of the video group.

I'm a bit clueless about which configuration difference between the two users could cause this. Any help is appreciated!
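
Here are a few comparisons that might narrow it down (a sketch; the device paths assume the usual ROCm setup where the GPU is exposed through /dev/kfd and the /dev/dri render nodes):

# Group membership and GPU device node permissions for both users.
id -nG user1
id -nG user2
ls -l /dev/kfd /dev/dri/renderD*

# PATH as seen from each user's login shell; the user2 log below
# complains that ccache cannot find "cc" on PATH.
sudo -iu user1 sh -c 'echo "$PATH"; command -v cc'
sudo -iu user2 sh -c 'echo "$PATH"; command -v cc'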

Version

llamafile 0.8.15

What operating system are you seeing the problem on?

Linux

Relevant log output

$ sudo id -nG user1
user1 wheel video

$ ./llamafile-qwen2.sh
{"function":"server_params_parse","level":"INFO","line":2724,"msg":"logging to file is disabled.","tid":"11804224","timestamp":1730675653}
import_cuda_impl: initializing gpu module...
link_cuda_dso: note: dynamically linking /home/user1/.llamafile/v/0.8.15/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with tinyBLAS
link_cuda_dso: GPU support loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2898,"msg":"build info","tid":"11804224","timestamp":1730675653}
{"function":"server_cli","level":"INFO","line":2905,"msg":"system info","n_threads":8,"n_threads_batch":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11804224","timestamp":1730675653,"total_threads":16}
llama_model_loader: loaded meta data with 29 key-value pairs and 579 tensors from qwen2.5-14b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-14b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-14b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 15B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 579
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 14.77 B
llm_load_print_meta: model size       = 9.78 GiB (5.69 BPW) 
llm_load_print_meta: general.name     = qwen2.5-14b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.60 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloaded 48/49 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  8896.78 MiB
llm_load_tensors:        CPU buffer size = 10016.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   916.08 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 1686
llama_new_context_with_model: graph splits = 4
warming up the model with an empty run
{"function":"initialize","level":"INFO","line":502,"msg":"initializing slots","n_slots":1,"tid":"11804224","timestamp":1730675655}
{"function":"initialize","level":"INFO","line":514,"msg":"new slot","n_ctx_slot":8192,"slot_id":0,"tid":"11804224","timestamp":1730675655}
{"function":"server_cli","level":"INFO","line":3118,"msg":"model loaded","tid":"11804224","timestamp":1730675655}

llama server listening at http://127.0.0.1:8080

{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3256,"msg":"HTTP server listening","port":"8080","tid":"11804224","timestamp":1730675655,"url_prefix":""}
{"function":"update_slots","level":"INFO","line":1672,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"11804224","timestamp":1730675655}

$ sudo id -nG user2
user2 video

$ ./llamafile-qwen2.sh
{"function":"server_params_parse","level":"INFO","line":2724,"msg":"logging to file is disabled.","tid":"11804224","timestamp":1730675291}
import_cuda_impl: initializing gpu module...
link_cuda_dso: note: dynamically linking /home/user2/.llamafile/v/0.8.15/ggml-rocm.so
ccache: error: Could not find compiler "cc" in PATH
link_cuda_dso: warning: dlopen() isn't supported on this platform: failed to load library
get_nvcc_path: note: nvcc not found on $PATH
get_nvcc_path: note: $CUDA_PATH/bin/nvcc does not exist
get_nvcc_path: note: /opt/cuda/bin/nvcc does not exist
get_nvcc_path: note: /usr/local/cuda/bin/nvcc does not exist
link_cuda_dso: note: dynamically linking /home/user2/.llamafile/v/0.8.15/ggml-cuda.so
link_cuda_dso: warning: dlopen() isn't supported on this platform: failed to load library
warning: --n-gpu-layers 999 was passed but no GPUs were found; falling back to CPU inference
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2898,"msg":"build info","tid":"11804224","timestamp":1730675291}
{"function":"server_cli","level":"INFO","line":2905,"msg":"system info","n_threads":8,"n_threads_batch":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11804224","timestamp":1730675291,"total_threads":16}
llama_model_loader: loaded meta data with 29 key-value pairs and 579 tensors from qwen2.5-14b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-14b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-14b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 15B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 579
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 14.77 B
llm_load_print_meta: model size       = 9.78 GiB (5.69 BPW) 
llm_load_print_meta: general.name     = qwen2.5-14b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors:        CPU buffer size = 10016.35 MiB
...........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   696.01 MiB
llama_new_context_with_model: graph nodes  = 1686
llama_new_context_with_model: graph splits = 1
warming up the model with an empty run
{"function":"initialize","level":"INFO","line":502,"msg":"initializing slots","n_slots":1,"tid":"11804224","timestamp":1730675299}
{"function":"initialize","level":"INFO","line":514,"msg":"new slot","n_ctx_slot":8192,"slot_id":0,"tid":"11804224","timestamp":1730675299}
{"function":"server_cli","level":"INFO","line":3118,"msg":"model loaded","tid":"11804224","timestamp":1730675299}

llama server listening at http://127.0.0.1:8080

In the sandboxing block!
{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3256,"msg":"HTTP server listening","port":"8080","tid":"11804224","timestamp":1730675299,"url_prefix":""}
{"function":"update_slots","level":"INFO","line":1672,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"11804224","timestamp":1730675299}