ggerganov / llama.cpp

LLM inference in C/C++
MIT License

vulkan backend failed to load models vk::Device::createComputePipeline: ErrorUnknown #6843

Open qtyandhasee opened 2 months ago

qtyandhasee commented 2 months ago

I am trying to cross-compile llama.cpp on an x86 platform and run it on an Android device (Adreno 740). On the Android device, Vulkan recognizes my GPU, but loading a model fails with the following error:

llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/data/local/tmp/stories260K.gguf'
main: error: unable to load model

I have checked the model path to make sure the model exists there and is readable. I also tried the following models:

llama-2-13b-chat.Q2_K.gguf
llama-2-13b-chat.Q5_K_S.gguf
llama-2-7b-chat.Q2_K.gguf
stories260K.gguf

The way I build llama.cpp:

cmake .. \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_NATIVE_API_LEVEL=33 \
    -DLLAMA_VULKAN=1 \
    -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod+i8mm \
    -DVulkan_INCLUDE_DIR=/home/smc/Downloads/Vulkan-Hpp-1.3.237 \
    -DLLAMA_VULKAN_CHECK_RESULTS=1 \
    -DLLAMA_VULKAN_DEBUG=1 \
    -DLLAMA_VULKAN_VALIDATE=1 \
    -DLLAMA_VULKAN_RUN_TESTS=1

make -j10

The way I run main: transfer the bin folder to the /data/local/tmp/llama directory on the Android device using scp, then:

./bin/main -t 8 -m /data/local/tmp/stories260K.gguf --color -c 2048 -ngl 2 --temp 0.7 -n 128 -p "One day, Lily met"

uname -a: Linux localhost 5.15.78-android13-8-g60893c660740-dirty #1 SMP PREEMPT Fri Jul 7 18:13:57 UTC 2023 aarch64 Toybox

GPU info: Adreno (TM) 740

What can I do to solve this problem? Any suggestions? Thank you very much; I look forward to your reply.

qtyandhasee commented 2 months ago

Detailed information:

:/data/local/tmp/llama-vulkan-test # ls
bin
:/data/local/tmp/llama-vulkan-test # chmod +x ./*
tmp/stories260K.gguf --color -c 2048 -ngl 2 --temp 0.7 -n 128 -p "One day, Lily met" <
Log start
main: build = 3 (de46a4b)
main: built with Android (11349228, +pgo, +bolt, +lto, -mlgo, based on r487747e) clang version 17.0.2 (https://android.googlesource.com/toolchain/llvm-project d9f89f4d16663d5012e5c09495f3b30ece3d2362) for x86_64-unknown-linux-gnu
main: seed = 1713873819
llama_model_loader: loaded meta data with 19 key-value pairs and 48 tensors from /data/local/tmp/stories260K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: tokenizer.ggml.tokens arr[str,512] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 1: tokenizer.ggml.scores arr[f32,512] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 2: tokenizer.ggml.token_type arr[i32,512] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 3: tokenizer.ggml.model str = llama
llama_model_loader: - kv 4: general.architecture str = llama
llama_model_loader: - kv 5: general.name str = llama
llama_model_loader: - kv 6: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 7: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 8: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 9: tokenizer.ggml.seperator_token_id u32 = 4294967295
llama_model_loader: - kv 10: tokenizer.ggml.padding_token_id u32 = 4294967295
llama_model_loader: - kv 11: llama.context_length u32 = 128
llama_model_loader: - kv 12: llama.embedding_length u32 = 64
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 172
llama_model_loader: - kv 14: llama.attention.head_count u32 = 8
llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 16: llama.block_count u32 = 5
llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 8
llama_model_loader: - kv 18: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - type f32: 48 tensors
llm_load_vocab: bad special token: 'tokenizer.ggml.seperator_token_id' = 4294967295d, using default id -1
llm_load_vocab: bad special token: 'tokenizer.ggml.padding_token_id' = 4294967295d, using default id -1
llm_load_vocab: special tokens definition check successful ( 259/512 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 512
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 128
llm_load_print_meta: n_embd = 64
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 5
llm_load_print_meta: n_rot = 8
llm_load_print_meta: n_embd_head_k = 8
llm_load_print_meta: n_embd_head_v = 8
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 32
llm_load_print_meta: n_embd_v_gqa = 32
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 172
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 128
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32 (guessed)
llm_load_print_meta: model params = 292.80 K
llm_load_print_meta: model size = 1.12 MiB (32.00 BPW)
llm_load_print_meta: general.name = llama
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_vk_instance_init()
ggml_vulkan: WARNING: Instance extension VK_KHR_portability_enumeration not found.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(0)
Vulkan0: Adreno (TM) 740 | uma: 1 | fp16: 1 | warp size: 64
ggml_backend_vk_buffer_type(0)
ggml_backend_vk_init(0)
ggml_vk_init(, 0)
ggml_vk_get_device(0)
Initializing new vk_device
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_create_queue()
ggml_vk_load_shaders()
ggml_vk_create_pipeline(matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_f16_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q4_0_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q4_0_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q4_0_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q4_0_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q4_0_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q4_0_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q4_0_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q4_0_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q4_0_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q4_0_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q4_0_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q4_0_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q5_0_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q5_0_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q5_0_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q5_0_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q5_0_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q5_0_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q5_1_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q5_1_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q5_1_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q5_1_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q5_1_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q5_1_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q8_0_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q8_0_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q8_0_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q8_0_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q8_0_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q8_0_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q2_k_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q2_k_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q2_k_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q2_k_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q2_k_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q2_k_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q3_k_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q3_k_f32_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q3_k_f32_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q3_k_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_q3_k_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_q3_k_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_q4_k_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/data/local/tmp/stories260K.gguf'
main: error: unable to load model
destroy device Adreno (TM) 740
ggml_pipeline_destroy_pipeline(matmul_f32_l)
ggml_pipeline_destroy_pipeline(matmul_f32_m)
ggml_pipeline_destroy_pipeline(matmul_f32_s)
ggml_pipeline_destroy_pipeline(matmul_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_f16_l)
ggml_pipeline_destroy_pipeline(matmul_f16_m)
ggml_pipeline_destroy_pipeline(matmul_f16_s)
ggml_pipeline_destroy_pipeline(matmul_f16_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_f16_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_f16_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_f16_f32_l)
ggml_pipeline_destroy_pipeline(matmul_f16_f32_m)
ggml_pipeline_destroy_pipeline(matmul_f16_f32_s)
ggml_pipeline_destroy_pipeline(matmul_f16_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_f16_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_f16_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q4_0_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q5_0_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q5_0_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q5_0_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q5_0_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q5_0_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q5_0_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q5_1_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q5_1_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q5_1_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q5_1_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q5_1_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q5_1_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q8_0_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q8_0_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q8_0_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q8_0_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q8_0_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q8_0_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q2_k_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q2_k_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q2_k_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q2_k_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q2_k_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q2_k_f32_aligned_s)
ggml_pipeline_destroy_pipeline(matmul_q3_k_f32_l)
ggml_pipeline_destroy_pipeline(matmul_q3_k_f32_m)
ggml_pipeline_destroy_pipeline(matmul_q3_k_f32_s)
ggml_pipeline_destroy_pipeline(matmul_q3_k_f32_aligned_l)
ggml_pipeline_destroy_pipeline(matmul_q3_k_f32_aligned_m)
ggml_pipeline_destroy_pipeline(matmul_q3_k_f32_aligned_s)

smilingOrange commented 2 months ago

Did you try to build the llama.android example app?

qtyandhasee commented 2 months ago

> Did you try to build the llama.android example app?

@smilingOrange Not really. I'm not using Termux or Android Studio; I cross-compile llama.cpp with the NDK and then transfer the resulting binaries to my Android device (Qualcomm GPU) via scp, and I've verified this workflow works: when I cross-compile with BLAS instead of the Vulkan backend, I can just barely get Q2-quantized large models running. But since I'm new to Vulkan, I'm not sure why a build with the Vulkan backend can recognize the GPU in my device yet can't load the model. I'm curious how to solve this problem; feel free to let me know if you have any ideas.
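(Aside: Vulkan device enumeration and compute-shader compilation exercise very different driver paths, which is why detection can succeed while pipeline creation later fails. A standalone probe that performs only the enumeration step, the part that works here, might look like the sketch below; this is hypothetical illustration code, not part of llama.cpp.)

#include <vulkan/vulkan.hpp>
#include <cstdio>

int main() {
    // Creating an instance and listing devices only queries driver
    // metadata; no shaders are compiled at this point.
    vk::ApplicationInfo app_info("vk-probe", 1, nullptr, 0, VK_API_VERSION_1_1);
    vk::InstanceCreateInfo create_info({}, &app_info);
    vk::Instance instance = vk::createInstance(create_info);
    for (const auto &dev : instance.enumeratePhysicalDevices()) {
        // On the device in question this prints "Adreno (TM) 740".
        std::printf("found: %s\n", dev.getProperties().deviceName.data());
    }
    instance.destroy();
    return 0;
}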

hans00 commented 1 month ago

From my debugging, the compute shaders for Q4_K and Q5_K are unsupported on Qualcomm Adreno; without these, it will work.

For more info, the failed shaders are:

matmul_q4_k_f32_l
matmul_q4_k_f32_m
matmul_q4_k_f32_s
matmul_q4_k_f32_aligned_l
matmul_q4_k_f32_aligned_m
matmul_q4_k_f32_aligned_s
matmul_q5_k_f32_l
matmul_q5_k_f32_m
matmul_q5_k_f32_s
matmul_q5_k_f32_aligned_l
matmul_q5_k_f32_aligned_m
matmul_q5_k_f32_aligned_s
dequant_q4_K
dequant_q5_K
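Since ErrorUnknown surfaces as a C++ exception from vulkan.hpp, one way to obtain a per-shader failure list like the one above is to wrap each pipeline build in a try/catch and log the name instead of aborting the load. A minimal sketch, assuming a valid vk::Device and a filled-in vk::ComputePipelineCreateInfo; the helper and its parameter names are hypothetical, not llama.cpp API:

#include <vulkan/vulkan.hpp>
#include <cstdio>

// Hypothetical debugging helper, not the actual ggml-vulkan code: report
// which shader broke instead of letting the whole model load die.
static vk::Pipeline try_create_compute_pipeline(vk::Device device,
                                                const vk::ComputePipelineCreateInfo &info,
                                                const char *shader_name) {
    try {
        // vulkan.hpp throws vk::SystemError for VK_ERROR_* results;
        // "ErrorUnknown" in the logs above is VK_ERROR_UNKNOWN from the driver.
        return device.createComputePipeline(vk::PipelineCache{}, info).value;
    } catch (const vk::SystemError &err) {
        std::fprintf(stderr, "shader '%s' failed to build: %s\n", shader_name, err.what());
        return vk::Pipeline{}; // null handle; caller can skip this shader
    }
}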

ElemenTP commented 1 week ago

Using llama.cpp's Vulkan backend with Adreno GPUs is known to be buggy; see https://github.com/ggerganov/llama.cpp/issues/5186#issuecomment-1960126390
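In the meantime, a workaround worth trying is to disable GPU offload entirely; -ngl 0 keeps every layer on the CPU, which may avoid the failing pipeline builds altogether, depending on whether the Vulkan backend still initializes in your build (same binary and model as earlier in the thread):

./bin/main -t 8 -m /data/local/tmp/stories260K.gguf -ngl 0 -n 128 -p "One day, Lily met"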