ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Fails to run in SYCL mode #6528

Closed: barolo closed this issue 4 months ago

barolo commented 7 months ago

Using the example script:


:: initializing oneAPI environment ...
   run-llama2.sh: BASH_VERSION = 5.2.26(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

./run-llama2.sh: line 22: [: -eq: unary operator expected
Log start
main: build = 2624 (c3724779)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/greggy/Develo/D/models/llama-2-7b.Q4_K_S.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0,000010
llama_model_loader: - kv  10:                          general.file_type u32              = 14
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  217 tensors
llama_model_loader: - type q5_K:    8 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 6,74 B
llm_load_print_meta: model size       = 3,59 GiB (4,58 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 1
ggml_init_sycl: GGML_SYCL_F16: no
[SYCL] call ggml_backend_sycl_print_sycl_devices
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|                 Intel(R) Iris(R) Xe Graphics|       1.3|         80|     512|     32|    14919602176|
| 1|    [opencl:gpu:0]|                 Intel(R) Iris(R) Xe Graphics|       3.0|         80|     512|     32|    14919602176|
| 2|    [opencl:cpu:0]|          12th Gen Intel(R) Core(TM) i5-1240P|       3.0|         16|    8192|     64|    16373895168|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         16|67108864|     64|    16373895168|
[SYCL] call ggml_backend_sycl_set_mul_device_mode
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:80
[SYCL] call ggml_backend_sycl_host_buffer_type
[SYCL] call ggml_backend_sycl_get_device_count
[SYCL] call ggml_backend_sycl_get_device_memory
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
llm_load_tensors: ggml ctx size =    0,22 MiB
[SYCL] call ggml_backend_sycl_host_buffer_type
[SYCL] call ggml_backend_sycl_host_buffer_type
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3607,06 MiB
llm_load_tensors:        CPU buffer size =    70,31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_backend_sycl_get_device_count
[SYCL] call ggml_backend_sycl_init
Using device 0 (Intel(R) Iris(R) Xe Graphics) as main device
[SYCL] call ggml_backend_sycl_get_device_count
llama_kv_cache_init:      SYCL0 KV buffer size =   256,00 MiB
llama_new_context_with_model: KV self size  =  256,00 MiB, K (f16):  128,00 MiB, V (f16):  128,00 MiB
[SYCL] call ggml_backend_sycl_host_buffer_type
llama_new_context_with_model:  SYCL_Host  output buffer size =     0,12 MiB
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_host_buffer_type
[SYCL] call ggml_backend_sycl_get_device_count
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0,00 MiB to 70,50 MiB
ggml_gallocr_reserve_n: reallocating SYCL_Host buffer from size 0,00 MiB to 9,01 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    70,50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
call ggml_sycl_rms_norm
Unexpected pattern!
UNREACHABLE executed at /dev-util/spirv-llvm-translator-15.0.0-r1/work/SPIRV-LLVM-Translator-15.0.0/lib/SPIRV/SPIRVUtil.cpp:2037!
The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Xe Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/D/llama.cpp/ggml-sycl.cpp, line:14897
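Aside: the "./run-llama2.sh: line 22: [: -eq: unary operator expected" message near the top is a separate shell bug, not the SYCL failure: it is the classic symptom of an unset variable inside a numeric [ ... ] test. A minimal sketch of the pattern and the fix, using a hypothetical variable name:

#!/usr/bin/env bash
# With DEVICE_ID unset, the test below expands to `[ -eq 0 ]` and bash prints
# "[: -eq: unary operator expected".
if [ $DEVICE_ID -eq 0 ]; then echo "main GPU"; fi
# Quoting the expansion and supplying a default keeps the test well-formed:
if [ "${DEVICE_ID:-0}" -eq 0 ]; then echo "main GPU"; fi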
NeoZhangJianyu commented 7 months ago

@barolo Could you try with the example model file llama-2-7b.Q4_0.gguf? It will help check the software/hardware in your PC.

I'm not sure it works well with llama-2-7b.Q4_K_S.gguf in your case. Maybe you could try the latest code; several issues with the IQ4/3/2/1 data types were fixed recently.
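For reference, a Q4_0 test file can also be produced locally with the quantize tool shipped in the repo, if an f16 GGUF is already on disk (paths are illustrative):

# Re-quantize an f16 GGUF to Q4_0 for this test (paths are illustrative).
./quantize models/llama-2-7b.f16.gguf models/llama-2-7b.Q4_0.gguf Q4_0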

barolo commented 7 months ago

> @barolo Could you try with the example model file llama-2-7b.Q4_0.gguf? It will help check the software/hardware in your PC.
>
> I'm not sure it works well with llama-2-7b.Q4_K_S.gguf in your case. Maybe you could try the latest code; several issues with the IQ4/3/2/1 data types were fixed recently.

I've used one of the models referred to by the docs: "Alternatively, if you want to save time and space, you can download already converted and quantized models from [TheBloke](https://huggingface.co/TheBloke)"...

I'm running the latest code, as you can tell from the commit in the log.

barolo commented 7 months ago

> @barolo Could you try with the example model file llama-2-7b.Q4_0.gguf? It will help check the software/hardware in your PC.
>
> I'm not sure it works well with llama-2-7b.Q4_K_S.gguf in your case. Maybe you could try the latest code; several issues with the IQ4/3/2/1 data types were fixed recently.

The error is identical with the "default" model [which was a pain to get] and with llama.cpp freshly built from git.
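(For context, a typical from-source SYCL build at the time of this thread looked roughly like the sketch below; the CMake option was named LLAMA_SYCL in this era and was later renamed GGML_SYCL.)

# Build llama.cpp with the SYCL backend using the oneAPI compilers.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j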

NeoZhangJianyu commented 7 months ago

@barolo
1. Could you share the whole log?
2. Run sycl-ls and share the output.

Thank you!

barolo commented 7 months ago

> @barolo
> 1. Could you share the whole log?
> 2. Run sycl-ls and share the output.
>
> Thank you!

What do you mean by the 'whole log'?

$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i5-1240P OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [24.05.028454]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28454]
NeoZhangJianyu commented 7 months ago

The log should include everything from the input command to where the final error appears.
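For example, the complete run can be captured into a single file like this (a sketch; the log file name is arbitrary):

# Capture stdout and stderr, from the input command to the final error.
./run-llama2.sh 2>&1 | tee full-run.log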

barolo commented 7 months ago

> The log should include everything from the input command to where the final error appears.

That's what I did when I posted the issue?

NeoZhangJianyu commented 7 months ago

Your failure log is for the model /home/greggy/Develo/D/models/llama-2-7b.Q4_K_S.gguf. Could you try llama-2-7b.Q4_0.gguf? I want to confirm whether your issue is about the model or the hardware/software environment.

dikei100 commented 6 months ago

Same issue here, with llama-2-7b.Q4_0.gguf

user@Notebook:~/llama.cpp$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Log start
main: build = 2967 (b18532a4)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed  = 1716394215
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0,000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6,74 B
llm_load_print_meta: model size       = 3,56 GiB (4,54 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 30518M|            1.3.28717|
| 1|     [opencl:gpu:0]|                     Intel Arc Graphics|    3.0|    128|    1024|   32| 30518M|       24.09.28717.17|
| 2|     [opencl:cpu:0]|                Intel Core Ultra 7 155H|    3.0|     22|    8192|   64| 32968M|2024.17.3.0.08_160000|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     22|67108864|   64| 32968M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:128
llm_load_tensors: ggml ctx size =    0,30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577,56 MiB
llm_load_tensors:        CPU buffer size =    70,31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   256,00 MiB
llama_new_context_with_model: KV self size  =  256,00 MiB, K (f16):  128,00 MiB, V (f16):  128,00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0,12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    70,50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/user/llama.cpp/ggml-sycl.cpp, line:14836
NeoZhangJianyu commented 5 months ago

@dikei100 What's the hardware in your case? I guess it's an MTL (Meteor Lake) Arc iGPU. What's the OS?

I can't reproduce your case, but my driver is 1.3.28202 and yours is 1.3.28717.

Chimrod commented 5 months ago

Hello, I also have this issue, using Debian.

> Unexpected pattern!
> UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
> The program was built for 1 devices
> Build program log for 'Intel(R) Arc(TM) A770 Graphics':
>  -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/llama.cpp/ggml-sycl.cpp, line:14368

I'm using Debian stable and the latest drivers provided by Intel. This is the output of sycl-ls:

$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 9 7950X 16-Core Processor             OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [24.13.029138]
[opencl:cpu:3] Intel(R) OpenCL, AMD Ryzen 9 7950X 16-Core Processor             OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:acc:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.29138]
Chimrod commented 5 months ago

The driver version here comes from the package provided by Debian; no other version is available.

dikei100 commented 5 months ago

> @dikei100 What's the hardware in your case? I guess it's an MTL (Meteor Lake) Arc iGPU. What's the OS?
>
> I can't reproduce your case, but my driver is 1.3.28202 and yours is 1.3.28717.

@NeoZhangJianyu Yes, I am using a notebook with a Meteor Lake Intel Core Ultra 7 155H CPU and the integrated Arc iGPU. The OS is Fedora 40 with the KDE Plasma desktop environment.

NeoZhangJianyu commented 5 months ago

I borrowed a laptop with Meteor Lake and the latest Windows driver (1.3.29283). It's OK, but I have no Linux Meteor Lake machine to verify on.

I will verify it on an Arc 770 on Ubuntu 22.04 with the latest driver. It looks like the issue is about the driver or Level Zero.

Could you try with the latest GPU driver, or an older one?

Several similar cases have been fixed by updating (or rolling back) the GPU driver.
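On Ubuntu/Debian, rolling the compute runtime back to a specific build can be sketched like this (the package names are Intel's standard ones; the version strings are illustrative):

# Install a known-good compute runtime build and pin it (versions illustrative).
sudo apt install intel-opencl-icd=23.43.27642.40 intel-level-zero-gpu=1.3.27642.40
sudo apt-mark hold intel-opencl-icd intel-level-zero-gpu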

PhilippeRo commented 5 months ago

I also have the same problem. I'm on Fedora 40. Some information about the setup is below; I can provide more if need be.

The final error:

The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Xe Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/philippe/Téléchargements/llama.cpp/ggml-sycl.cpp, line:14368

llama.cpp/build-syscl$ bin/ls-sycl-device
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 14678M|            1.3.28717|
| 1|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 14678M|       24.09.28717.17|
| 2|     [opencl:cpu:0]|11th Gen Intel Core i7-1165G7 @ 2.80GHz|    3.0|      8|    8192|   64| 16117M|2024.17.3.0.08_160000|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|      8|67108864|   64| 16117M|2024.17.3.0.08_160000|

llama.cpp/build-syscl$ rpm -qa | grep level-zero
intel-level-zero-24.09.28717.17-1.fc40.x86_64
oneapi-level-zero-1.16.1-1.fc40.x86_64
oneapi-level-zero-devel-1.16.1-1.fc40.x86_64

llama.cpp/build-syscl$ rpm -qa | grep compute
intel-compute-runtime-24.09.28717.17-1.fc40.x86_64

NeoZhangJianyu commented 5 months ago

What's the oneAPI Base Toolkit version? I recommend 2024.1 (the latest).

PhilippeRo commented 5 months ago

It is 2024.1 as far as I can tell.

$ rpm -q intel-basekit
intel-basekit-2024.1.0-589.x86_64

Chimrod commented 5 months ago

Same here:

$ apt list --installed | grep intel-basekit
intel-basekit-env-2024.1/all,now 2024.1.0-589 all [installed,automatic]
intel-basekit-getting-started-2024.1/all,now 2024.1.0-589 all [installed,automatic]
intel-basekit/all,now 2024.1.0-589 amd64 [installed]
NeoZhangJianyu commented 5 months ago

I see. I guess it's a driver or Level Zero runtime issue. I have no Debian or Fedora MTL PC to check this issue on.

My suggestion is to change the driver and Level Zero one at a time (newer or older). The code can't do anything about this issue.

barolo commented 5 months ago

> I see. I guess it's a driver or Level Zero runtime issue. I have no Debian or Fedora MTL PC to check this issue on.
>
> My suggestion is to change the driver and Level Zero one at a time (newer or older). The code can't do anything about this issue.

What do you have? It would be helpful to know what it is supposed to work with [Windows excluded].

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

Chimrod commented 4 months ago

Hello, I see the issue was automatically closed, but the problem persists.

found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
| 1|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|         24.22.029735|
| 2|     [opencl:cpu:0]|AMD Ryzen 9 7950X 16-Core Processor            |    3.0|     32|    8192|   64| 32824M|2024.18.6.0.02_160000|
llama_kv_cache_init:      SYCL1 KV buffer size =  1024,00 MiB
llama_new_context_with_model: KV self size  = 1024,00 MiB, K (f16):  512,00 MiB, V (f16):  512,00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0,49 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   560,00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    24,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
Unexpected pattern!
UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
IGC: Internal Compiler Error: Abnormal termination -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/llama.cpp/ggml/src/ggml-sycl.cpp, line:2885

I've upgraded libze to the latest binary provided by Debian (24.22.29735.21), but there is no change.

NeoZhangJianyu commented 4 months ago

@Chimrod I see you used the device [opencl:gpu:0]. OpenCL has some issues. Please use the Level Zero device instead: SYCL0, i.e. [level_zero:gpu:0].
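One way to make sure only the Level Zero device is visible to the runtime is oneAPI's standard device-selector variable (a sketch, combined with the flags already used in this thread):

# Expose only the first Level Zero GPU to SYCL, then run entirely on it.
export ONEAPI_DEVICE_SELECTOR=level_zero:0
./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -ngl 33 -sm none -mg 0 \
    -p "Building a website can be done in 10 simple steps:"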

Chimrod commented 4 months ago

Sorry, I've run several tests with the available GPUs, and the latest copy/paste didn't use the Level Zero device.

This is the result using --main-gpu 0:

[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
| 1|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|         24.22.029735|
| 2|     [opencl:cpu:0]|AMD Ryzen 9 7950X 16-Core Processor            |    3.0|     32|    8192|   64| 32824M|2024.18.6.0.02_160000|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024,00 MiB
llama_new_context_with_model: KV self size  = 1024,00 MiB, K (f16):  512,00 MiB, V (f16):  512,00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0,49 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   560,00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    24,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
Unexpected pattern!
UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/llama.cpp/ggml/src/ggml-sycl.cpp, line:2885
NeoZhangJianyu commented 4 months ago

Could you try with the stable release? Commit ID: fb76ec31a9914b7761c1727303ab30380fd4f05c. If the issue persists, could you share the whole log here?

Chimrod commented 4 months ago

Sure, this is the log.

build/bin/llama-cli -m Meta-Llama-3-8B-Instruct.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -ngl 33 --split-mode none --main-gpu 0 --verbose 2> out.log

out.log.txt

Eugeniusz-Gienek commented 3 months ago

Same issue here... any updates?