ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: InvalidModule: Invalid SPIR-V module: input SPIR-V module uses extension 'SPV_INTEL_memory_access_aliasing' which were disabled by --spirv-ext option #8551

Closed: jpoly1219 closed this issue 17 hours ago

jpoly1219 commented 1 month ago

What happened?

Currently on Fedora 40 with Intel Arc A750.

Running the following:

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-server \
-t 10 \
-ngl 20 \
-b 512 \
--ctx-size 16384 \
-m ~/llama-models/llama-2-7b.Q4_0.gguf \
--color -c 3400 \
--seed 42 \
--temp 0.8 \
--top_k 5 \
--repeat_penalty 1.1 \
--host :: \
--port 8080 \
-n -1 \
-sm none -mg 0

gives the following output:

INFO [                    main] build info | tid="140466031364096" timestamp=1721277895 build=3411 commit="e02b597b"
INFO [                    main] system info | tid="140466031364096" timestamp=1721277895 n_threads=10 n_threads_batch=-1 total_threads=28 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | "
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/jacob/llama-models/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 5 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  2171.88 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 3424
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 5 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.28717|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 62662M|            1.3.28717|
| 2|     [opencl:gpu:0]|                Intel Arc A750 Graphics|    3.0|    448|    1024|   32|  8096M|       24.09.28717.17|
| 3|     [opencl:gpu:1]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 62662M|       24.09.28717.17|
| 4|     [opencl:cpu:0]|                   Intel Core i7-14700K|    3.0|     28|    8192|   64| 67164M|2024.18.6.0.02_160000|
llama_kv_cache_init:      SYCL0 KV buffer size =  1070.00 MiB
llama_kv_cache_init:  SYCL_Host KV buffer size =   642.00 MiB
llama_new_context_with_model: KV self size  = 1712.00 MiB, K (f16):  856.00 MiB, V (f16):  856.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.24 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   300.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    22.69 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 136
InvalidModule: Invalid SPIR-V module: input SPIR-V module uses extension 'SPV_INTEL_memory_access_aliasing' which were disabled by --spirv-ext option

I'm not sure where the --spirv-ext option is set, but it seems like a compiler flag. What can I do to fix this?

I ran the following to set this up, and the build did not fail:

# Export relevant ENV variables
source /opt/intel/oneapi/setvars.sh

# Build LLAMA with MKL BLAS acceleration for intel GPU

# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Option 2: Use FP16
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON

# build all binaries
cmake --build build --config Release -j -v

This is my result for clinfo:

Platform #0: Intel(R) OpenCL
 `-- Device #0: Intel(R) Core(TM) i7-14700K
Platform #1: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A750 Graphics
Platform #2: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) UHD Graphics 770

This is my result for ./build/bin/llama-ls-sycl-device:

found 5 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.28717|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 62662M|            1.3.28717|
| 2|     [opencl:gpu:0]|                Intel Arc A750 Graphics|    3.0|    448|    1024|   32|  8096M|       24.09.28717.17|
| 3|     [opencl:gpu:1]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 62662M|       24.09.28717.17|
| 4|     [opencl:cpu:0]|                   Intel Core i7-14700K|    3.0|     28|    8192|   64| 67164M|2024.18.6.0.02_160000|

Name and Version

~/llama.cpp$ ./build/bin/llama-server --version

version: 3411 (e02b597b) built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.0 (2024.2.0.20240602) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

airMeng commented 1 month ago

can you reproduce the issue via llama-cli?

jpoly1219 commented 1 month ago

Just tried it, same results:

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ~/llama-models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0

Log start
main: build = 3411 (e02b597b)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.0 (2024.2.0.20240602) for x86_64-unknown-linux-gnu
main: seed  = 1721288586
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/jacob/llama-models/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 5 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 5 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.28717|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 62662M|            1.3.28717|
| 2|     [opencl:gpu:0]|                Intel Arc A750 Graphics|    3.0|    448|    1024|   32|  8096M|       24.09.28717.17|
| 3|     [opencl:gpu:1]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 62662M|       24.09.28717.17|
| 4|     [opencl:cpu:0]|                   Intel Core i7-14700K|    3.0|     28|    8192|   64| 67164M|2024.18.6.0.02_160000|
llama_kv_cache_init:      SYCL0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
InvalidModule: Invalid SPIR-V module: input SPIR-V module uses extension 'SPV_INTEL_memory_access_aliasing' which were disabled by --spirv-ext option
airMeng commented 1 month ago

Can you reproduce it with a unit test (UT)? For example:

./bin/test-backend-ops -b SYCL0 
airMeng commented 1 month ago

I did some searching and found related issues on the intel/llvm repo. @MrSidims, I saw a similar issue at https://github.com/intel/llvm/pull/4025#issuecomment-870823000. Could you give us some guidance?

jpoly1219 commented 1 month ago

Same results:

./bin/test-backend-ops -b SYCL0 

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 5 SYCL devices:
Testing 6 backends

Backend 1/6 (CPU)
  Skipping
Backend 2/6 (SYCL0)
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 5 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.28717|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 62662M|            1.3.28717|
| 2|     [opencl:gpu:0]|                Intel Arc A750 Graphics|    3.0|    448|    1024|   32|  8096M|       24.09.28717.17|
| 3|     [opencl:gpu:1]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 62662M|       24.09.28717.17|
| 4|     [opencl:cpu:0]|                   Intel Core i7-14700K|    3.0|     28|    8192|   64| 67164M|2024.18.6.0.02_160000|
  Backend name: SYCL0
  ABS(type=f32,ne_a=[128,10,10,10],v=0): not supported [SYCL0]
  ABS(type=f32,ne_a=[7,13,19,23],v=0): not supported [SYCL0]
  SGN(type=f32,ne_a=[128,10,10,10],v=0): not supported [SYCL0]
  SGN(type=f32,ne_a=[7,13,19,23],v=0): not supported [SYCL0]
  NEG(type=f32,ne_a=[128,10,10,10],v=0): not supported [SYCL0]
  NEG(type=f32,ne_a=[7,13,19,23],v=0): not supported [SYCL0]
  STEP(type=f32,ne_a=[128,10,10,10],v=0): not supported [SYCL0]
  STEP(type=f32,ne_a=[7,13,19,23],v=0): not supported [SYCL0]
  TANH(type=f32,ne_a=[128,10,10,10],v=0): InvalidModule: Invalid SPIR-V module: input SPIR-V module uses extension 'SPV_INTEL_memory_access_aliasing' which were disabled by --spirv-ext option
airMeng commented 1 month ago

OK, can you run the TANH test alone and see whether it crashes each time?

./bin/test-backend-ops -b SYCL0 -o TANH
jpoly1219 commented 1 month ago

Here are the results:

./build/bin/test-backend-ops -b SYCL0 -o TANH

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 6 SYCL devices:
Testing 7 backends

Backend 1/7 (CPU)
  Skipping
Backend 2/7 (SYCL0)
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.28717|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 62662M|            1.3.28717|
| 2|     [opencl:gpu:0]|                Intel Arc A750 Graphics|    3.0|    448|    1024|   32|  8096M|       24.09.28717.17|
| 3|     [opencl:gpu:1]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 62662M|       24.09.28717.17|
| 4|     [opencl:cpu:0]|                   Intel Core i7-14700K|    3.0|     28|    8192|   64| 67164M|2024.18.6.0.02_160000|
| 5|     [opencl:cpu:1]|                   Intel Core i7-14700K|    3.0|     28|    8192|   64| 67164M|2024.18.6.0.02_160000|
  Backend name: SYCL0
  TANH(type=f32,ne_a=[128,10,10,10],v=0): OK
  TANH(type=f32,ne_a=[7,13,19,23],v=0): OK
  TANH(type=f32,ne_a=[128,10,10,10],v=1): not supported [SYCL0]
  TANH(type=f32,ne_a=[7,13,19,23],v=1): not supported [SYCL0]
  1334/1334 tests passed
  Backend SYCL0: OK

Backend 3/7 (SYCL1)
  Skipping
Backend 4/7 (SYCL2)
  Skipping
Backend 5/7 (SYCL3)
  Skipping
Backend 6/7 (SYCL4)
  Skipping
Backend 7/7 (SYCL5)
  Skipping
7/7 backends passed
OK
airMeng commented 1 month ago

It seems there might be a more fundamental issue causing this. As a temporary solution, could you please try updating your driver, operating system, kernel, and oneAPI? This might address the problem.
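
For reference, a few version checks (a sketch assuming standard oneAPI and clinfo tooling, both already present on this system) can confirm what is installed before and after updating:

# list SYCL backends with their driver versions (ships with oneAPI)
sycl-ls
# OpenCL driver versions as reported by clinfo
clinfo | grep -i 'driver version'
# running kernel version
uname -r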

jpoly1219 commented 1 month ago

I couldn't find instructions for installing client GPU drivers on Fedora 40. The website at https://dgpu-docs.intel.com/driver/client/overview.html only has Ubuntu instructions. The SYCL README, however, says it was tested on Fedora Silverblue. How did they install the GPU drivers?

jpoly1219 commented 1 month ago

Update: I tried using the provided Docker image, but I get this error:

/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /app/build/src/libllama.so)
/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /app/build/ggml/src/libggml.so)
/llama-cli: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /app/build/ggml/src/libggml.so)

My system has glibc 2.39 installed, as seen here:

ldd --version

ldd (GNU libc) 2.39

I'm not sure if there is a way to downgrade my glibc. That feels like a dangerous thing to do, as I assume many system components rely on it.
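
For what it's worth, here is a quick way to list the glibc symbol versions a binary actually requires (a sketch using standard binutils; the path is the library from the error above, inside the container):

# enumerate the GLIBC_x.y versions libllama.so was linked against
objdump -T /app/build/src/libllama.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu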

MrSidims commented 1 month ago

To clarify, do you compile just-in-time or ahead-of-time? If the compilation happens in JIT mode, does the error come from compilation or during execution of the application? If ahead-of-time, may I ask you to pass -### or -v to the compilation command and see at which stage the error is generated?
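
For llama.cpp's CMake build, threading -v through to the compiler could look like this minimal sketch (standard CMake cache variables; build-verbose.log is just a suggested capture file):

# rebuild with the DPC++ driver in verbose mode to expose each toolchain stage
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_CXX_FLAGS="-v"
cmake --build build --config Release -j -v 2>&1 | tee build-verbose.log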

My difficulty in understanding what is going on comes down to these facts:

  1. The error comes from the SPIR-V to LLVM IR translator, which is invoked twice: first during SYCL frontend compilation, and second during SPIR-V consumption by the Intel Graphics Compiler (IGC). For the frontend part, the SYCL frontend passes the spirv-ext option in the form --spirv-ext=-all,+SPV_extension_name1,+SPV_extension_name2, etc., so all extensions are disabled by default and those used in SYCL are then enabled one by one (see the sketch after this list).
  2. While this particular extension (used to translate the aliasing metadata that strict aliasing produces) is not enabled in https://github.com/intel/llvm (which I actually need to fix, thanks for pointing it out!), note that: a. the extension has been enabled in the oneAPI DPC++ compiler for many releases; b. when the extension is not enabled, the metadata is simply ignored by the translator and no error is emitted (otherwise all of our tests would crash with optimizations enabled). So I can't see how this error could be emitted during the frontend compilation phase (i.e., in JIT, or in AOT before calling IGC).
  3. AFAIK IGC doesn't pass extension-control options to the translator, as it's too late at that point, but I might be mistaken. Also, AFAIK this extension is supported by IGC/the GPU driver, so the driver on the system may simply be outdated.
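
As an illustration of that option format, a standalone run of the SPIR-V translator might look like the following sketch (input.bc stands in for any LLVM bitcode module):

# disable all SPIR-V extensions, then re-enable only the one in question
llvm-spirv input.bc -o output.spv --spirv-ext=-all,+SPV_INTEL_memory_access_aliasing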

So if you can share compilation logs from the compiler's verbose mode, it could help me understand the problem and nail it down.

airMeng commented 1 month ago

Hi @MrSidims, thank you for your quick reply. I can confirm that no AOT option is currently set. The whole compilation setup is here: https://github.com/ggerganov/llama.cpp/blob/0d2c7321e9678e91b760ebe57f0d063856bb018b/ggml/src/CMakeLists.txt#L465-L518

If the compilation happens in JIT mode, does the error come from compilation or during execution of the application? If ahead-of-time, may I ask you to pass -### or -v to the compilation command and see at which stage the error is generated?

I think the user encounters the issue during execution.

jpoly1219 commented 1 month ago

cmake --log-level=VERBOSE -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

-- The C compiler identification is IntelLLVM 2024.2.0
-- The CXX compiler identification is IntelLLVM 2024.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/intel/oneapi/compiler/2024.2/bin/icx - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2024.2/bin/icpx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.45.2") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found OpenMP_C: -fiopenmp (found version "5.1") 
-- Found OpenMP_CXX: -fiopenmp (found version "5.1") 
-- Found OpenMP: TRUE (found version "5.1")  
-- OpenMP found
-- Using llamafile
-- Found IntelSYCL: /opt/intel/oneapi/compiler/2024.2/include (found version "202001") 
-- MKL_VERSION: 2024.2.0
-- MKL_ROOT: /opt/intel/oneapi/mkl/2024.2
-- MKL_SYCL_ARCH: None, set to ` intel64` by default
-- MKL_ARCH: None, set to ` intel64` by default
-- MKL_SYCL_LINK: None, set to ` dynamic` by default
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_SYCL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_SYCL_THREADING: None, set to ` tbb_thread` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_ilp64.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_cdft_core.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_intel_ilp64.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_intel_thread.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_core.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_blacs_intelmpi_ilp64.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_blas.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_lapack.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_dft.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_sparse.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_data_fitting.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_rng.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_stats.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_vm.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_tbb_thread.so
-- Found /opt/intel/oneapi/compiler/2024.2/lib/libiomp5.so
-- SYCL found
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (1.8s)
-- Generating done (0.1s)
-- Build files have been written to: /home/jacob/llama.cpp/build

The output of the command cmake --build build --config Release -j -v is attached as out.txt.

I am compiling following these instructions: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md

Like @airMeng said, I don't think the issue arises during compilation, but rather during execution.

Thank you so much, @airMeng and @MrSidims!

airMeng commented 1 month ago

@MrSidims any thoughts?

MrSidims commented 1 month ago

If the error happens at runtime, then it comes from the GPU driver, and it's probably better to submit an issue on the IGC GitHub. Also tagging @AGindinson, who might have some insights (but it's still better to open an issue).

Please also check the driver's version, as AFAIK this extension is supported by IGC/the GPU driver. The oneAPI documentation says "please install the latest GPU driver", which IMHO is not the best way to describe dependencies, but at least it works.

AGindinson commented 1 month ago

I'd actually recommend submitting an issue to Compute Runtime first: the first thing worth checking is whether the issue lies in their IGC interface code.

If there's any way to experiment with GPU-targeted AOT compilation of DPC++ sources (including -v/-### as mentioned above), it could be very helpful for triage.
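
Such an AOT experiment might look like the following sketch (the spir64_gen target and -Xsycl-target-backend flag follow DPC++ conventions; acm-g10 is assumed to be the A750's device id, and kernel.cpp is a placeholder source file):

# ahead-of-time compile for the Arc GPU, with -v exposing each toolchain stage
icpx -fsycl -v -fsycl-targets=spir64_gen -Xsycl-target-backend=spir64_gen "-device acm-g10" kernel.cpp -o kernel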

github-actions[bot] commented 17 hours ago

This issue was closed because it has been inactive for 14 days since being marked as stale.