Closed: barolo closed this issue 4 months ago.
@barolo Could you try with the example model file llama-2-7b.Q4_0.gguf? It will help check the software/hardware environment on your PC.
I'm not sure it works well with llama-2-7b.Q4_K_S.gguf in your case. Maybe you could try with the latest code; several issues with the IQ4/IQ3/IQ2/IQ1 data types were fixed recently.
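For reference, a minimal sketch of that test, reusing the flags from the failing run quoted later in this thread (the Hugging Face repo and file names are assumptions based on TheBloke's listing):

```sh
# Assumed repo/file names; adjust paths to your local layout.
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf \
    --local-dir models
# Re-run with the reference Q4_0 model and the same flags as before:
ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf \
    -p "Building a website can be done in 10 simple steps:" \
    -n 400 -e -ngl 33 -sm none -mg 0
```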
I've used one of the models referred to by the docs:
Alternatively, if you want to save time and space, you can download already converted and quantized models from [TheBloke](https://huggingface.co/TheBloke)...
I'm running the latest code, as you can tell from the commit in the log.
The error is identical with the "default" model [which was a pain to get] and llama.cpp freshly built from git.
@barolo 1. Could you share the whole log?
2. Could you share the output of sycl-ls?
Thank you!
What do you mean by the 'whole log'?
sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i5-1240P OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [24.05.028454]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28454]
The log includes everything from the input command to the final error.
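For example, a minimal way to capture such a log (a sketch; the redirection mirrors the `--verbose 2> out.log` capture used later in this thread):

```sh
# Save everything from the invocation through the final error:
ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf \
    -p "Building a website can be done in 10 simple steps:" \
    -n 400 -e -ngl 33 -sm none -mg 0 2>&1 | tee full.log
```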
That's what I did when I posted the issue?
Your failure log is for the model /home/greggy/Develo/D/models/llama-2-7b.Q4_K_S.gguf. Could you try llama-2-7b.Q4_0.gguf? I want to confirm whether your issue is caused by the model or by the hardware/software environment.
Same issue here, with llama-2-7b.Q4_0.gguf
user@Notebook:~/llama.cpp$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Log start
main: build = 2967 (b18532a4)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed = 1716394215
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0,000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-05
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale = 0,0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6,74 B
llm_load_print_meta: model size = 3,56 GiB (4,54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 128| 1024| 32| 30518M| 1.3.28717|
| 1| [opencl:gpu:0]| Intel Arc Graphics| 3.0| 128| 1024| 32| 30518M| 24.09.28717.17|
| 2| [opencl:cpu:0]| Intel Core Ultra 7 155H| 3.0| 22| 8192| 64| 32968M|2024.17.3.0.08_160000|
| 3| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 22|67108864| 64| 32968M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:128
llm_load_tensors: ggml ctx size = 0,30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 3577,56 MiB
llm_load_tensors: CPU buffer size = 70,31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 256,00 MiB
llama_new_context_with_model: KV self size = 256,00 MiB, K (f16): 128,00 MiB, V (f16): 128,00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0,12 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 70,50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 9,01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) Graphics':
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/user/llama.cpp/ggml-sycl.cpp, line:14836
@dikei100 What's the hardware info in your case? I guess it's an MTL Arc iGPU. What's the OS?
I can't reproduce your case, but my driver is 1.3.28202; yours is 1.3.28717.
Hello, I also have this issue, using Debian.
Unexpected pattern!
UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/llama.cpp/ggml-sycl.cpp, line:14368
I’m using Debian stable and the latest drivers provided by Intel; this is the output of sycl-ls:
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 9 7950X 16-Core Processor OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.13.029138]
[opencl:cpu:3] Intel(R) OpenCL, AMD Ryzen 9 7950X 16-Core Processor OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:acc:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.29138]
The driver version here comes from the package provided by debian and there is no other version provided.
@NeoZhangJianyu Yes, I am using a notebook with a Meteor Lake Intel Core Ultra 7 155H CPU and the integrated Arc iGPU. The OS is Fedora 40 with the KDE Plasma desktop environment.
I borrowed a laptop with Meteor Lake and the latest Windows driver: 1.3.29283. It's OK. But I have no Linux Meteor Lake machine to verify it.
I will verify it on an Arc 770 on Ubuntu 22.04 with the latest driver. It looks like the issue is about the driver or level-zero.
Could you try with a newer or an older GPU driver?
Several similar cases were fixed by updating (or rolling back) the GPU driver.
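To see which driver and level-zero runtime are installed before swapping them one by one, something like the following can help (the package name patterns are assumptions; the Fedora ones match the rpm output shown later in this thread):

```sh
# Debian/Ubuntu:
dpkg -l | grep -E 'intel-opencl|level-zero'
# Fedora:
rpm -qa | grep -E 'compute-runtime|level-zero'
# sycl-ls also reports the driver version per device:
sycl-ls
```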
I also have the same problem. I'm on Fedora 40. Some information about the setup is below; I can provide more if need be.
The final error:
The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Xe Graphics':
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/philippe/Téléchargements/llama.cpp/ggml-sycl.cpp, line:14368
llama.cpp/build-syscl$ bin/ls-sycl-device
found 4 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Iris Xe Graphics| 1.3| 96| 512| 32| 14678M| 1.3.28717|
| 1| [opencl:gpu:0]| Intel Iris Xe Graphics| 3.0| 96| 512| 32| 14678M| 24.09.28717.17|
| 2| [opencl:cpu:0]| 11th Gen Intel Core i7-1165G7 @ 2.80GHz| 3.0| 8| 8192| 64| 16117M|2024.17.3.0.08_160000|
| 3| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 8|67108864| 64| 16117M|2024.17.3.0.08_160000|
llama.cpp/build-syscl$ rpm -qa | grep level-zero
intel-level-zero-24.09.28717.17-1.fc40.x86_64
oneapi-level-zero-1.16.1-1.fc40.x86_64
oneapi-level-zero-devel-1.16.1-1.fc40.x86_64
llama.cpp/build-syscl$ rpm -qa | grep compute
intel-compute-runtime-24.09.28717.17-1.fc40.x86_64
What's the oneAPI Base Toolkit version? I recommend 2024.1 (the latest).
It is 2024.1 as far as I can tell.
$ rpm -q intel-basekit
intel-basekit-2024.1.0-589.x86_64
Same here:
$ apt list --installed | grep intel-basekit
intel-basekit-env-2024.1/all,now 2024.1.0-589 all [installed,automatic]
intel-basekit-getting-started-2024.1/all,now 2024.1.0-589 all [installed,automatic]
intel-basekit/all,now 2024.1.0-589 amd64 [installed]
I see. I guess it's a driver or level-zero runtime issue. I have no Debian or Fedora MTL PC to check this issue.
My suggestion is to change the driver and level-zero one by one (newer or older). The code can do nothing about this issue.
What do you have? It would be helpful to know what it is supposed to work with [Windows excluded].
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hello, I’ve seen the issue was automatically closed, but the issue persists.
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.29735|
| 1| [opencl:gpu:0]| Intel Arc A770 Graphics| 3.0| 512| 1024| 32| 16225M| 24.22.029735|
| 2| [opencl:cpu:0]|AMD Ryzen 9 7950X 16-Core Processor | 3.0| 32| 8192| 64| 32824M|2024.18.6.0.02_160000|
llama_kv_cache_init: SYCL1 KV buffer size = 1024,00 MiB
llama_new_context_with_model: KV self size = 1024,00 MiB, K (f16): 512,00 MiB, V (f16): 512,00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0,49 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 560,00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 24,01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
Unexpected pattern!
UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
IGC: Internal Compiler Error: Abnormal termination -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/llama.cpp/ggml/src/ggml-sycl.cpp, line:2885
I’ve upgraded libze to the latest binary provided by Debian (24.22.29735.21) but there is no change here.
@Chimrod I see you use the device [opencl:gpu:0]. OpenCL has some issues. Please use the level-zero device: SYCL0: [level_zero:gpu:0].
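One way to make sure only the level-zero device is visible to the runtime is the standard DPC++ device filter (a sketch; the model and flags are taken from the command shared below):

```sh
# Expose only the first Level Zero GPU to SYCL, then pin it:
ONEAPI_DEVICE_SELECTOR=level_zero:0 ./build/bin/llama-cli \
    -m Meta-Llama-3-8B-Instruct.Q4_0.gguf \
    -p "Building a website can be done in 10 simple steps:" \
    -ngl 33 --split-mode none --main-gpu 0
```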
Sorry, I’ve run different tests with the available GPUs, and the latest copy/paste didn't use the level-zero device. This is the result using --main-gpu 0:
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.29735|
| 1| [opencl:gpu:0]| Intel Arc A770 Graphics| 3.0| 512| 1024| 32| 16225M| 24.22.029735|
| 2| [opencl:cpu:0]|AMD Ryzen 9 7950X 16-Core Processor | 3.0| 32| 8192| 64| 32824M|2024.18.6.0.02_160000|
llama_kv_cache_init: SYCL0 KV buffer size = 1024,00 MiB
llama_new_context_with_model: KV self size = 1024,00 MiB, K (f16): 512,00 MiB, V (f16): 512,00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0,49 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 560,00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 24,01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
Unexpected pattern!
UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/llama.cpp/ggml/src/ggml-sycl.cpp, line:2885
Could you try with the stable release? Commit ID: fb76ec31a9914b7761c1727303ab30380fd4f05c. If the issue persists, could you share the whole log here?
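A minimal sketch of checking out that commit and rebuilding the SYCL backend (the cmake flags are assumptions taken from the project's SYCL build guide; trees older than the ggml/ reorganization used LLAMA_SYCL instead of GGML_SYCL):

```sh
git fetch origin
git checkout fb76ec31a9914b7761c1727303ab30380fd4f05c
# Assumes the oneAPI environment is already sourced:
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON \
    -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
```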
Sure, this is the log:
build/bin/llama-cli -m Meta-Llama-3-8B-Instruct.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -ngl 33 --split-mode none --main-gpu 0 --verbose 2> out.log
Same issue here... any updates?
I'm using the example script.