The workaround works well for me.
By default, mmap is used to read the model file. In some cases this causes runtime hangs. Please disable it by passing --no-mmap to main.exe if you run into this issue.
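For example (the model path here is just a placeholder; substitute your own GGUF file):
# --no-mmap reads the model into memory directly instead of memory-mapping the file
$ ./build/bin/main -m <model>.gguf --no-mmap -ngl 33 -p "Building a website can be done in 10 simple steps:"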
@shailesh837 It's because only a single GPU is used in this case, and its memory is not enough to load the whole LLM.
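If that is what is happening, one thing to try is offloading fewer layers so the weights fit in GPU memory; the -ngl value below is only an example, lower it until the model loads:
# offload only part of the layers to the single GPU selected with -mg 0
$ ./build/bin/main -m <model>.gguf -ngl 10 -sm none -mg 0 -p "Building a website can be done in 10 simple steps:"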
I'm seeing this same error on the Intel Iris Xe (Raptor Lake-P) built into my 13th Gen Intel Core i7-1360P.
I have 64 GB of system RAM, and it seems like the GPU should be permitted to use up to half of that.
ls-sycl-device shows 53705M available:
$ ./build/bin/ls-sycl-device
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [opencl:gpu:0]| Intel Iris Xe Graphics| 3.0| 96| 512| 32| 53705M| 23.35.027191|
| 1| [opencl:cpu:0]| 13th Gen Intel Core i7-1360P| 3.0| 16| 8192| 64| 67131M|2024.17.3.0.08_160000|
| 2| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 16|67108864| 64| 67131M|2024.17.3.0.08_160000|
I tested with the tinyllama-1.1b model, which is the smallest I have:
$ du -hs models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf
746M models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf
A full dump of the invocation follows:
$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:" -e -ngl 33 -sm none -mg 0
Log start
main: build = 2806 (c780e753)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed = 1715151706
llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = py007_tinyllama-1.1b-chat-v0.3
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32003] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32003] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32003] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q5_K: 135 tensors
llama_model_loader: - type q6_K: 21 tensors
llm_load_vocab: special tokens definition check successful ( 262/32003 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32003
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 745.12 MiB (5.68 BPW)
llm_load_print_meta: general.name = py007_tinyllama-1.1b-chat-v0.3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32002 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [opencl:gpu:0]| Intel Iris Xe Graphics| 3.0| 96| 512| 32| 53705M| 23.35.027191|
| 1| [opencl:cpu:0]| 13th Gen Intel Core i7-1360P| 3.0| 16| 8192| 64| 67131M|2024.17.3.0.08_160000|
| 2| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 16|67108864| 64| 67131M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:96
llm_load_tensors: ggml ctx size = 0.20 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: SYCL0 buffer size = 702.15 MiB
llm_load_tensors: CPU buffer size = 42.97 MiB
.....................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.12 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 66.51 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 5.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 2
Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
@khimaros You should install Level Zero. The current run is using OpenCL, which is not supported now, and that is what causes this error.
Additionally, could you report it as a new issue? More supporters will see it easily there.
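On Ubuntu this is roughly the following, after adding Intel's graphics package repository; the package names follow Intel's client GPU guide and may differ per distribution or release:
# Level Zero loader and the Intel Level Zero GPU driver
$ sudo apt install level-zero intel-level-zero-gpu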
@NeoZhangJianyu is Level Zero available for the iGPU? I'm looking at the installation instructions, but they only mention data center and Arc GPUs.
@NeoZhangJianyu thanks. It's working after installing the driver based on the instructions here: https://dgpu-docs.intel.com/driver/client/overview.html
Maybe this could be added to README-sycl.md?
Performance seems to be slower than pure CPU though, so it may not be worth it on my specific hardware setup.
@khimaros It's great!
When targeting an Intel GPU, the user should expect one or more Level Zero devices among the available SYCL devices. Please make sure that at least one GPU is present, for instance [ext_oneapi_level_zero:gpu:0] in the device listing.
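A quick way to check is to list the SYCL devices after sourcing the oneAPI environment; at least one level_zero GPU entry should show up:
$ source /opt/intel/oneapi/setvars.sh
$ sycl-ls | grep -i level_zero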
Thank you. This is probably not the best place for it, but as an FYI:
CPU prompt eval: 75t/s
CPU eval: 27t/s
GPU prompt eval: 30t/s
GPU eval: 15t/s
So in my case, the CPU is definitely winning.
Exciting work nonetheless; I look forward to picking up an eGPU enclosure one of these days so I can offload to a discrete GPU! :)
This issue was closed because it has been inactive for 14 days since being marked as stale.
I have an Intel Flex 140 and I'm trying to run llama.cpp on native Ubuntu 22.04 LTS with kernel 6.5.0-26-generic, following https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md, but it fails with:
Error:
OS-Release:
./examples/sycl/build.sh output is attached as a log file, as it is too big to paste here.
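For reference, the build that ./examples/sycl/build.sh performs is roughly the following; the exact CMake option names depend on the llama.cpp version, so treat this as a sketch:
# configure with the oneAPI compilers and the SYCL backend enabled, then build
$ source /opt/intel/oneapi/setvars.sh
$ cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
$ cmake --build build --config Release -j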