ggerganov / llama.cpp

LLM inference in C/C++

Crash on Intel Data Center GPU Flex 140 running llama.cpp for SYCL #6382

Closed shailesh837 closed 5 months ago

shailesh837 commented 8 months ago

I have an Intel Flex 140 and I'm trying to run llama.cpp on native Ubuntu 22.04 LTS with kernel 6.5.0-26-generic, following https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md, but it fails with:

Error:

(llm_env) demo@emr-flex140:~/LLM_Demo/llama.cpp$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Log start
main: build = 2574 (b9102879)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed  = 1711710493
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 10 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 1|[level_zero:gpu:1]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 2|[level_zero:gpu:2]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 3|[level_zero:gpu:3]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 4|    [opencl:gpu:0]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 5|    [opencl:gpu:1]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 6|    [opencl:gpu:2]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 7|    [opencl:gpu:3]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 8|    [opencl:cpu:0]|               INTEL(R) XEON(R) PLATINUM 8570|       3.0|        224|    8192|     64|   269957541888|
| 9|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|        224|67108864|     64|   269957541888|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:128
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
Exception caught at file:/home/demo/LLM_Demo/llama.cpp/ggml-sycl.cpp, line:16240, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, data, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at /home/demo/LLM_Demo/llama.cpp/ggml-sycl.cpp:16240
GGML_ASSERT: /home/demo/LLM_Demo/llama.cpp/ggml-sycl.cpp:3035: !"SYCL error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

OS-Release:

(llm_env) demo@emr-flex140:~/LLM_Demo/llama.cpp$ cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

(llm_env) demo@emr-flex140:~/LLM_Demo/llama.cpp$ ./build/bin/ls-sycl-device
found 10 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 1|[level_zero:gpu:1]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 2|[level_zero:gpu:2]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 3|[level_zero:gpu:3]|            Intel(R) Data Center GPU Flex 140|       1.3|        128|    1024|     32|     5048471552|
| 4|    [opencl:gpu:0]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 5|    [opencl:gpu:1]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 6|    [opencl:gpu:2]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 7|    [opencl:gpu:3]|            Intel(R) Data Center GPU Flex 140|       3.0|        128|    1024|     32|     5048471552|
| 8|    [opencl:cpu:0]|               INTEL(R) XEON(R) PLATINUM 8570|       3.0|        224|    8192|     64|   269957541888|
| 9|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|        224|67108864|     64|   269957541888|

The output of ./examples/sycl/build.sh is attached as [sycl_build_sh.txt](https://github.com/ggerganov/llama.cpp/files/14803817/sycl_build_sh.txt), as it is too big to paste here.
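
For reference, the attached script roughly corresponds to the standard SYCL build of that era, per README-sycl.md (a sketch only; exact flags may have changed since, and newer trees use GGML_SYCL instead of LLAMA_SYCL):

$ source /opt/intel/oneapi/setvars.sh
$ mkdir -p build && cd build
# FP32 build, matching "GGML_SYCL_F16: no" in the log above
$ cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
$ cmake --build . --config Release -j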

zj040045 commented 7 months ago

The workaround works well for me. By default, mmap is used to read the model file. In some cases it causes runtime hang issues; disable it by passing --no-mmap to main if you run into this, as shown below.
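
For example, applied to the invocation from the original report, the workaround would look like this (a sketch; only --no-mmap is added, all other flags are unchanged):

$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf \
    -p "Building a website can be done in 10 simple steps:" \
    -n 400 -e -ngl 33 -sm none -mg 0 --no-mmap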

NeoZhangJianyu commented 7 months ago

@shailesh837 This is because only a single GPU is used in this case, and its memory is not enough to load the whole LLM.

  1. Please check that the GPU has more than 4 GB of free memory in this case.
  2. If possible, use multiple cards to share the LLM across GPUs: remove "-sm none" (see the sketch after this list).
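
A minimal sketch of the multi-GPU invocation, assuming the default layer split mode is acceptable (dropping "-sm none -mg 0" lets the SYCL backend spread the layers across all visible GPUs):

$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf \
    -p "Building a website can be done in 10 simple steps:" \
    -n 400 -e -ngl 33
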
khimaros commented 6 months ago

I'm seeing this same error on an Intel Iris Xe (Raptor Lake-P) built into my 13th Gen Intel Core i7-1360P.

I have 64 GB of system RAM, and it seems like the GPU should be permitted to use up to half of that.

ls-sycl-device shows 53705M available:

$ ./build/bin/ls-sycl-device 
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 53705M|         23.35.027191|
| 1|     [opencl:cpu:0]|           13th Gen Intel Core i7-1360P|    3.0|     16|    8192|   64| 67131M|2024.17.3.0.08_160000|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     16|67108864|   64| 67131M|2024.17.3.0.08_160000|

I tested with the tinyllama-1.1b model, which is the smallest I have:

$ du -hs models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf 
746M    models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf

Full dump of the invocation follows:

$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:" -e -ngl 33 -sm none -mg 0
Log start
main: build = 2806 (c780e753)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed  = 1715151706
llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = py007_tinyllama-1.1b-chat-v0.3
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32003]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32003]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32003]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q5_K:  135 tensors
llama_model_loader: - type q6_K:   21 tensors
llm_load_vocab: special tokens definition check successful ( 262/32003 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32003
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 745.12 MiB (5.68 BPW) 
llm_load_print_meta: general.name     = py007_tinyllama-1.1b-chat-v0.3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32002 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 53705M|         23.35.027191|
| 1|     [opencl:cpu:0]|           13th Gen Intel Core i7-1360P|    3.0|     16|    8192|   64| 67131M|2024.17.3.0.08_160000|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     16|67108864|   64| 67131M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:96
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:      SYCL0 buffer size =   702.15 MiB
llm_load_tensors:        CPU buffer size =    42.97 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    66.51 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     5.01 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 2
Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
NeoZhangJianyu commented 6 months ago

@khimaros You should install level-zero. The current error occurs because the run is on OpenCL, which is not supported right now.

Additionally, could you report this as a new issue? More supporters will see it easily there.
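
For reference, a minimal sketch of installing the level-zero runtime on Ubuntu 22.04, assuming Intel's graphics apt repository has already been configured per the driver guide referenced in README-sycl.md; the package names are taken from Intel's client GPU instructions and may differ for other distributions:

# Assumes Intel's graphics apt repository is already set up
# (see the GPU driver install guide referenced in README-sycl.md).
$ sudo apt update
$ sudo apt install -y level-zero intel-level-zero-gpu intel-opencl-icd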

khimaros commented 6 months ago

@NeoZhangJianyu is Level Zero available for iGPUs? I'm looking at the installation instructions, but they only mention data center and Arc GPUs.

khimaros commented 6 months ago

@NeoZhangJianyu thanks. It is working after installing based on the instructions here: https://dgpu-docs.intel.com/driver/client/overview.html

Maybe this can be added to README-sycl.md?

Performance seems to be slower than pure CPU though, so it may not be worth it on my specific hardware setup.

NeoZhangJianyu commented 6 months ago

@khimaros That's great!

  1. The Intel GPU driver install guide link has already been added to README-sycl.md.
  2. level-zero is required for Intel GPUs and is mentioned in README-sycl.md:
    "When targeting an intel GPU, the user should expect one or more level-zero devices among the available SYCL devices. Please make sure that at least one GPU is present, for instance [ext_oneapi_level_zero:gpu:0] in the sample output below."
    You can verify this with sycl-ls, as sketched after this list.
  3. The CPU path is optimized with AVX2/AVX512 and also works well for LLMs now. Whether to use the CPU or the GPU depends on your case; your iGPU is actually weaker than your CPU. :)
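
A quick way to verify this, assuming the oneAPI environment has been sourced (device strings vary by driver, but at least one level_zero GPU entry should be present):

$ source /opt/intel/oneapi/setvars.sh
$ sycl-ls
# Expect output containing a line similar to:
# [ext_oneapi_level_zero:gpu:0] Intel(R) ... Graphics ... [level_zero driver]
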
khimaros commented 6 months ago

Thank you. This is probably not the best place for it, but as an FYI:

CPU prompt eval: 75t/s
CPU eval: 27t/s

GPU prompt eval: 30t/s
GPU eval: 15t/s

So in my case, the CPU is definitely winning.

Exciting work nonetheless, and I look forward to picking up an eGPU enclosure one of these days to offload to a discrete GPU! :)
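
For reference, one way to reproduce this kind of CPU vs. GPU comparison is llama.cpp's llama-bench tool; a minimal sketch using the same model as above (-ngl 0 keeps everything on the CPU, -ngl 99 offloads all layers to the SYCL device):

# CPU-only baseline
$ ./build/bin/llama-bench -m models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf -ngl 0
# Full offload to the SYCL GPU backend
$ ZES_ENABLE_SYSMAN=1 ./build/bin/llama-bench -m models/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf -ngl 99

The pp (prompt processing) and tg (text generation) rates it prints correspond to the "prompt eval" and "eval" numbers above.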

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.