[SYCL] GGML_ASSERT issue when running llama.cpp with SYCL on A770 #5513

ggerganov / llama.cpp

Closed aahouzi closed 3 months ago

aahouzi commented 7 months ago

Current Behavior:

Built llama.cpp with the SYCL backend for Windows by following the instructions in README-sycl.md.
The build completes successfully, and the model conversion and everything else works fine.
When running main, the program aborts on a GGML_ASSERT. I tried to debug it, and it seems that when the function get_device_index_by_id is called, the returned id is -1; the error then occurs when the assert statement GGML_ASSERT(res>=0); finds res=-1. My device number is 5, as you can see in the logs.
cc @airMeng @NeoZhangJianyu. I tried all the tricks listed under known issues in README-sycl.md, but none of them led anywhere.
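
For readers following the failure, here is a minimal Python sketch of the kind of lookup that could return -1. It is not the code in ggml-sycl.cpp: the list `registered_ids`, and the idea that the backend registers fewer devices than it prints at startup, are assumptions made purely for illustration. Only the function name and the `res >= 0` check come from the report above.

```python
# Hypothetical illustration only; this is NOT the implementation in ggml-sycl.cpp.

def get_device_index_by_id(registered_ids, device_id):
    """Return the position of device_id in registered_ids, or -1 if it is absent."""
    for index, registered in enumerate(registered_ids):
        if registered == device_id:
            return index
    return -1


# Assumption: the backend registered only device 0, although six devices were listed.
registered_ids = [0]

res = get_device_index_by_id(registered_ids, 5)
print(res)  # -1, so a check equivalent to GGML_ASSERT(res >= 0) would abort here
```

If the real lookup behaves like this, the assertion would point at the selected id never having been registered rather than at the model or its conversion, but that is not confirmed here.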

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=5 && build\bin\main.exe -m %LLAMA2%\ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap -ngl 33 --ignore-eos
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 1708016072
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 5 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\intel\.cache\huggingface\hub\models--meta-llama--Llama-2-7b-chat-hf\snapshots\c1b0db933684edbfe29a06fa47eb19cc48025e93\ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
GGML_ASSERT: C:/Users/intel/Desktop/aahouzi/llama.cpp/ggml-sycl.cpp:9364: res>=0

Steps To Reproduce:

Same steps as in README-sycl.md

Environment:

OS: Win11
HW: Intel ARC A770 dGPU

airMeng commented 7 months ago

Have you tried GGML_SYCL_DEVICE=3?

This is weird, because the dGPU usually appears as the first device, but in your case it shows up as devices 3 and 5. Can you run the following and paste the output here?

source /PATH/TO/ONEAPI/setvars.sh
sycl-ls

I guess the issue is that you are selecting an OpenCL device in oneAPI, but we have only fully verified Level Zero (which should usually be the first, default device).

aahouzi commented 7 months ago

Yes, I tried GGML_SYCL_DEVICE=3, but same issue here.

I got this:

C:\Users\intel>sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO  [31.0.101.5186]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.28044]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28044]

and when I run the sycl device executable, I get this:

C:\Users\intel\Desktop\aahouzi\llama.cpp>build\bin\ls-sycl-device.exe
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392

I don't think I'm selecting the OpenCL device in oneAPI; the logs clearly show that this is level_zero. In your PR #5208 you got the Windows build working, but did you try running it on multiple Windows platforms to make sure it functions properly on Windows?
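
To make the duplication in the two listings easier to see, the sycl-ls output above can be grouped by backend. This is a small Python sketch; the regular expression is an assumption based only on the six lines pasted in this comment, not on a documented sycl-ls format, and the helper names are made up for the example.

```python
import re
from collections import defaultdict

# Assumed line shape: "[backend:type:index] platform, device description"
SYCL_LS_LINE = re.compile(
    r"^\[(?P<backend>[^:\]]+):(?P<type>[^:\]]+):(?P<index>\d+)\]\s+"
    r"(?P<platform>[^,]+),\s+(?P<rest>.*)$"
)

SAMPLE = """
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO  [31.0.101.5186]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.28044]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28044]
"""


def group_by_backend(sycl_ls_output):
    """Map each backend name to a list of (device type, index, description)."""
    groups = defaultdict(list)
    for line in sycl_ls_output.splitlines():
        match = SYCL_LS_LINE.match(line.strip())
        if match:  # blank or unrecognised lines are skipped
            groups[match["backend"]].append(
                (match["type"], int(match["index"]), match["rest"])
            )
    return dict(groups)


def backends_exposing(groups, name):
    """List the backends that have at least one device whose description contains name."""
    return [
        backend
        for backend, devices in groups.items()
        if any(name in description for _, _, description in devices)
    ]


groups = group_by_backend(SAMPLE)
print(backends_exposing(groups, "A770"))
```

On this machine both backends list the A770, which appears to match the two A770 entries (compute capability 3.0 and 1.3) in the ls-sycl-device output.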

NeoZhangJianyu commented 7 months ago

Device selection has an issue when an iGPU and an Arc GPU are in the same PC. It has been fixed in the multiple-GPU support feature, but that feature is still in progress and has not been merged.

Please try this:

export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
export GGML_SYCL_DEVICE=0   # or 1

Thank you!
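
Since the commands above use POSIX `export` while the reporter is on Windows, one shell-independent way to apply the same two variables is to pass them through the child process environment. The sketch below is in Python; the helper name `sycl_env` is hypothetical, and the only things taken from the thread are the two variable names and their values. In cmd.exe, `set NAME="value"` stores the quotes as part of the value and `set NAME=1 && next` stores a trailing space; passing a dict avoids both. Whether either detail affects this assertion is not established in the thread.

```python
import os


def sycl_env(device_index, selector="level_zero:gpu", base=None):
    """Return a copy of the environment with the two SYCL variables set.

    Values are stored exactly as given: no surrounding quotes, no trailing space.
    """
    env = dict(os.environ if base is None else base)
    env["ONEAPI_DEVICE_SELECTOR"] = selector
    env["GGML_SYCL_DEVICE"] = str(device_index)
    return env


# Example use (not executed here; the model path is a placeholder):
#   import subprocess
#   subprocess.run(
#       [r"build\bin\main.exe", "-m", "path/to/model.gguf", "-ngl", "33"],
#       env=sycl_env(1),
#   )
env = sycl_env(1, base={})
print(env)
```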

aahouzi commented 7 months ago

After the change, only level_zero devices are being shown but the issue is still there:

C:\Users\intel\Desktop\aahouzi\llama.cpp>set ONEAPI_DEVICE_SELECTOR="level_zero:gpu" && set GGML_SYCL_DEVICE=1 && build\bin\main.exe -m %LLAMA2%\ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap -ngl 33 --ignore-eos
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 1708337936
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 2 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 1 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\intel\.cache\huggingface\hub\models--meta-llama--Llama-2-7b-chat-hf\snapshots\c1b0db933684edbfe29a06fa47eb19cc48025e93\ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
GGML_ASSERT: C:/Users/intel/Desktop/aahouzi/llama.cpp/ggml-sycl.cpp:9364: res>=0

NeoZhangJianyu commented 7 months ago

How about device_id=0? I think that has been supported. Maybe some new code broke it. Could you try an older release, such as https://github.com/jordankanter/llama.cpp/commit/8c4aa67ff9b366b0dcce760d8b0c91b77f95f2fe?

aahouzi commented 7 months ago

How about device_id=0? I think that has been supported. Maybe some new code broke it.

Tried with device 0, this time there is no GGML_ASSERT issue but the execution just hangs with no output. I tried adding --no-mmap option but the issue is still there:

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m %LLAMA2% -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 --no-mmap
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 0 (Intel(R) UHD Graphics 770) as main device
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB

My iGPU has 7.8GB of memory, and I think loading llama2-7B-Q4_0 requires about 3.9GB, so I should be fine? When I monitored activity, iGPU usage was near 1% and memory usage went up to 4.2GB. I also tried changing the number of layers offloaded to the iGPU, but that didn't change anything.
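
The estimate can be cross-checked against the numbers in the logs. The Python sketch below assumes the f16 KV cache holds n_layer * n_ctx * n_embd values of 2 bytes each for K, and the same again for V; that formula is an assumption, chosen because it reproduces the 128 MiB + 128 MiB reported above. The 77.55 MiB compute buffer is taken from the build 2032 log later in this thread, since the hanging run stops before printing it.

```python
MIB = 1024 * 1024


def kv_cache_bytes(n_layer, n_ctx, n_embd_kv, bytes_per_element=2):
    """Size of the K and V caches together, assuming both have the same shape."""
    per_tensor = n_layer * n_ctx * n_embd_kv * bytes_per_element
    return 2 * per_tensor  # one tensor for K, one for V


# Values from the logs: n_layer = 32, n_ctx = 512, n_embd_k_gqa = n_embd_v_gqa = 4096
kv_mib = kv_cache_bytes(32, 512, 4096) / MIB

weights_mib = 3577.56  # "llm_load_tensors: buffer size" with all 33 layers offloaded
compute_mib = 77.55    # "compute buffer size" from the build 2032 run

total_mib = weights_mib + kv_mib + compute_mib
print(f"KV cache: {kv_mib:.2f} MiB")                                   # 256.00 MiB
print(f"total:    {total_mib:.2f} MiB ({total_mib / 1024:.2f} GiB)")   # about 3.82 GiB
```

This lands close to the 3.9GB guess, so the device-side allocation alone does not obviously exceed 7.8GB; it says nothing about why the run hangs.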

Could you try an older release, such as https://github.com/jordankanter/llama.cpp/commit/8c4aa67ff9b366b0dcce760d8b0c91b77f95f2fe?

With this release (https://github.com/jordankanter/llama.cpp/commit/8c4aa67ff9b366b0dcce760d8b0c91b77f95f2fe), the GGML_ASSERT issue is still there for the A770. The iGPU works this time, but it doesn't generate any text: it just shows the prompt and stops there (iGPU usage was 96%):

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m %LLAMA2% -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2032 (8c4aa67)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 0 (Intel(R) UHD Graphics 770) as main device
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:            compute buffer size =    77.55 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1:
llama_print_timings:        load time =    8526.53 ms
llama_print_timings:      sample time =      21.66 ms /   400 runs   (    0.05 ms per token, 18462.96 tokens per second)
llama_print_timings: prompt eval time =    2613.48 ms /    19 tokens (  137.55 ms per token,     7.27 tokens per second)
llama_print_timings:        eval time =  124352.76 ms /   399 runs   (  311.66 ms per token,     3.21 tokens per second)
llama_print_timings:       total time =  127047.39 ms /   418 tokens
Log end

When offloading only a few layers to the iGPU (-ngl 15), generation starts but the output is gibberish:

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m %LLAMA2% -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 15 -s 0
Log start
main: build = 2032 (8c4aa67)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 0 (Intel(R) UHD Graphics 770) as main device
...
llm_load_tensors: offloading 15 repeating layers to GPU
llm_load_tensors: offloaded 15/33 layers to GPU
llm_load_tensors:            buffer size =  1628.91 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   120.00 MiB
llama_kv_cache_init:        CPU KV buffer size =   136.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:            compute buffer size =    74.80 MiB
llama_new_context_with_model:        CPU compute buffer size =    77.55 MiB
llama_new_context_with_model: graph splits (measure): 5

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1:"
$▅▅!

 #
␦#
␦▅▅▅#""$
$
␦
$

!
!"▅
# [end of text]

llama_print_timings:        load time =    6108.94 ms
llama_print_timings:      sample time =      16.16 ms /   168 runs   (    0.10 ms per token, 10393.47 tokens per second)
llama_print_timings: prompt eval time =    1569.78 ms /    19 tokens (   82.62 ms per token,    12.10 tokens per second)
llama_print_timings:        eval time =   35977.66 ms /   167 runs   (  215.44 ms per token,     4.64 tokens per second)
llama_print_timings:       total time =   37609.83 ms /   186 tokens
Log end

aahouzi commented 7 months ago

@NeoZhangJianyu On a different note: when I try the latest llama.cpp on an MTL iGPU on Windows, the code hangs with no output. When I switch to https://github.com/jordankanter/llama.cpp/commit/8c4aa67ff9b366b0dcce760d8b0c91b77f95f2fe, I can run llama2-7B-Q4_0.gguf fully on the iGPU (only 5.8GB) with really good text output quality:

C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m ..\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2032 (8c4aa67)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) Graphics,  compute capability 1.3,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 1132294144
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 22,   max work group size 67108864,   max sub group size 64,  global mem size 3961389056
  Device 2: Intel(R) Core(TM) Ultra 7 165H,     compute capability 3.0,
        max compute_units 22,   max work group size 8192,       max sub group size 64,  global mem size 3961389056
  Device 3: Intel(R) Arc(TM) Graphics,  compute capability 3.0,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 1132294144
Using device 0 (Intel(R) Arc(TM) Graphics) as main device
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:            compute buffer size =    77.55 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 11 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher in search engine results. This can be done by improving its SEO, which stands for “search engine optimization.” It’s the process of increasing a website’s visibility in online searches. You’ll want to do this because if people can’t find you when they type certain keywords into Google or Bing, then there’s no point in having them visit your site!
There are many ways you could go about optimizing your site for better SE
llama_print_timings:        load time =   17439.87 ms
llama_print_timings:      sample time =      68.66 ms /   400 runs   (    0.17 ms per token,  5826.06 tokens per second)
llama_print_timings: prompt eval time =    1441.73 ms /    19 tokens (   75.88 ms per token,    13.18 tokens per second)
llama_print_timings:        eval time =   52801.55 ms /   399 runs   (  132.33 ms per token,     7.56 tokens per second)
llama_print_timings:       total time =   54530.25 ms /   418 tokens
Log end

Based on this, I think the theory that some code change broke support might actually be true.
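To compare runs like the one above across commits, the `llama_print_timings` lines can be parsed and the throughput recomputed. The following is only a rough sketch: the helper names `parse_timings` and `tokens_per_second` are made up for illustration and are not part of llama.cpp, and the regular expression assumes the exact log layout shown in this thread.

```python
import re

# Matches lines such as:
#   llama_print_timings:        eval time =   52801.55 ms /   399 runs   (...)
# The "/ N runs|tokens" part is optional because the load-time line lacks it.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {phase: (milliseconds, count or None)} for each timing line."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            count = int(match.group("count")) if match.group("count") else None
            timings[match.group("name").strip()] = (float(match.group("ms")), count)
    return timings


def tokens_per_second(ms, count):
    """Throughput implied by a (milliseconds, count) pair."""
    return count / (ms / 1000.0)
```

Feeding two logs through `parse_timings` and comparing the `eval` entries gives a quick way to see whether a given commit changed generation speed on the same device.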

airMeng commented 7 months ago

https://github.com/ggerganov/llama.cpp/pull/5624

@aahouzi I think the "hanging" issue has been solved by the above PR. Did you use this commit?

aahouzi commented 7 months ago

@airMeng Using the latest branch with the #5624 changes eliminates the hang. However, I'm now in the same situation as with https://github.com/jordankanter/llama.cpp/commit/8c4aa67ff9b366b0dcce760d8b0c91b77f95f2fe: when offloading all layers the model generates nothing, and when offloading only a few layers the output is gibberish.

For the Arc A770, the GGML_ASSERT issue is still there though xd.

mudler commented 7 months ago

Cannot replicate this. I'm testing with 201294ae177b308fb3a99dc504dd6d27e8afa907 on my Intel Arc A770 and everything works as expected.

aahouzi commented 7 months ago

Device selection has an issue when an iGPU and an Arc GPU are in the same PC. It has been fixed in the multiple-GPU support feature, but that feature is still in progress and not merged.

I think my issue is probably related to what @NeoZhangJianyu mentioned above. @mudler are you running the same setup?

mudler commented 7 months ago

I have an AMD CPU.

aahouzi commented 7 months ago

@NeoZhangJianyu @airMeng I tried 2 other Windows systems, each with an A770/A770M plus an Intel Iris Xe Graphics iGPU, and I successfully reproduced this issue on both. This needs deeper investigation to find out what's going on. All of our systems have an iGPU, so this will become a blocker sooner or later.

Also, I got access to an AMD Ryzen 9 CPU with an A770 card, and I can confirm it runs out of the box without this issue.

NeoZhangJianyu commented 6 months ago

@aahouzi Could you try with the latest code? Multi-card support has been merged.

aahouzi commented 6 months ago

@NeoZhangJianyu I saw you created a revert PR. Is #5901 merged, or is there no change yet?

airMeng commented 6 months ago

It was merged by mistake. The author will re-implement it using a different method but with the same effect. You can try #5901 locally.

aahouzi commented 6 months ago

@airMeng I think I'll wait until it's re-implemented

sgwhat commented 6 months ago

Hi all @airMeng @NeoZhangJianyu, I'm also running into similar trouble with this issue.

GPU device: Arc A770
System: Ubuntu

sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]

I used ollama to run llama.cpp SYCL inference with this PR https://github.com/ollama/ollama/pull/2458/files, but I got the error below:

ollama run example "What is your favourite condiment?"
 !##"##!       "!▅

        ▅
 "! $   #"# ##  ▅"#!

It's weird that this PR works on Arch Linux but not on Ubuntu. Could you please give me some debugging advice?
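When one distribution works and another does not, a reasonable first step is to compare the `sycl-ls` output from both machines, since the driver versions in the trailing brackets may differ. Below is a small sketch that parses lines in the format shown above. The function names `parse_sycl_ls` and `drivers_by_selector` are hypothetical helpers written for this illustration, and the regular expression assumes the line layout in this thread; other oneAPI releases may print it differently.

```python
import re

# One sycl-ls line looks like:
#   [backend:type:index] Platform name, Device description [driver version]
SYCL_LS_RE = re.compile(
    r"^\[(?P<backend>\w+):(?P<type>\w+):(?P<index>\d+)\]\s+"
    r"(?P<platform>[^,]+),\s+(?P<device>.*?)\s+\[(?P<driver>[^\]]+)\]$"
)


def parse_sycl_ls(text):
    """Parse sycl-ls output into a list of dicts, skipping unrelated lines."""
    devices = []
    for line in text.splitlines():
        match = SYCL_LS_RE.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["index"] = int(entry["index"])
            devices.append(entry)
    return devices


def drivers_by_selector(devices):
    """Map 'backend:type:index' to the reported driver version string."""
    return {
        "{backend}:{type}:{index}".format(**d): d["driver"] for d in devices
    }
```

Running `drivers_by_selector(parse_sycl_ls(...))` on the output from each machine and comparing the two dictionaries shows at a glance whether the Level Zero or OpenCL driver versions differ.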

airMeng commented 6 months ago

@aahouzi @sgwhat Can you try https://github.com/ggerganov/llama.cpp/pull/6006?

sgwhat commented 6 months ago

Hi @airMeng, it still doesn't work. I think my bug is really weird (it works well on Arch Linux but not on Ubuntu) ☹

time=2024-03-12T20:06:03.311+08:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-03-12T20:06:03.311+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-12T20:06:03.346+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 oneapi cpu]"
time=2024-03-12T20:06:03.346+08:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-12T20:06:03.346+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-12T20:06:03.350+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27191.42]"
time=2024-03-12T20:06:03.358+08:00 level=INFO source=gpu.go:130 msg="Intel GPU detected"
time=2024-03-12T20:06:03.358+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GIN] 2024/03/12 - 20:06:08 | 200 |      18.859µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/12 - 20:06:10 | 200 |      30.154µs |       127.0.0.1 | HEAD     "/api/blobs/sha256:8bd3a3006c4f7aace054efdd717e4b86a05b521a83dc460a9640d7b2f179bf09"
[GIN] 2024/03/12 - 20:06:13 | 200 |  3.154779213s |       127.0.0.1 | POST     "/api/create"
[GIN] 2024/03/12 - 20:06:24 | 200 |      10.269µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/12 - 20:06:24 | 200 |     120.482µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama2419540237/oneapi/libext_server.so
time=2024-03-12T20:06:24.816+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2419540237/oneapi/libext_server.so"
time=2024-03-12T20:06:24.816+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
|  |                  |                                             |compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 2|    [opencl:cpu:0]|         13th Gen Intel(R) Core(TM) i9-13900K|       3.0|         32|    8192|     64|    67143290880|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         32|67108864|     64|    67143290880|
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/arda/.ollama/models/blobs/sha256:8bd3a3006c4f7aace054efdd717e4b86a05b521a83dc460a9640d7b2f179bf09 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host input buffer size   =    13.02 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   164.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
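
The `get_memory_info` warning in the log above asks for `ZES_ENABLE_SYSMAN=1` to be set so that free memory can be queried. One way to do that from a launcher script is sketched below. The helper names `sycl_env` and `run_with_sysman` are hypothetical, and whether this affects the garbled output is not established in this thread; the sketch only addresses the warning.

```python
import os
import subprocess


def sycl_env(base=None):
    """Return a copy of the environment with ZES_ENABLE_SYSMAN=1 set."""
    env = dict(os.environ if base is None else base)
    env["ZES_ENABLE_SYSMAN"] = "1"
    return env


def run_with_sysman(cmd):
    """Run cmd (a list of arguments) with the variable set for the child only."""
    return subprocess.run(cmd, env=sycl_env(), check=False)


# Example (the command is a placeholder, adjust to your binary and arguments):
# run_with_sysman(["./build/bin/main", "-m", "model.gguf", "-ngl", "33"])
```

Setting the variable only for the child process keeps the change local to the run being debugged.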

NeoZhangJianyu commented 6 months ago

@sgwhat The log above is incomplete. There is a delay while the code loads. Please wait 1-2 minutes.

NeoZhangJianyu commented 6 months ago

@sgwhat The garbled output is caused by errors in the OPs. You could rebase your project onto the latest GGML lib.

sgwhat commented 6 months ago

The result is due to the error of OPs.

You could rebase with latest GGML lib in your project.

Sorry to bother you, but may I ask what "OPs" refers to here? Is using the latest GGML lib the same as building the latest llama.cpp?

NeoZhangJianyu commented 6 months ago

@sgwhat

It's hard to know which OPs lead to the wrong result without a deeper check.
We only support llama.cpp issues. It looks like your issue happens in the project https://github.com/ollama/ollama, which we are not familiar with.

Suggestion:

1. Report a new issue for your case.
2. Please run the llama.cpp example according to the guide: https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md. It can be used to verify your hardware and software environment.
3. If step 2 passes, please rebase your project onto the llama.cpp/ggml version that was verified in step 2. I guess your project should pass too, as long as there are no further changes to llama.cpp/ggml.
4. If step 2 fails, please provide the whole log file and we will check the issue.

Thank you!

sgwhat commented 6 months ago

Step 2 failed for me, and I opened a new issue for it: https://github.com/ggerganov/llama.cpp/issues/6036.

NeoZhangJianyu commented 6 months ago

@aahouzi Is your issue still present with the latest code?

aahouzi commented 6 months ago

@NeoZhangJianyu I'm tracking your PR, and #6073 still isn't merged, so I don't think it will work.

I see that it's been merged; I'll run my tests and keep you updated ;-)

aahouzi commented 6 months ago

@NeoZhangJianyu @airMeng I ran my tests. It's working, but not as it should. I have an iGPU + A770M, and whatever id I pick for GGML_SYCL_DEVICE, it always uses my A770M, so I can't use my iGPU even if I want to.
For example, here I want to use my iGPU, which has id=0. When I pick it, the run goes to the A770M instead. The bigger problem is that the same happens for every id I pick ^^'

C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2447 (c47cf414)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|                 Intel(R) Iris(R) Xe Graphics|       1.3|         96|     512|     32|     3097038848|
| 1|[level_zero:gpu:1]|              Intel(R) Arc(TM) A770M Graphics|       1.3|        512|    1024|     32|     3819835392|
| 2|    [opencl:gpu:0]|              Intel(R) Arc(TM) A770M Graphics|       3.0|        512|    1024|     32|     3819835392|
| 3|    [opencl:gpu:1]|                 Intel(R) Iris(R) Xe Graphics|       3.0|         96|     512|     32|     3097038848|
| 4|    [opencl:cpu:0]|         12th Gen Intel(R) Core(TM) i7-12700H|       3.0|         20|    8192|     64|     3846729728|
| 5|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         20|67108864|     64|     3846729728|
...
ggml_backend_sycl_set_mul_device_mode: true
+ detect 1 SYCL GPUs: [1] with top Max compute units:512 (A770M and not iGPU)
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL1 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL1 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =    70.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph splits: 2

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1

 Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher on search engines like Google when someone searches for information related to what you offer. This includes creating content that is optimized for SEO, making sure that each page has a meta description and keyword tags (if applicable), and ensuring that all images have alt text descriptions attached to them. You should also link out from other websites where appropriate—this helps build authority with search engines while simultaneously giving users relevant information about topics they might be interested in reading more about later on down
llama_print_timings:        load time =    8562.90 ms
llama_print_timings:      sample time =      47.18 ms /   400 runs   (    0.12 ms per token,  8478.53 tokens per second)
llama_print_timings: prompt eval time =     242.04 ms /    19 tokens (   12.74 ms per token,    78.50 tokens per second)
llama_print_timings:        eval time =   20663.94 ms /   399 runs   (   51.79 ms per token,    19.31 tokens per second)
llama_print_timings:       total time =   21107.17 ms /   418 tokens
Log end
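
The line `detect 1 SYCL GPUs: [1] with top Max compute units:512` suggests that, in this mode, the backend keeps only the Level Zero GPUs with the highest compute-unit count. The sketch below reproduces that outcome from the device table printed in the log. It is an inference from the log text, not the actual ggml-sycl code; `parse_device_table` and `top_level_zero_gpus` are hypothetical names, and the sketch deliberately never reads GGML_SYCL_DEVICE, which would be consistent with the behaviour reported here.

```python
def parse_device_table(log_text):
    """Extract the rows of the 'found N SYCL devices' table from a log."""
    devices = []
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have 8 cells and a numeric ID; header rows do not.
        if len(cells) != 8 or not cells[0].isdigit():
            continue
        devices.append({
            "id": int(cells[0]),
            "type": cells[1],
            "name": cells[2],
            "max_compute_units": int(cells[4]),
        })
    return devices


def top_level_zero_gpus(devices):
    """Assumed heuristic: keep Level Zero GPUs with the most compute units."""
    gpus = [d for d in devices if d["type"].startswith("[level_zero:gpu")]
    if not gpus:
        return []
    top = max(d["max_compute_units"] for d in gpus)
    return [d["id"] for d in gpus if d["max_compute_units"] == top]
```

Applied to the table above, the Iris Xe (96 units) and the A770M (512 units) are the two Level Zero GPUs, so only id 1 survives, matching the `[1]` in the log.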

NeoZhangJianyu commented 6 months ago

@aahouzi Good, the result above shows that it works well. The bug in setting the GPU was fixed last week. Please use the latest code.

To set the GPU, please refer to the script:

./examples/sycl/run-llama2.sh 0
./examples/sycl/run-llama2.sh 1

aahouzi commented 6 months ago

@NeoZhangJianyu I'm using the latest code, and the issue is still there ;)

NeoZhangJianyu commented 6 months ago

@aahouzi Could you provide the whole log, including the command?

aahouzi commented 6 months ago

@NeoZhangJianyu here is the whole log including cmd:

C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2447 (c47cf414)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|                 Intel(R) Iris(R) Xe Graphics|       1.3|         96|     512|     32|     3097038848|
| 1|[level_zero:gpu:1]|              Intel(R) Arc(TM) A770M Graphics|       1.3|        512|    1024|     32|     3819835392|
| 2|    [opencl:gpu:0]|              Intel(R) Arc(TM) A770M Graphics|       3.0|        512|    1024|     32|     3819835392|
| 3|    [opencl:gpu:1]|                 Intel(R) Iris(R) Xe Graphics|       3.0|         96|     512|     32|     3097038848|
| 4|    [opencl:cpu:0]|         12th Gen Intel(R) Core(TM) i7-12700H|       3.0|         20|    8192|     64|     3846729728|
| 5|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         20|67108864|     64|     3846729728|
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [1] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL1 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL1 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =    70.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph splits: 2

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1

 Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher on search engines like Google when someone searches for information related to what you offer. This includes creating content that is optimized for SEO, making sure that each page has a meta description and keyword tags (if applicable), and ensuring that all images have alt text descriptions attached to them. You should also link out from other websites where appropriate—this helps build authority with search engines while simultaneously giving users relevant information about topics they might be interested in reading more about later on down
llama_print_timings:        load time =    8855.14 ms
llama_print_timings:      sample time =      47.40 ms /   400 runs   (    0.12 ms per token,  8439.71 tokens per second)
llama_print_timings: prompt eval time =     250.88 ms /    19 tokens (   13.20 ms per token,    75.73 tokens per second)
llama_print_timings:        eval time =   20629.34 ms /   399 runs   (   51.70 ms per token,    19.34 tokens per second)
llama_print_timings:       total time =   21079.05 ms /   418 tokens
Log end
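
As a sanity check on the `llama_print_timings` lines above, the per-token figures can be recomputed from the raw totals. This is a minimal sketch, not part of llama.cpp; the helper name `per_token_stats` is made up for illustration, and the inputs are copied from the prompt eval and eval lines of the log.

```python
def per_token_stats(total_ms, count):
    """Return (ms per token, tokens per second) for one timing line."""
    ms_per_token = total_ms / count
    tokens_per_second = count / (total_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Values taken from the log: total time in ms and number of tokens/runs.
prompt_ms, prompt_tps = per_token_stats(250.88, 19)
eval_ms, eval_tps = per_token_stats(20629.34, 399)

print(f"prompt eval: {prompt_ms:.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"eval:        {eval_ms:.2f} ms/token, {eval_tps:.2f} tokens/s")
```

Both lines reproduce the figures the log reports (13.20 ms and 75.73 tokens/s for the prompt, 51.70 ms and 19.34 tokens/s for generation), so the roughly 19 tokens per second is the steady-state generation speed on this device.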

SergioVargasRamirez commented 4 months ago

Could you please give some details about your config? It seems I have a similar system, but in my case it is not working. I am using openSUSE Tumbleweed.

thanks in advance,

Sergio

Device selection is broken when an iGPU and an Arc GPU are in the same PC. This has been fixed in the multi-GPU support feature, but that feature is still in progress and has not been merged yet.
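
One way a device lookup can end up returning -1 on a mixed iGPU and Arc system is sketched below. This is an illustration under stated assumptions, not llama.cpp's actual code: the function names, the device ids, and the "keep only the devices with the highest compute-unit count" rule are hypothetical, loosely modelled on the `detect 1 SYCL GPUs: [1] with top Max compute units:512` line in the log.

```python
def select_top_gpus(devices):
    """Keep only the ids whose compute-unit count equals the maximum (assumed rule)."""
    top = max(cu for _, cu in devices)
    return [dev_id for dev_id, cu in devices if cu == top]


def device_index_by_id(selected_ids, dev_id):
    """Return the position of dev_id in the selected list, or -1 if it was filtered out."""
    try:
        return selected_ids.index(dev_id)
    except ValueError:
        return -1


# Hypothetical (id, max compute units) pairs: an iGPU and an Arc GPU.
devices = [(0, 32), (1, 512)]

selected = select_top_gpus(devices)
idx_arc = device_index_by_id(selected, 1)   # id kept by the filter
idx_igpu = device_index_by_id(selected, 0)  # id dropped by the filter

print(selected, idx_arc, idx_igpu)
```

In this sketch, asking for an id that the filter dropped yields -1, which is the kind of value an assertion such as `GGML_ASSERT(res>=0)` would reject.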

I think my issue is probably related to what @NeoZhangJianyu mentioned above. @mudler, are you in the same situation?

I have an AMD CPU.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.