ggerganov / llama.cpp

LLM inference in C/C++

Bug: Lower performance in SYCL vs IPEX LLM. #9505

Open adi-lb-phoenix opened 2 days ago

adi-lb-phoenix commented 2 days ago

What happened?

We expected llama.cpp to perform similarly to ipex-llm, but with all parameters the same, llama.cpp was almost two times slower at generation: 17.46 tokens per second in the eval phase versus 35.30 tokens per second for ipex-llm.

Result from ipex-llm:

llama_print_timings:        load time =    7797.13 ms
llama_print_timings:      sample time =      30.64 ms /   400 runs   (    0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time =    1322.78 ms /    13 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   11301.98 ms /   399 runs   (   28.33 ms per token,    35.30 tokens per second)
llama_print_timings:       total time =   12711.93 ms /   412 tokens

Below is the result from llama.cpp:

llama_perf_sampler_print:    sampling time =      31.73 ms /   413 runs   (    0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print:        load time =    4317.89 ms
llama_perf_context_print: prompt eval time =     456.68 ms /    13 tokens (   35.13 ms per token,    28.47 tokens per second)
llama_perf_context_print:        eval time =   22846.95 ms /   399 runs   (   57.26 ms per token,    17.46 tokens per second)
llama_perf_context_print:       total time =   23379.98 ms /   412 tokens

Name and Version

The output below is from llama.cpp:

./build/bin/llama-cli --version
version: 3769 (d54c21df) built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu

The output below is from ipex-llm:

./llama-cli --version
version: 1 (ce3a83b) built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

No response

Relevant log output

The output below is from llama.cpp:

 ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf  -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0 -ub 2048
build: 3769 (d54c21df) with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 27 key-value pairs and 291 tensors from models/Meta-Llama-3-8B-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 8192
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                          general.file_type u32              = 7
llama_model_loader: - kv  16:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  7605.33 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB
.........................................................................................
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 730|    1.3|     32|     512|   32| 30897M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  2240.02 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    96.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

sampler seed: 2963972981
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 400, n_keep = 1

Building a website can be done in 10 simple steps: Plan, Design, Build, Test, Launch, Maintain, Optimize, Monitor, Update, and Refine. Here's a detailed guide to help you get started.

Step 1: Plan

* Define your website's purpose and goals.
* Identify your target audience and their needs.
* Research your competitors and their websites.
* Create a sitemap and wireframe to visualize your website's structure and layout.

Step 2: Design

* Create a visual design concept for your website, including color scheme, typography, and imagery.
* Design a custom logo and icon for your website.
* Create a consistent visual design style throughout your website.

Step 3: Build

* Choose a website building platform or CMS (Content Management System) that best fits your needs.
* Create a functional prototype of your website using HTML, CSS, and JavaScript.
* Implement responsive design to ensure your website is accessible on various devices and screen sizes.

Step 4: Test

* Conduct usability testing to identify and fix any usability issues.
* Test your website for compatibility with different browsers and devices.
* Check your website for accessibility and compliance with web standards.

Step 5: Launch

* Launch your website and make it live for the public to access.
* Configure your website's settings, such as analytics and search engine optimization (SEO).

Step 6: Maintain

* Regularly update your website's content to keep it fresh and relevant.
* Monitor your website's performance and fix any issues that arise.
* Back up your website's data and files regularly to ensure data integrity.

Step 7: Optimize

* Optimize your website's performance, speed, and security.
* Conduct A/B testing and analytics to improve your website's conversion rates.
* Continuously monitor and improve your website's search engine ranking.

Step 8: Monitor

* Monitor your website's traffic and engagement metrics to identify trends and areas for improvement.
* Track your website's conversion rates and make data-driven decisions to

llama_perf_sampler_print:    sampling time =      31.73 ms /   413 runs   (    0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print:        load time =    4317.89 ms
llama_perf_context_print: prompt eval time =     456.68 ms /    13 tokens (   35.13 ms per token,    28.47 tokens per second)
llama_perf_context_print:        eval time =   22846.95 ms /   399 runs   (   57.26 ms per token,    17.46 tokens per second)
llama_perf_context_print:       total time =   23379.98 ms /   412 tokens

The output below is from ipex-llm:
ZES_ENABLE_SYSMAN=1 ./llama-cli -m ~/llama.cpp/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf  -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Log start
main: build = 1 (ce3a83b)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed  = 1726477650
llama_model_loader: loaded meta data with 27 key-value pairs and 291 tensors from /home/adithya.bhat/llama.cpp/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 8192
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                          general.file_type u32              = 7
llama_model_loader: - kv  16:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  7605.33 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 730|    1.3|     32|     512|   32| 30897M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  2240.02 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    96.02 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 400, n_keep = 1

Building a website can be done in 10 simple steps: Planning, Design, Prototyping, Content Creation, Website Structure, HTML and CSS, Images and Graphics, Testing, Launching, and Maintenance.
Planning: Determine the purpose of your website, identify your target audience, and create a sitemap to organize your content.
Design: Create a visual representation of your website using a wireframe or a design tool like Sketch or Figma.
Prototyping: Build a functional version of your website using a prototyping tool like InVision or Adobe XD.
Content Creation: Write and design the text, images, and other content for your website.
Website Structure: Organize your content using a hierarchical structure and create a navigation menu.
HTML and CSS: Write the code for your website's structure and styling using HTML and CSS.
Images and Graphics: Optimize and compress your images and graphics for faster loading times.
Testing: Test your website for functionality, usability, and compatibility with different devices and browsers.
Launching: Upload your website to a hosting server and make it available to the public.
Maintenance: Update your website regularly with new content, fix broken links, and monitor analytics to improve performance. Building a website can be a complex process, but breaking it down into these 10 simple steps can help make it more manageable. By following these steps, you can create a professional-looking website that effectively communicates your message to your target audience. Building a website can be a complex process, but breaking it down into these 10 simple steps can help make it more manageable. By following these steps, you can create a professional-looking website that effectively communicates your message to your target audience. Building a website can be a complex process, but breaking it down into these 10 simple steps can help make it more manageable. By following these steps, you can create a professional-looking website that effectively communicates your message to your target audience. Building a website can be a complex process, but breaking it down into these 10 simple steps can help make it more manageable. By following these steps,
llama_print_timings:        load time =    7797.13 ms
llama_print_timings:      sample time =      30.64 ms /   400 runs   (    0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time =    1322.78 ms /    13 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   11301.98 ms /   399 runs   (   28.33 ms per token,    35.30 tokens per second)
llama_print_timings:       total time =   12711.93 ms /   412 tokens
adi-lb-phoenix commented 2 days ago

Command used to start the server in parallel mode:

./build/bin/llama-server -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf   --port 9000 --parallel 4

Four tabs were opened and the same question was fed to the server from each. The output contained no garbage values like those emitted by ipex-llm, but it was definitely slow. A command-line reproduction is sketched below.
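The same concurrent load can also be generated without browser tabs. A minimal sketch, assuming the server started above is listening on port 9000 and using the /completion endpoint that appears in the log (the prompt and n_predict payload fields follow the llama-server README):

for i in 1 2 3 4; do
  # each request runs in the background, so all four arrive at once,
  # occupying one server slot each (--parallel 4)
  curl -s http://127.0.0.1:9000/completion \
    -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 400}' &
done
wait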

Below is the log from the periods when simultaneous requests were sent from the four tabs.

prompt eval time =    3676.79 ms /    52 tokens (   70.71 ms per token,    14.14 tokens per second)
       eval time =   74529.49 ms /   285 tokens (  261.51 ms per token,     3.82 tokens per second)
      total time =   78206.28 ms /   337 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  2 | task 4 | stop processing: n_past = 340, truncated = 0
slot print_timing: id  2 | task 4 | 
prompt eval time =    3676.45 ms /    52 tokens (   70.70 ms per token,    14.14 tokens per second)
       eval time =   76562.58 ms /   289 tokens (  264.92 ms per token,     3.77 tokens per second)
      total time =   80239.03 ms /   341 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 427, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =    3926.09 ms /    52 tokens (   75.50 ms per token,    13.24 tokens per second)
       eval time =  105014.22 ms /   376 tokens (  279.29 ms per token,     3.58 tokens per second)
      total time =  108940.31 ms /   428 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  1 | task 2 | stop processing: n_past = 451, truncated = 0
slot print_timing: id  1 | task 2 | 
prompt eval time =    3451.28 ms /    52 tokens (   66.37 ms per token,    15.07 tokens per second)
       eval time =  106420.36 ms /   400 tokens (  266.05 ms per token,     3.76 tokens per second)
      total time =  109871.64 ms /   452 tokens
srv  update_slots: all slots are idle
request: POST /completion 127.0.0.1 200
slot launch_slot_: id  3 | task 405 | processing task
slot update_slots: id  3 | task 405 | tokenizing prompt, len = 1
slot update_slots: id  3 | task 405 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 346
slot update_slots: id  3 | task 405 | kv cache rm [334, end)
slot update_slots: id  3 | task 405 | prompt processing progress, n_past = 346, n_tokens = 12, progress = 0.034682
slot update_slots: id  3 | task 405 | prompt done, n_past = 346, n_tokens = 12
slot launch_slot_: id  0 | task 415 | processing task
slot update_slots: id  0 | task 415 | tokenizing prompt, len = 1
slot update_slots: id  0 | task 415 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 351
slot update_slots: id  0 | task 415 | kv cache rm [54, end)
slot update_slots: id  0 | task 415 | prompt processing progress, n_past = 351, n_tokens = 298, progress = 0.846154
slot update_slots: id  0 | task 415 | prompt done, n_past = 351, n_tokens = 298
slot launch_slot_: id  1 | task 417 | processing task
slot update_slots: id  1 | task 417 | tokenizing prompt, len = 1
slot update_slots: id  1 | task 417 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 464
slot update_slots: id  1 | task 417 | kv cache rm [451, end)
slot update_slots: id  1 | task 417 | prompt processing progress, n_past = 464, n_tokens = 15, progress = 0.028017
slot update_slots: id  1 | task 417 | prompt done, n_past = 464, n_tokens = 15
slot      release: id  3 | task 405 | stop processing: n_past = 624, truncated = 0
slot print_timing: id  3 | task 405 | 
prompt eval time =     725.80 ms /    12 tokens (   60.48 ms per token,    16.53 tokens per second)
       eval time =  119621.87 ms /   279 tokens (  428.75 ms per token,     2.33 tokens per second)
      total time =  120347.67 ms /   291 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  0 | task 415 | stop processing: n_past = 680, truncated = 0
slot print_timing: id  0 | task 415 | 
prompt eval time =    3852.78 ms /   297 tokens (   12.97 ms per token,    77.09 tokens per second)
       eval time =  130432.33 ms /   330 tokens (  395.25 ms per token,     2.53 tokens per second)
      total time =  134285.11 ms /   627 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  1 | task 417 | stop processing: n_past = 820, truncated = 0
slot print_timing: id  1 | task 417 | 
prompt eval time =     895.55 ms /    13 tokens (   68.89 ms per token,    14.52 tokens per second)
       eval time =  135633.50 ms /   357 tokens (  379.93 ms per token,     2.63 tokens per second)
      total time =  136529.05 ms /   370 tokens
srv  update_slots: all slots are idle
request: POST /completion 127.0.0.1 200
slot launch_slot_: id  3 | task 775 | processing task
slot update_slots: id  3 | task 775 | tokenizing prompt, len = 1
slot update_slots: id  3 | task 775 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 635
slot update_slots: id  3 | task 775 | kv cache rm [623, end)
slot update_slots: id  3 | task 775 | prompt processing progress, n_past = 635, n_tokens = 12, progress = 0.018898
slot update_slots: id  3 | task 775 | prompt done, n_past = 635, n_tokens = 12
slot launch_slot_: id  0 | task 783 | processing task
slot update_slots: id  0 | task 783 | tokenizing prompt, len = 1
slot update_slots: id  0 | task 783 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 690
slot update_slots: id  0 | task 783 | kv cache rm [678, end)
slot update_slots: id  0 | task 783 | prompt processing progress, n_past = 690, n_tokens = 13, progress = 0.017391
slot update_slots: id  0 | task 783 | prompt done, n_past = 690, n_tokens = 13
slot launch_slot_: id  1 | task 790 | processing task
slot update_slots: id  1 | task 790 | tokenizing prompt, len = 1
slot update_slots: id  1 | task 790 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 830
slot update_slots: id  1 | task 790 | kv cache rm [818, end)
slot update_slots: id  1 | task 790 | prompt processing progress, n_past = 830, n_tokens = 14, progress = 0.014458
slot update_slots: id  1 | task 790 | prompt done, n_past = 830, n_tokens = 14
slot launch_slot_: id  2 | task 794 | processing task
slot update_slots: id  2 | task 794 | tokenizing prompt, len = 1
slot update_slots: id  2 | task 794 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 437
slot update_slots: id  2 | task 794 | kv cache rm [54, end)
slot update_slots: id  2 | task 794 | prompt processing progress, n_past = 437, n_tokens = 386, progress = 0.876430
slot update_slots: id  2 | task 794 | prompt done, n_past = 437, n_tokens = 386
slot      release: id  3 | task 775 | stop processing: n_past = 974, truncated = 0
slot print_timing: id  3 | task 775 | 
prompt eval time =     732.11 ms /    12 tokens (   61.01 ms per token,    16.39 tokens per second)
       eval time =   99757.08 ms /   340 tokens (  293.40 ms per token,     3.41 tokens per second)
      total time =  100489.18 ms /   352 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  0 | task 783 | stop processing: n_past = 1089, truncated = 0
slot print_timing: id  0 | task 783 | 
prompt eval time =     858.03 ms /    12 tokens (   71.50 ms per token,    13.99 tokens per second)
       eval time =  126285.69 ms /   400 tokens (  315.71 ms per token,     3.17 tokens per second)
      total time =  127143.72 ms /   412 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  1 | task 790 | stop processing: n_past = 1229, truncated = 0
slot print_timing: id  1 | task 790 | 
prompt eval time =     884.82 ms /    12 tokens (   73.74 ms per token,    13.56 tokens per second)
       eval time =  125687.10 ms /   400 tokens (  314.22 ms per token,     3.18 tokens per second)
      total time =  126571.93 ms /   412 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  2 | task 794 | stop processing: n_past = 836, truncated = 0
slot print_timing: id  2 | task 794 | 
prompt eval time =    3891.03 ms /   383 tokens (   10.16 ms per token,    98.43 tokens per second)
       eval time =  121694.66 ms /   400 tokens (  304.24 ms per token,     3.29 tokens per second)
      total time =  125585.69 ms /   783 tokens
srv  update_slots: all slots are idle
request: POST /completion 127.0.0.1 200
slot launch_slot_: id  2 | task 1195 | processing task
slot update_slots: id  2 | task 1195 | tokenizing prompt, len = 1
slot update_slots: id  2 | task 1195 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 849
slot update_slots: id  2 | task 1195 | kv cache rm [836, end)
slot update_slots: id  2 | task 1195 | prompt processing progress, n_past = 849, n_tokens = 13, progress = 0.015312
slot update_slots: id  2 | task 1195 | prompt done, n_past = 849, n_tokens = 13
slot launch_slot_: id  1 | task 1217 | processing task
slot update_slots: id  1 | task 1217 | tokenizing prompt, len = 1
slot update_slots: id  1 | task 1217 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1241
slot update_slots: id  1 | task 1217 | kv cache rm [1229, end)
slot update_slots: id  1 | task 1217 | prompt processing progress, n_past = 1241, n_tokens = 13, progress = 0.009670
slot update_slots: id  1 | task 1217 | prompt done, n_past = 1241, n_tokens = 13
slot launch_slot_: id  0 | task 1237 | processing task
slot update_slots: id  0 | task 1237 | tokenizing prompt, len = 1
slot update_slots: id  0 | task 1237 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1101
slot update_slots: id  0 | task 1237 | kv cache rm [1089, end)
slot update_slots: id  0 | task 1237 | prompt processing progress, n_past = 1101, n_tokens = 14, progress = 0.010899
slot update_slots: id  0 | task 1237 | prompt done, n_past = 1101, n_tokens = 14
slot launch_slot_: id  3 | task 1248 | processing task
slot update_slots: id  3 | task 1248 | tokenizing prompt, len = 1
slot update_slots: id  3 | task 1248 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 984
slot update_slots: id  3 | task 1248 | kv cache rm [972, end)
slot update_slots: id  3 | task 1248 | prompt processing progress, n_past = 984, n_tokens = 15, progress = 0.012195
slot update_slots: id  3 | task 1248 | prompt done, n_past = 984, n_tokens = 15
slot      release: id  1 | task 1217 | stop processing: n_past = 1473, truncated = 0
slot print_timing: id  1 | task 1217 | 
prompt eval time =     934.52 ms /    12 tokens (   77.88 ms per token,    12.84 tokens per second)
       eval time =   71915.53 ms /   233 tokens (  308.65 ms per token,     3.24 tokens per second)
      total time =   72850.05 ms /   245 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  0 | task 1237 | stop processing: n_past = 1389, truncated = 0
slot print_timing: id  0 | task 1237 | 
prompt eval time =     817.42 ms /    12 tokens (   68.12 ms per token,    14.68 tokens per second)
       eval time =   97669.57 ms /   289 tokens (  337.96 ms per token,     2.96 tokens per second)
      total time =   98486.99 ms /   301 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  3 | task 1248 | stop processing: n_past = 1266, truncated = 0
slot print_timing: id  3 | task 1248 | 
prompt eval time =    1060.48 ms /    12 tokens (   88.37 ms per token,    11.32 tokens per second)
       eval time =   93208.52 ms /   283 tokens (  329.36 ms per token,     3.04 tokens per second)
      total time =   94269.00 ms /   295 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  2 | task 1195 | stop processing: n_past = 1224, truncated = 0
slot print_timing: id  2 | task 1195 | 
prompt eval time =     760.21 ms /    13 tokens (   58.48 ms per token,    17.10 tokens per second)
       eval time =  119571.51 ms /   376 tokens (  318.01 ms per token,     3.14 tokens per second)
      total time =  120331.72 ms /   389 tokens
srv  update_slots: all slots are idle
request: POST /completion 127.0.0.1 200
slot launch_slot_: id  3 | task 1575 | processing task
slot update_slots: id  3 | task 1575 | tokenizing prompt, len = 1
slot update_slots: id  3 | task 1575 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1276
slot update_slots: id  3 | task 1575 | kv cache rm [1265, end)
slot update_slots: id  3 | task 1575 | prompt processing progress, n_past = 1276, n_tokens = 11, progress = 0.008621
slot update_slots: id  3 | task 1575 | prompt done, n_past = 1276, n_tokens = 11
slot launch_slot_: id  0 | task 1582 | processing task
slot update_slots: id  0 | task 1582 | tokenizing prompt, len = 1
slot update_slots: id  0 | task 1582 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1399
slot update_slots: id  0 | task 1582 | kv cache rm [1388, end)
slot update_slots: id  0 | task 1582 | prompt processing progress, n_past = 1399, n_tokens = 12, progress = 0.007863
slot update_slots: id  0 | task 1582 | prompt done, n_past = 1399, n_tokens = 12
slot launch_slot_: id  1 | task 1589 | processing task
slot update_slots: id  1 | task 1589 | tokenizing prompt, len = 1
slot update_slots: id  1 | task 1589 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1482
slot update_slots: id  1 | task 1589 | kv cache rm [1471, end)
slot update_slots: id  1 | task 1589 | prompt processing progress, n_past = 1482, n_tokens = 13, progress = 0.007422
slot update_slots: id  1 | task 1589 | prompt done, n_past = 1482, n_tokens = 13
slot launch_slot_: id  2 | task 1595 | processing task
slot update_slots: id  2 | task 1595 | tokenizing prompt, len = 1
slot update_slots: id  2 | task 1595 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1233
slot update_slots: id  2 | task 1595 | kv cache rm [1222, end)
slot update_slots: id  2 | task 1595 | prompt processing progress, n_past = 1233, n_tokens = 14, progress = 0.008921
slot update_slots: id  2 | task 1595 | prompt done, n_past = 1233, n_tokens = 14
slot      release: id  3 | task 1575 | stop processing: n_past = 1599, truncated = 0
slot print_timing: id  3 | task 1575 | 
prompt eval time =     897.99 ms /    11 tokens (   81.64 ms per token,    12.25 tokens per second)
       eval time =  118307.07 ms /   324 tokens (  365.15 ms per token,     2.74 tokens per second)
      total time =  119205.06 ms /   335 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  2 | task 1595 | stop processing: n_past = 1589, truncated = 0
slot print_timing: id  2 | task 1595 | 
prompt eval time =    1068.49 ms /    11 tokens (   97.14 ms per token,    10.29 tokens per second)
       eval time =  134298.98 ms /   357 tokens (  376.19 ms per token,     2.66 tokens per second)
      total time =  135367.47 ms /   368 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  0 | task 1582 | stop processing: n_past = 1780, truncated = 0
slot print_timing: id  0 | task 1582 | 
prompt eval time =     963.06 ms /    11 tokens (   87.55 ms per token,    11.42 tokens per second)
       eval time =  143733.25 ms /   382 tokens (  376.27 ms per token,     2.66 tokens per second)
      total time =  144696.31 ms /   393 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  1 | task 1589 | stop processing: n_past = 1881, truncated = 0
slot print_timing: id  1 | task 1589 | 
prompt eval time =    1108.24 ms /    11 tokens (  100.75 ms per token,     9.93 tokens per second)
       eval time =  147575.40 ms /   400 tokens (  368.94 ms per token,     2.71 tokens per second)
      total time =  148683.64 ms /   411 tokens
srv  update_slots: all slots are idle
request: POST /completion 127.0.0.1 200
qnixsynapse commented 2 days ago

Just for the sake of curiosity, can you try building and testing with this patch please?

diff --git a/ggml/src/ggml-sycl.cpp b/ggml/src/ggml-sycl.cpp
index acef7c6d..009911ff 100644
--- a/ggml/src/ggml-sycl.cpp
+++ b/ggml/src/ggml-sycl.cpp
@@ -3496,8 +3496,12 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor

     bool use_mul_mat_vec_q =  ggml_is_quantized(src0->type)
         && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32
-        && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE
-        && (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda || src1->ne[1] > MMVQ_MIN_BATCH_SIZE);
+        && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE;
+        
+       
+    if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda) {
+       use_mul_mat_vec_q = use_mul_mat_vec_q && (src1->ne[1] > MMVQ_MIN_BATCH_SIZE);
+    }

     bool use_mul_mat_q =  ggml_sycl_supports_mmq(src0->type)
         && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32;
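In case it helps, this is roughly how the patch can be applied and rebuilt; a sketch only, assuming a SYCL build as in the repo's SYCL docs, with an arbitrary patch file name:

# set up the oneAPI environment, apply the diff, then rebuild the SYCL backend
source /opt/intel/oneapi/setvars.sh
git apply sycl-mmvq-revert.patch
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release
cmake --build build -j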
adi-lb-phoenix commented 2 days ago

@qnixsynapse what exactly do the above changes do?

qnixsynapse commented 2 days ago

They revert a change and move that specific change to Nvidia only, so the MMVQ_MIN_BATCH_SIZE condition is applied just on the CUDA backend. Please see #9088

adi-lb-phoenix commented 2 days ago

@qnixsynapse

cd /home/adithya.bhat/llama.cpp/build/pocs/vdot && /usr/bin/cmake -E cmake_link_script CMakeFiles/llama-q8dot.dir/link.txt --verbose=1
/opt/intel/oneapi/compiler/2024.2/bin/icpx -O3 -DNDEBUG CMakeFiles/llama-vdot.dir/vdot.cpp.o -o ../../bin/llama-vdot -Wl,-rpath,/home/adithya.bhat/llama.cpp/build/src:/home/adithya.bhat/llama.cpp/build/ggml/src ../../common/libcommon.a ../../src/libllama.so ../../ggml/src/libggml.so
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 64%] Built target test-autorelease
/opt/intel/oneapi/compiler/2024.2/bin/icpx -O3 -DNDEBUG CMakeFiles/llama-q8dot.dir/q8dot.cpp.o -o ../../bin/llama-q8dot -Wl,-rpath,/home/adithya.bhat/llama.cpp/build/src:/home/adithya.bhat/llama.cpp/build/ggml/src ../../common/libcommon.a ../../src/libllama.so ../../ggml/src/libggml.so
[ 65%] Built target test-backend-ops
[ 66%] Built target llama-batched-bench
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 68%] Built target test-log
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 69%] Built target llama-embedding
[ 69%] Built target llama-export-lora
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 69%] Built target test-grammar-integration
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 70%] Built target llama-lookup-create
[ 70%] Built target llama-gbnf-validator
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 70%] Built target llama-eval-callback
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 71%] Built target llama-bench
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 71%] Built target llama-gguf-split
[ 72%] Built target test-llama-grammar
[ 73%] Built target test-arg-parser
[ 73%] Built target llama-save-load-state
[ 74%] Built target test-json-schema-to-grammar
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 74%] Built target llama-gritlm
[ 74%] Built target llama-cvector-generator
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 75%] Built target llama-llava-cli
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 76%] Built target llama-infill
[ 77%] Built target test-quantize-perf
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 78%] Built target llama-lookup-stats
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 79%] Built target llama-lookup-merge
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 79%] Built target llama-quantize
[ 79%] Built target llama-convert-llama2c-to-ggml
[ 80%] Built target llama-simple
[ 81%] Built target llama-lookup
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 81%] Built target llama-cli
[ 82%] Built target llama-batched
[ 82%] Built target llama-q8dot
[ 91%] Built target llama-server
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 92%] Built target llama-perplexity
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 93%] Built target llama-minicpmv-cli
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 93%] Built target llama-imatrix
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 94%] Built target llama-ls-sycl-device
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 95%] Built target llama-tokenize
[ 95%] Built target llama-retrieval
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 96%] Built target llama-passkey
[ 97%] Built target llama-lookahead
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 98%] Built target llama-parallel
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
gmake[2]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
[ 99%] Built target llama-speculative
[100%] Built target llama-vdot
gmake[1]: Leaving directory '/home/adithya.bhat/llama.cpp/build'
/usr/bin/cmake -E cmake_progress_start /home/adithya.bhat/llama.cpp/build/CMakeFiles 0

qnixsynapse commented 2 days ago

@adi-lb-phoenix please check if the issue of lower performance still exists or not.

adi-lb-phoenix commented 2 days ago

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf  -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0 -ub 2048
build: 3769 (d54c21df) with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 27 key-value pairs and 291 tensors from models/Meta-Llama-3-8B-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 8192
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                          general.file_type u32              = 7
llama_model_loader: - kv  16:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  7605.33 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 730|    1.3|     32|     512|   32| 30897M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  2240.02 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    96.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

sampler seed: 3505607311
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 400, n_keep = 1

Building a website can be done in 10 simple steps: 1. Plan your website's purpose and goals. 2. Choose a domain name and web hosting service. 3. Design the layout and structure of your website. 4. Create the content for your website, such as text, images, and videos. 5. Choose a content management system (CMS) or build a custom website using a programming language like HTML, CSS, and JavaScript. 6. Build the website's database and database schema. 7. Develop the website's functionality, such as login and registration forms, search functionality, and e-commerce integration. 8. Test the website for bugs and errors. 9. Launch the website and make it live for users. 10. Maintain and update the website regularly to ensure it remains secure, fast, and user-friendly.
Building a website can be done in 10 simple steps: 1. Plan your website's purpose and goals. 2. Choose a domain name and web hosting service. 3. Design the layout and structure of your website. 4. Create the content for your website, such as text, images, and videos. 5. Choose a content management system (CMS) or build a custom website using a programming language like HTML, CSS, and JavaScript. 6. Build the website's database and database schema. 7. Develop the website's functionality, such as login and registration forms, search functionality, and e-commerce integration. 8. Test the website for bugs and errors. 9. Launch the website and make it live for users. 10. Maintain and update the website regularly to ensure it remains secure, fast, and user-friendly.
Building a website can be done in 10 simple steps: 1. Plan your website's purpose and goals. 2. Choose a domain name and web hosting service. 3. Design the layout and structure of your website. 4. Create the content for your website, such as text, images

llama_perf_sampler_print:    sampling time =      31.78 ms /   413 runs   (    0.08 ms per token, 12994.37 tokens per second)
llama_perf_context_print:        load time =    6872.89 ms
llama_perf_context_print: prompt eval time =     451.75 ms /    13 tokens (   34.75 ms per token,    28.78 tokens per second)
llama_perf_context_print:        eval time =   22841.11 ms /   399 runs   (   57.25 ms per token,    17.47 tokens per second)
llama_perf_context_print:       total time =   23369.78 ms /   412 tokens

There was no significant change.

qnixsynapse commented 2 days ago

Thank you for confirming. This isn't related to my issue.

edit: I saw 3 tokens/sec in your server test, so I thought this might be related.

adi-lb-phoenix commented 1 day ago

That was when I ran a server and executed tasks through four different tabs.

qnixsynapse commented 1 day ago

Yeah, I also get that when running a server. That revert fixes it in my testing.

adi-lb-phoenix commented 1 day ago

Can you please share the logs, the test conditions, and the model used for the test? I used Meta-Llama-3-8B-Instruct.Q8_0.gguf.

qnixsynapse commented 1 day ago

I am using quantized models such as iq4_xs to test on my server. The master branch has no problem with fp16 or fp32 models. The PR I linked seems to cause a regression in my case.

NeoZhangJianyu commented 1 day ago

@adi-lb-phoenix Thank you for reporting this issue! Your test result matches what we already know (the performance gap between llama.cpp and IPEX LLM).

We could reach the same performance as IPEX LLM, but we need time, because all developers are working in their spare time.

As far as I know, some developers are working on it. The performance of llama.cpp has increased by about 20% in the past 5 months.

In the past half year, we focused on functionality and bugs. Next, maybe we need to focus on performance.


ayttop commented 1 day ago

Anyone who has an Intel integrated GPU should try the koboldcpp_nocuda program and choose Vulkan. The integrated GPU will work without any extra setup.
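For example, something along these lines (flag names can differ between koboldcpp releases, so treat this as a sketch):

koboldcpp_nocuda.exe --usevulkan --model Meta-Llama-3-8B-Instruct.Q8_0.gguf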

adi-lb-phoenix commented 20 hours ago

@NeoZhangJianyu, thank you for the info. This is such a great tool. Can you tag the contributors working on this, and possibly see if we can work with them to improve performance?