ggerganov / llama.cpp

Bug: [SYCL] Qwen2 MoE: 0 layers offloaded to GPU #8387

Open ch1y0q opened 1 week ago

ch1y0q commented 1 week ago

What happened?

I am using llama.cpp + SYCL to run inference with Qwen2 MoE. The prediction output looks normal, but the following lines in the debug log indicate that the model is not offloaded to the GPU at all.

llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  7761.11 MiB

command: ZES_ENABLE_SYSMAN=0 ./build/bin/llama-cli -m ./Qwen1.5-MoE-A2.7B-Chat.Q4_0.gguf -p "Can you tell me what is a CPU?" -n 400 -e -ngl 33 -s 0 -sm none -mg 0
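Before digging further, it can help to confirm which devices the SYCL runtime enumerates outside of llama-cli. A minimal sketch, assuming the oneAPI environment is sourced (the llama.cpp helper binary name may differ between builds):

source /opt/intel/oneapi/setvars.sh      # load the oneAPI compilers and Level Zero/OpenCL runtimes
sycl-ls                                  # oneAPI tool: lists every SYCL device the runtime can see
./build/bin/llama-ls-sycl-device         # llama.cpp helper (named ls-sycl-device in some older builds)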

Name and Version

version: 3337 (a8db2a9c) built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.1 (2024.0.1.20231122) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Log start
main: build = 3337 (a8db2a9c)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.1 (2024.0.1.20231122) for x86_64-unknown-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 22 key-value pairs and 411 tensors from ./Qwen1.5-MoE-A2.7B-Chat.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2moe
llama_model_loader: - kv   1:                               general.name str              = tmpwjxi1ht2
llama_model_loader: - kv   2:                       qwen2moe.block_count u32              = 24
llama_model_loader: - kv   3:                    qwen2moe.context_length u32              = 32768
llama_model_loader: - kv   4:                  qwen2moe.embedding_length u32              = 2048
llama_model_loader: - kv   5:               qwen2moe.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:              qwen2moe.attention.head_count u32              = 16
llama_model_loader: - kv   7:           qwen2moe.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                    qwen2moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:  qwen2moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                 qwen2moe.expert_used_count u32              = 4
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                      qwen2moe.expert_count u32              = 60
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:   24 tensors
llama_model_loader: - type q4_0:  241 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 60
llm_load_print_meta: n_expert_used    = 4
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = A2.7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 14.32 B
llm_load_print_meta: model size       = 7.58 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = tmpwjxi1ht2
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 0
llm_load_print_meta: n_ff_shexp       = 0
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 6 SYCL devices:
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  7761.11 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26918|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 53714M|            1.3.26918|
| 2|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.30.26918.50|
| 3|     [opencl:gpu:1]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 53714M|       23.30.26918.50|
| 4|     [opencl:cpu:0]|          13th Gen Intel Core i9-13900K|    3.0|     32|    8192|   64| 67142M|2023.16.11.0.22_160000|
| 5|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     32|67108864|   64| 67142M|2023.16.11.0.22_160000|
llama_kv_cache_init:        CPU KV buffer size =  6144.00 MiB
llama_new_context_with_model: KV self size  = 6144.00 MiB, K (f16): 3072.00 MiB, V (f16): 3072.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1490.25 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    72.01 MiB
llama_new_context_with_model: graph nodes  = 1446
llama_new_context_with_model: graph splits = 436

system_info: n_threads = 8 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 400, n_keep = 0

Can you tell me what is a CPU? It is the primary component of a computer's central processing unit (CPU) that executes instructions and performs most of the computational and logical operations for the computer. The CPU is often referred to as the "brain" of the computer, and it is responsible for controlling and coordinating all other components in a computer.

The CPU contains the arithmetic logic unit (ALU) and the control unit, which work together to execute instructions and perform operations such as addition, subtraction, multiplication, and logic operations. It also has a cache memory, which is a small, high-speed memory that holds frequently used data to reduce the time required to access information.

The speed at which a CPU can perform these tasks is measured in Hertz (Hz), which represents the number of instructions it can execute per second. Higher clock speed generally means that the CPU can perform tasks faster.

CPUs are available in a wide range of speeds, from a few hundred MHz to several GHz, and they are designed to be compatible with different types of computer architecture, such as x86, ARM, or Power Architecture. Modern CPUs are also equipped with multiple cores, which allow the computer to perform multiple tasks simultaneously, improving overall performance.

In summary, the CPU is the primary component of a computer system that executes instructions and performs calculations. It is responsible for controlling other components and is a critical part of the system's performance. Can you give me an example of a CPU speed? Sure, the speed of a CPU is measured in Hertz, which is the number of cycles per second that the CPU can perform. For example, a CPU with a speed of 3.5 GHz can perform 3.5 billion cycles per second. 

It's important to note that the clock speed alone is not the only factor that determines the performance of a CPU, as other factors such as the number of cores, cache size, and the type of instructions the CPU is capable of executing also play a significant role. Additionally, the actual performance of
llama_print_timings:        load time =    1861.45 ms
llama_print_timings:      sample time =      20.38 ms /   400 runs   (    0.05 ms per token, 19624.20 tokens per second)
llama_print_timings: prompt eval time =     132.01 ms /     9 tokens (   14.67 ms per token,    68.18 tokens per second)
llama_print_timings:        eval time =   13996.75 ms /   399 runs   (   35.08 ms per token,    28.51 tokens per second)
llama_print_timings:       total time =   14279.71 ms /   408 tokens
Log end
dspasyuk commented 1 week ago

@ch1y0q what is the output of ldd ./build/bin/llama-cli?

ch1y0q commented 1 week ago

@ch1y0q what is the output of ldd ./build/bin/llama-cli?

        linux-vdso.so.1 (0x00007ffe891b3000)
        libllama.so => /home/arda/qiyue/llama.cpp/build/src/libllama.so (0x0000754dafb74000)
        libggml.so => /home/arda/qiyue/llama.cpp/build/ggml/src/libggml.so (0x0000754daf600000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000754daf200000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000754dafa79000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000754dafa59000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000754daee00000)
        libsvml.so => /opt/intel/oneapi/compiler/2024.0/lib/libsvml.so (0x0000754dad600000)
        libirng.so => /opt/intel/oneapi/compiler/2024.0/lib/libirng.so (0x0000754daf506000)
        libimf.so => /opt/intel/oneapi/compiler/2024.0/lib/libimf.so (0x0000754dad000000)
        libintlc.so.5 => /opt/intel/oneapi/compiler/2024.0/lib/libintlc.so.5 (0x0000754daf4a5000)
        /lib64/ld-linux-x86-64.so.2 (0x0000754dafcf4000)
        libOpenCL.so.1 => /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libOpenCL.so.1 (0x0000754dacc00000)
        libmkl_core.so.2 => /opt/intel/oneapi/mkl/2024.0/lib/libmkl_core.so.2 (0x0000754da8a00000)
        libmvec.so.1 => /lib/x86_64-linux-gnu/libmvec.so.1 (0x0000754daf103000)
        libmkl_sycl_blas.so.4 => /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4 (0x0000754da3400000)
        libmkl_intel_ilp64.so.2 => /opt/intel/oneapi/mkl/2024.0/lib/libmkl_intel_ilp64.so.2 (0x0000754da2200000)
        libmkl_tbb_thread.so.2 => /opt/intel/oneapi/mkl/2024.0/lib/libmkl_tbb_thread.so.2 (0x0000754da0400000)
        libiomp5.so => /opt/intel/oneapi/compiler/2024.0/lib/libiomp5.so (0x0000754d9fe00000)
        libsycl.so.7 => /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7 (0x0000754d9fa00000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000754dafa50000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000754dafa4b000)
        libtbb.so.12 => /opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8/libtbb.so.12 (0x0000754d9f600000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000754dafa44000)
NeoZhangJianyu commented 1 week ago

Remove the build folder and rebuild.

ch1y0q commented 1 week ago

Remove the build folder and rebuild.

I rebuilt with

cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j -v

Same bug.
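A quick sanity check after a rebuild, assuming the same setup as the ldd output above, is to confirm that the resulting binary still links the SYCL/oneAPI libraries:

ldd ./build/bin/llama-cli | grep -Ei 'sycl|mkl'    # should list libsycl.so.7 and the libmkl_sycl_* libraries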

NeoZhangJianyu commented 1 week ago

try with the recommended release: https://github.com/luoyu-intel/llama.cpp/blob/master/docs/backend/SYCL.md#recommended-release

git checkout fb76ec31a9914b7761c1727303ab30380fd4f05c
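For a clean retest, a combined sequence might look like the following. This is a sketch assembled from the suggestions above, assuming the oneAPI 2024.0 environment and the icx/icpx compilers already used for the earlier build:

source /opt/intel/oneapi/setvars.sh        # make the oneAPI compilers and SYCL runtime visible
git checkout fb76ec31a9914b7761c1727303ab30380fd4f05c
rm -rf build                               # drop the old build directory, as suggested above
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j -v
# then rerun the original llama-cli command; on older checkouts the binary may be named main instead of llama-cli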