ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Issue with loading model on Intel Data Center GPU Max 1100 using CLBlast #4607

Closed kunger97 closed 5 months ago

kunger97 commented 6 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

The model loads and generates output normally.

Current Behavior

Running ./main -m ~/Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf -i --color -p "Hello" -ngl 99 -n 32 -c 2048 -b 512 causes the program to freeze.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  224
  On-line CPU(s) list:   0-223
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8480+
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  56
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmpe
                         rf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 c
                         at_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma c
                         lflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp 
                         hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr 
                         amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   5.3 MiB (112 instances)
  L1i:                   3.5 MiB (112 instances)
  L2:                    224 MiB (112 instances)
  L3:                    210 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-55,112-167
  NUMA node1 CPU(s):     56-111,168-223
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Linux node-14 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.9.16 :: Intel Corporation
GNU Make 4.3
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Failure Information (for bugs)

The program freezes; it does not crash.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Load Intel oneAPI (source setvars.sh).
  2. Compile the latest mainline code (cmake .. -DLLAMA_CLBLAST=ON and build); see the sketch after this list.
  3. Run ./main -m ~/Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf -i --color -p "Hello" -ngl 99 -n 32 -c 2048 -b 512
  4. The program prints llm_load_tensors: offloaded 41/41 layers to GPU and then freezes.
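
For reference, the build sequence was roughly the following (a sketch assuming a default oneAPI install under /opt/intel/oneapi; CLBlast itself must already be installed):

$ source /opt/intel/oneapi/setvars.sh
$ mkdir build && cd build
$ cmake .. -DLLAMA_CLBLAST=ON
$ cmake --build . --config Release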

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use GitHub's markdown to cleanly format your logs for easy readability.

Example environment info:

$ git log | head -1
commit 7082d24cec35e9ce9147535a2224dfc67ee0a78c

$ pip list | egrep "torch|numpy|sentencepiece"
intel-extension-for-pytorch   2.0.110+xpu
numpy                         1.24.3
sentencepiece                 0.1.99
torch                         2.0.1a0+cxx11.abi
torchvision                   0.15.2a0+cxx11.abi

$ md5sum Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf
0ed031f12e9de84a6c01e177290a86fe  Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf

Program log

$ ./main -m ~/Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf -i --color -p "Hello" -ngl 99 -n 32 -c 2048 -b 512
Log start
main: build = 1691 (7082d24)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1703300701
ggml_opencl: selecting platform: 'Intel(R) OpenCL Graphics'
ggml_opencl: selecting device: 'Intel(R) Data Center GPU Max 1100'
ggml_opencl: device FP16 support: true
llama_model_loader: loaded meta data with 19 key-value pairs and 323 tensors from /home/u22f390a763ad8fc99b0d55cf8c167d0/Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen
llama_model_loader: - kv   1:                               general.name str              = Qwen
llama_model_loader: - kv   2:                        qwen.context_length u32              = 8192
llama_model_loader: - kv   3:                           qwen.block_count u32              = 40
llama_model_loader: - kv   4:                      qwen.embedding_length u32              = 5120
llama_model_loader: - kv   5:                   qwen.feed_forward_length u32              = 27392
llama_model_loader: - kv   6:                        qwen.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   7:                  qwen.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                  qwen.attention.head_count u32              = 40
llama_model_loader: - kv   9:      qwen.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 151643
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_0:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q4_K:  121 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 27392
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 14.17 B
llm_load_print_meta: model size       = 8.79 GiB (5.33 BPW) 
llm_load_print_meta: general.name     = Qwen
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size       =    0.12 MiB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: system memory used  =  417.78 MiB
llm_load_tensors: VRAM used           = 8588.01 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU

Other info

$ clinfo -l
Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL Graphics
 +-- Device #0: Intel(R) Data Center GPU Max 1100
 +-- Device #1: Intel(R) Data Center GPU Max 1100
 +-- Device #2: Intel(R) Data Center GPU Max 1100
 `-- Device #3: Intel(R) Data Center GPU Max 1100
Platform #2: Intel(R) OpenCL
 `-- Device #0: Intel(R) Xeon(R) Platinum 8480+
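
Note: clinfo lists three OpenCL platforms and four identical GPUs. The CLBlast backend can be pinned to a specific platform and device via the GGML_OPENCL_PLATFORM and GGML_OPENCL_DEVICE environment variables (variable names per the llama.cpp CLBlast docs at the time); a sketch:

$ GGML_OPENCL_PLATFORM="Intel(R) OpenCL Graphics" GGML_OPENCL_DEVICE=0 \
  ./main -m ~/Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf -p "Hello" -ngl 99

The program log above shows the backend already chose platform 'Intel(R) OpenCL Graphics' and device 0 on its own.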

xpu-smi output while the program is running

$ xpu-smi ps
PID       Command             DeviceID       SHR            MEM            
560180    xpu-smi             0              0              2293           
560174    main                0              0              646316         
560180    xpu-smi             1              0              2293           
560174    main                1              0              393            
560180    xpu-smi             2              0              2293           
560174    main                2              0              393            
560180    xpu-smi             3              0              2293           
560174    main                3              0              393            
$ ldd main
        linux-vdso.so.1 (0x0000148305550000)
        libOpenCL.so.1 => /opt/intel/oneapi/compiler/2023.2.1/linux/lib/libOpenCL.so.1 (0x0000148304f7e000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000148304d48000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000148304c61000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000148304c41000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000148304a19000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000148304a14000)
        libsvml.so => /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libsvml.so (0x00001483033e3000)
        libirng.so => /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libirng.so (0x0000148303000000)
        libimf.so => /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libimf.so (0x0000148302c16000)
        libintlc.so.5 => /opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x000014830336b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000148303366000)
        /lib64/ld-linux-x86-64.so.2 (0x0000148305552000)

I am willing to provide any other necessary information; please feel free to ask.

JohnnyOpcode commented 6 months ago

This is interesting. I'm running llama.cpp on an Intel(R) Xeon(R) Platinum 8480 (4th generation) along with Arc-class GPUs. I'm still researching the best approach to adding a SYCL backend to GGML, which may help with this sort of hardware setup.

https://github.com/JohnnyOpcode/ggml-sycl

ggerganov commented 6 months ago

Clarify what "freezes" means - do you see any CPU / GPU usage?

Try adding -t 8 -tb 8 to the command-line

kunger97 commented 6 months ago

Sorry, I'm not a native English speaker, so the term "freezes" might not be accurate. After running the command (I also tried the suggested -t 8 -tb 8), the program output stops at 'llm_load_tensors: offloaded 41/41 layers to GPU' for a long time (possibly more than 30 minutes). At that point, htop shows two threads (processes) running, one of which occupies 100% of a CPU core. On the GPU side, the GPU with ID 0 is using approximately 615 MiB of VRAM, but the GPU frequency is 0.
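
If a backtrace of the hung process would help, it can be captured with gdb, along these lines (a sketch; PID 560174 is taken from the xpu-smi listing above, and attaching may require ptrace privileges):

$ gdb -p 560174 -batch -ex "thread apply all bt" > hang-backtrace.txt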

ggerganov commented 6 months ago

If you remove -i does the program finish successfully?

kunger97 commented 6 months ago

I attempted to run ./main -m ~/gguf/Sakura-13B-LNovel-v0.9.0-Q4_K_M.gguf -p "Hello" -ngl 99 -t 8 -tb 8, but there seems to be no change compared to the previous run. The output still stops after 'llm_load_tensors: offloaded 41/41 layers to GPU' (I waited about 20 minutes). It appears that the model has not (completely) loaded into VRAM.

kunger97 commented 5 months ago

Works with the SYCL backend; closing the issue.
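
For anyone hitting the same hang: the build that worked used the SYCL backend instead of CLBlast. Roughly, per the llama.cpp SYCL docs at the time (flag names may have changed in later versions):

$ source /opt/intel/oneapi/setvars.sh
$ mkdir build && cd build
$ cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
$ cmake --build . --config Release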