ggerganov / llama.cpp

LLM inference in C/C++
MIT License

SYCL build failed #5547

Closed · DDXDB closed this issue 5 months ago

DDXDB commented 8 months ago

The build completes, but running build\bin\main.exe afterwards fails:

main: build = 0 (unknown)
main: built with MSVC 19.39.33519.0 for
main: seed  = 1708170078
llama_model_load: error loading model: failed to open models/7B/ggml-model-f16.gguf: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/7B/ggml-model-f16.gguf'
main: error: unable to load model

Run .\examples\sycl\win-run-llama2.bat

:: oneAPI environment initialized ::
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support

My PC: OS: Windows 11 (22631.3155), CPU: AMD Ryzen 5 5600X, GPU: Intel Arc A770

Run sycl-ls

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 5 5600X 6-Core Processor              OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [31.0.101.5330]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28328]
Jacoby1218 commented 8 months ago

Your problem is that you need to specify a model file. main looks for a default file (models/7B/ggml-model-f16.gguf) that doesn't ship with the repo. Use -m /path/to/model to point it at where your model is located.
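For example (path and prompt are illustrative; point -m at wherever your GGUF file actually lives):

build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Hello" -n 64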

hungle-i3 commented 8 months ago

:: oneAPI environment initialized ::
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support

The cause of this issue is described in #5555: the compile definition (GGML_USE_SYCL) is not set correctly in the latest source code.
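For reference, the configure step from the Windows SYCL guide of that period looked roughly like this (a sketch; the exact generator and flags may differ in your checkout), and it is the LLAMA_SYCL option that should end up defining GGML_USE_SYCL:

:: inside the oneAPI command prompt, after running setvars.bat
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release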

DDXDB commented 8 months ago

Your problem is that you need to specify a model file. main looks for a default file (models/7B/ggml-model-f16.gguf) that doesn't ship with the repo. Use -m /path/to/model to point it at where your model is located.

What I'm mainly focused on is the missing target after the word "for" here (main: built with MSVC 19.39.33519.0 for), because other people's builds show something like "for oneAPI****" at the end.

DDXDB commented 8 months ago

I tried building with the updated source, but it still doesn't work. New error:

F:\llama.cpp-master>build\bin\main.exe
Log start
main: build = 0 (unknown)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 1709332726
GGML_SYCL_DEBUG=0
invalid map<K, T> key
Exception caught at file:F:/llama.cpp-master/ggml-sycl.cpp, line:11277, func:operator()
llama_model_load: error loading model: failed to open models/7B/ggml-model-f16.gguf: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/7B/ggml-model-f16.gguf'
main: error: unable to load model

After loading the model

F:\llama.cpp-master>build\bin\main.exe -m models\ggml-model-iq2_xxs.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 0 (unknown)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
invalid map<K, T> key
Exception caught at file:F:/llama.cpp-master/ggml-sycl.cpp, line:11277, func:operator()
llama_model_loader: loaded meta data with 23 key-value pairs and 995 tensors from models\ggml-model-iq2_xxs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mixtral
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 19
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:   32 tensors
llama_model_loader: - type q2_K:   33 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type iq2_xxs:  832 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name     = mixtral
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.76 MiB
llama_model_load: error loading model: invalid map<K, T> key
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models\ggml-model-iq2_xxs.gguf'
main: error: unable to load model
NeoZhangJianyu commented 8 months ago

@DDXDB Currently, the SYCL backend doesn't support iq2_xxs. Please try another model file, such as https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf

If you're trying llama.cpp with SYCL, please first follow the guide to verify your hardware/software, then switch to your own model file.
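For instance, a quick verification run could look like this (a sketch; paths and prompt are illustrative):

build\bin\ls-sycl-device.exe
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33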

DDXDB commented 8 months ago

@DDXDB Currently, the SYCL backend doesn't support iq2_xxs. Please try another model file, such as https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf

If you're trying llama.cpp with SYCL, please first follow the guide to verify your hardware/software, then switch to your own model file.

It doesn't seem to be a model problem. I switched to the model you recommended, and the results are as follows:

:: initializing oneAPI environment...
   Initializing Visual Studio command-line environment...
   Visual Studio version 17.9.2 environment configured.
   "C:\Program Files\Microsoft Visual Studio\2022\Community\"
   Visual Studio command-line environment initialized for: 'x64'
:  advisor -- latest
:  compiler -- latest
:  dal -- latest
:  debugger -- latest
:  dev-utilities -- latest
:  dnnl -- latest
:  dpcpp-ct -- latest
:  dpl -- latest
:  ipp -- latest
:  ippcp -- latest
:  mkl -- latest
:  tbb -- latest
:  vtune -- latest
:: oneAPI environment initialized ::

C:\Program Files (x86)\Intel\oneAPI>"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
:: WARNING: setvars.bat has already been run. Skipping re-execution.
   To force a re-execution of setvars.bat, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.

C:\Program Files (x86)\Intel\oneAPI>F:

F:\>

F:\>cd F:\llama.cpp-master

F:\llama.cpp-master>set GGML_SYCL_DEVICE=0,1

F:\llama.cpp-master>set GGML_SYCL_DEVICE
GGML_SYCL_DEVICE=0,1

F:\llama.cpp-master>set GGML_SYCL_DEVICE=0

F:\llama.cpp-master>build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 0 (unknown)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
invalid map<K, T> key
Exception caught at file:F:/llama.cpp-master/ggml-sycl.cpp, line:11277, func:operator()
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models\llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llama_model_load: error loading model: invalid map<K, T> key
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models\llama-2-7b.Q4_0.gguf'
main: error: unable to load model

F:\llama.cpp-master>
NeoZhangJianyu commented 7 months ago

@DDXDB It looks like it can't detect the device correctly. Could you try the latest code and run ./build/bin/ls-sycl-device.exe?

DDXDB commented 7 months ago

@DDXDB It looks like it can't detect the device correctly. Could you try the latest code and run ./build/bin/ls-sycl-device.exe?

The latest master code works for me! By the way, does llama.cpp support multiple GPUs? I own multiple SYCL Level Zero devices.

NeoZhangJianyu commented 6 months ago

@DDXDB It's great to see your result. Yes, it supports multiple GPUs by default, but only GPUs with the same maximum compute units are used together.

If you have an iGPU + a dGPU, it will use only the dGPU. If you have two identical dGPUs, it will use both.
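If your cards differ, you can pin the run to one device with the GGML_SYCL_DEVICE environment variable shown earlier in this thread (the index here is illustrative; check the ls-sycl-device.exe listing for the right one):

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -ngl 33 -p "Hello" -n 64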

DDXDB commented 6 months ago

@DDXDB It's great to see your result. Yes, it supports multiple GPUs by default, but only GPUs with the same maximum compute units are used together.

If you have an iGPU + a dGPU, it will use only the dGPU. If you have two identical dGPUs, it will use both.

Emmmm... I have two different dGPUs, an A770 and an A750.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.