@ayttop maybe someone else knows better, but for integrated graphics, compiling for the Vulkan backend may be your only option, though it may not be faster than a CPU installation.
Hi @abetlen!
I am experimenting with llama.cpp and llama-cpp-python on my Windows laptop with Intel iGPU. I'm getting a build issue that I can't sort out yet.
I can build llama.cpp for the Intel iGPU and it works, but when I try llama-cpp-python, it (I think) recompiles llama.cpp for CPU by default, and when I try to make it compile for the GPU, I get a CMake error.
For context:
When I do:
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
set CMAKE_ARGS="-DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON"
pip install -e . --verbose
I get the following error when the SYCL test is executed:
Using pip 24.3.1 from C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\Lib\site-packages\pip (python 3.11)
Obtaining file:///C:/Users/dnoliver/GitHub/dnoliver/llama-cpp-python
Running command pip subprocess to install build dependencies
Using pip 24.3.1 from C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\Lib\site-packages\pip (python 3.11)
Collecting scikit-build-core>=0.9.2 (from scikit-build-core[pyproject]>=0.9.2)
Obtaining dependency information for scikit-build-core>=0.9.2 from https://files.pythonhosted.org/packages/88/fe/90476c4f6a1b2f922efa00d26e876dd40c7279e28ec18f08f0851ad21ba6/scikit_build_core-0.10.7-py3-none-any.whl.metadata
Using cached scikit_build_core-0.10.7-py3-none-any.whl.metadata (21 kB)
Collecting packaging>=21.3 (from scikit-build-core>=0.9.2->scikit-build-core[pyproject]>=0.9.2)
Obtaining dependency information for packaging>=21.3 from https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl.metadata
Using cached packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Collecting pathspec>=0.10.1 (from scikit-build-core>=0.9.2->scikit-build-core[pyproject]>=0.9.2)
Obtaining dependency information for pathspec>=0.10.1 from https://files.pythonhosted.org/packages/cc/20/ff623b09d963f88bfde16306a54e12ee5ea43e9b597108672ff3a408aad6/pathspec-0.12.1-py3-none-any.whl.metadata
Using cached pathspec-0.12.1-py3-none-any.whl.metadata (21 kB)
Using cached scikit_build_core-0.10.7-py3-none-any.whl (165 kB)
Using cached packaging-24.2-py3-none-any.whl (65 kB)
Using cached pathspec-0.12.1-py3-none-any.whl (31 kB)
Installing collected packages: pathspec, packaging, scikit-build-core
Successfully installed packaging-24.2 pathspec-0.12.1 scikit-build-core-0.10.7
Installing build dependencies ... done
Running command Checking if build backend supports build_editable
Checking if build backend supports build_editable ... done
Running command Getting requirements to build editable
Getting requirements to build editable ... done
Running command Preparing editable metadata (pyproject.toml)
*** scikit-build-core 0.10.7 using CMake 3.31.0 (metadata_editable)
Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (4.12.2)
Requirement already satisfied: numpy>=1.20.0 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (2.1.3)
Requirement already satisfied: diskcache>=5.6.1 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (5.6.3)
Requirement already satisfied: jinja2>=2.11.3 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (3.1.4)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from jinja2>=2.11.3->llama_cpp_python==0.3.1) (3.0.2)
Building wheels for collected packages: llama_cpp_python
Running command Building editable for llama_cpp_python (pyproject.toml)
*** scikit-build-core 0.10.7 using CMake 3.31.0 (editable)
*** Configuring CMake...
2024-11-18 14:30:52,414 - scikit_build_core - WARNING - Can't find a Python library, got libdir=None, ldlibrary=None, multiarch=None, masd=None
loading initial cache file C:\Users\dnoliver\AppData\Local\Temp\tmpe4am77ao\build\CMakeInit.txt
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.42.34433.0
-- The CXX compiler identification is MSVC 19.42.34433.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.0.windows.2")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM: x64
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -openmp (found version "2.0")
-- Found OpenMP: TRUE (found version "2.0")
-- OpenMP found
-- Using llamafile
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
-- Including CPU backend
-- Including CPU backend
-- Using AMX
-- Including AMX backend
-- Performing Test SUPPORTS_SYCL
-- Performing Test SUPPORTS_SYCL - Failed
-- Using oneAPI Release SYCL compiler (icpx).
-- SYCL found
-- DNNL found:1
CMake Error at vendor/llama.cpp/ggml/src/ggml-sycl/CMakeLists.txt:65 (find_package):
Found package configuration file:
C:/Program Files (x86)/Intel/oneAPI/compiler/latest/lib/cmake/IntelSYCL/IntelSYCLConfig.cmake
but it set IntelSYCL_FOUND to FALSE so package "IntelSYCL" is considered to
be NOT FOUND. Reason given by package:
Unsupported compiler family MSVC and compiler C:/Program Files/Microsoft
Visual
Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe!!
-- Configuring incomplete, errors occurred!
*** CMake configuration failed
error: subprocess-exited-with-error
× Building editable for llama_cpp_python (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
full command: 'C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\python.exe' 'C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py' build_editable 'C:\Users\dnoliver\AppData\Local\Temp\tmp6xaj06kh'
cwd: C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python
Building editable for llama_cpp_python (pyproject.toml) ... error
ERROR: Failed building editable for llama_cpp_python
Failed to build llama_cpp_python
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama_cpp_python)
I have run what I think is the equivalent command in llama.cpp, and all of the CMake tests pass (same terminal, so the same environment variables). For the command:
cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON
I get this successful output:
-- The C compiler identification is MSVC 19.42.34433.0
-- The CXX compiler identification is IntelLLVM 2025.0.0 with MSVC-like command-line
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Intel/oneAPI/compiler/latest/bin/icx.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.0.windows.2")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -Qiopenmp (found version "5.1")
-- Found OpenMP: TRUE (found version "2.0")
-- OpenMP found
-- Using llamafile
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
-- Including CPU backend
CMake Warning at ggml/src/ggml-amx/CMakeLists.txt:106 (message):
AMX requires x86 and gcc version > 11.0. Turning off GGML_AMX.
-- Performing Test SUPPORTS_SYCL
-- Performing Test SUPPORTS_SYCL - Success
-- Using oneAPI Release SYCL compiler (icpx).
-- SYCL found
-- DNNL found:1
-- Found IntelSYCL: C:/Program Files (x86)/Intel/oneAPI/compiler/latest/include (found version "202001")
-- MKL_VERSION: 2025.0.0
-- MKL_ROOT: C:/Program Files (x86)/Intel/oneAPI/mkl/latest
-- MKL_ARCH: intel64
-- MKL_SYCL_LINK: None, set to ` dynamic` by default
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_SYCL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_SYCL_THREADING: None, set to ` tbb_thread` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_scalapack_ilp64_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_scalapack_ilp64.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_cdft_core_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_cdft_core.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_intel_ilp64_dll.lib
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_intel_thread_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_intel_thread.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_core_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_core.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_blacs_ilp64_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_blacs_ilp64.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_blas_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_blas.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_lapack_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_lapack.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_dft_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_dft.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_sparse_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_sparse.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_data_fitting_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_data_fitting.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_rng_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_rng.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_stats_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_stats.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_vm_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_vm.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_tbb_thread_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_tbb_thread.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/compiler/latest/lib/libiomp5md.lib
-- Including SYCL backend
-- Configuring done (37.1s)
-- Generating done (1.0s)
-- Build files have been written to: C:/Users/dnoliver/GitHub/dnoliver/llama-cpp-python/vendor/llama.cpp/build
Can you help me sort out this build problem please?
Found the problem. It is caused by LLAVA_BUILD=ON. Disabling it makes it work.
To summarize:
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
set CMAKE_ARGS="-DLLAVA_BUILD=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON"
pip install -e . --verbose
gets you a version that works with the iGPU.
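Before loading a model, you can sanity-check that the reinstalled package was actually built with GPU offload support. This is a minimal sketch, assuming the low-level binding llama_cpp.llama_supports_gpu_offload() (a wrapper over the llama.cpp C API function of the same name) is exposed in your installed version:
import llama_cpp

# A CPU-only build is expected to report False here; a SYCL build should report True.
print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())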
Then, with this code, the model uses the iGPU to produce a completion:
from llama_cpp import Llama

llm = Llama(
    model_path="C:/Users/dnoliver/Downloads/Phi-3.5-mini-instruct.Q4_0.gguf",
    n_gpu_layers=-1,  # Offload all layers to the GPU
    seed=1337,        # Set a specific seed for reproducible sampling
    n_ctx=2048,       # Context window size
)
output = llm(
    "<|system|>You are a helpful digital assistant.<|end|><|user|>Name the planets in the solar system.<|end|><|assistant|>",  # Prompt
    max_tokens=256,   # Generate up to 256 tokens; set to None to generate up to the end of the context window
    echo=True,        # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output)
And this is the output of that snippet:
llama_model_loader: loaded meta data with 36 key-value pairs and 197 tensors from C:/Users/dnoliver/Downloads/Phi-3.5-mini-instruct.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Phi 3.5 Mini Instruct
llama_model_loader: - kv 3: general.finetune str = instruct
llama_model_loader: - kv 4: general.basename str = Phi-3.5
llama_model_loader: - kv 5: general.size_label str = mini
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv 8: general.tags arr[str,3] = ["nlp", "code", "text-generation"]
llama_model_loader: - kv 9: general.languages arr[str,1] = ["multilingual"]
llama_model_loader: - kv 10: phi3.context_length u32 = 131072
llama_model_loader: - kv 11: phi3.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 12: phi3.embedding_length u32 = 3072
llama_model_loader: - kv 13: phi3.feed_forward_length u32 = 8192
llama_model_loader: - kv 14: phi3.block_count u32 = 32
llama_model_loader: - kv 15: phi3.attention.head_count u32 = 32
llama_model_loader: - kv 16: phi3.attention.head_count_kv u32 = 32
llama_model_loader: - kv 17: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: phi3.rope.dimension_count u32 = 96
llama_model_loader: - kv 19: phi3.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 20: general.file_type u32 = 2
llama_model_loader: - kv 21: phi3.attention.sliding_window u32 = 262144
llama_model_loader: - kv 22: phi3.rope.scaling.attn_factor f32 = 1.190238
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {% for message in messages %}{% if me...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 67 tensors
llama_model_loader: - type q4_0: 129 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: control token: 0 '<unk>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: control token: 32010 '<|user|>' is not marked as EOG
llm_load_vocab: control token: 32006 '<|system|>' is not marked as EOG
llm_load_vocab: control token: 32008 '<|placeholder5|>' is not marked as EOG
llm_load_vocab: control token: 32009 '<|placeholder6|>' is not marked as EOG
llm_load_vocab: control token: 32003 '<|placeholder2|>' is not marked as EOG
llm_load_vocab: control token: 32005 '<|placeholder4|>' is not marked as EOG
llm_load_vocab: control token: 32004 '<|placeholder3|>' is not marked as EOG
llm_load_vocab: control token: 32002 '<|placeholder1|>' is not marked as EOG
llm_load_vocab: control token: 32001 '<|assistant|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.1685 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 262144
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 2.03 GiB (4.55 BPW)
llm_load_print_meta: general.name = Phi 3.5 Mini Instruct
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 32000 '<|endoftext|>'
llm_load_print_meta: EOG token = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors: CPU_Mapped model buffer size = 2074.66 MiB
....................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 168.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'general.name': 'Phi 3.5 Mini Instruct', 'general.architecture': 'phi3', 'general.type': 'model', 'general.basename': 'Phi-3.5', 'general.finetune': 'instruct', 'general.size_label': 'mini', 'general.license': 'mit', 'general.license.link': 'https://huggingface.co/microsoft/Phi-3.5-mini-instruct/resolve/main/LICENSE', 'phi3.attention.head_count_kv': '32', 'phi3.context_length': '131072', 'phi3.rope.scaling.original_context_length': '4096', 'phi3.embedding_length': '3072', 'tokenizer.ggml.model': 'llama', 'phi3.feed_forward_length': '8192', 'phi3.block_count': '32', 'phi3.attention.head_count': '32', 'phi3.attention.layer_norm_rms_epsilon': '0.000010', 'phi3.rope.dimension_count': '96', 'tokenizer.chat_template': "{% for message in messages %}{% if message['role'] == 'system' and message['content'] %}{{'<|system|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'user' %}{{'<|user|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>\n' + message['content'] + '<|end|>\n'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% else %}{{ eos_token }}{% endif %}", 'phi3.rope.freq_base': '10000.000000', 'tokenizer.ggml.eos_token_id': '32000', 'general.file_type': '2', 'tokenizer.ggml.add_eos_token': 'false', 'phi3.attention.sliding_window': '262144', 'phi3.rope.scaling.attn_factor': '1.190238', 'tokenizer.ggml.pre': 'default', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '32000', 'tokenizer.ggml.add_bos_token': 'false'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}{% if message['role'] == 'system' and message['content'] %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
Using chat eos_token: <|endoftext|>
Using chat bos_token: <s>
llama_perf_context_print: load time = 4044.24 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 21 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 211 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 60743.25 ms / 232 tokens
{'id': 'cmpl-9fb1ec80-f028-42e5-9633-6a1d030a757a', 'object': 'text_completion', 'created': 1732060442, 'model': 'C:/Users/dnoliver/Downloads/Phi-3.5-mini-instruct.Q4_0.gguf', 'choices': [{'text': '<|system|>You are a helpful digital assistant.<|end|><|user|>Name the planets in the solar system.<|end|><|assistant|> Here are the eight planets in our solar system, listed in order from the sun:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune\n\nIn addition, Pluto used to be considered the ninth planet, but it was reclassified as a "dwarf planet" by the International Astronomical Union in 2 extraterrestrial bodies that orbit the sun:\n\n9. Eris (and its moon Dactyl)\n10. Haumea\n11. Makemake\n12. Ceres\n\nThese are located in a region beyond Neptune called the Kuiper Belt, where many other dwarf planets and small objects can be found. Some astronomers also consider the hypothetical Planet Nine, which is proposed to exist beyond Neptune, but it has yet to be conclusively observed.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 21, 'completion_tokens': 211, 'total_tokens': 232}}
The response is quite interesting :), but the perf information is reported as inf:
llama_perf_context_print: load time = 4044.24 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 21 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 211 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 60743.25 ms / 232 tokens
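As a workaround for the inf readings, you can measure throughput from Python yourself. A rough sketch, assuming llm is the Llama instance from the snippet above; the token counts come from the usage field of the returned completion dict:
import time

start = time.perf_counter()
output = llm(
    "<|system|>You are a helpful digital assistant.<|end|><|user|>Name the planets in the solar system.<|end|><|assistant|>",
    max_tokens=256,
)
elapsed = time.perf_counter() - start

usage = output["usage"]  # prompt_tokens, completion_tokens, total_tokens
print(f"{usage['total_tokens']} tokens in {elapsed:.2f} s "
      f"({usage['completion_tokens'] / elapsed:.2f} generated tokens/s)")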
Hi @dnoliver,
I tried to compile with -DLLAVA_BUILD=OFF, but somehow the build ended up without SYCL support.
After some more trial and error, I managed to get a working wheel with SYCL support using the following commands in the Intel oneAPI command prompt:
set CMAKE_GENERATOR=Ninja
set CMAKE_ARGS=-DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
"C:\Python311\python.exe" -m build --wheel
from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\Users\ArabTech\Desktop\4\phi-3.5-mini-instruct-q4_k_m.gguf",
    n_gpu_layers=-1,
    verbose=True,
)
output = llm(
    "Q: Who is Napoleon Bonaparte A: ",
    max_tokens=1024,
    stop=["\n"],  # Add a stop sequence to end generation at a newline
)
print(output)
Neither n_gpu_layers=-1 nor n_gpu_layers=32 works on the Intel iGPU.
How do I offload the model onto the Intel iGPU?