abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

igpu #1709

Open ayttop opened 3 months ago

ayttop commented 3 months ago

from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\Users\ArabTech\Desktop\4\phi-3.5-mini-instruct-q4_k_m.gguf",  # raw string so the backslashes are not treated as escapes
    n_gpu_layers=-1,
    verbose=True,
)
output = llm(
    "Q: Who is Napoleon Bonaparte? A: ",
    max_tokens=1024,
    stop=["\n"],  # Add a stop sequence to end generation at a newline
)
print(output)

I have tried both n_gpu_layers=-1 and n_gpu_layers=32.

Neither works on my Intel iGPU.

How do I offload the model to an Intel iGPU?

abetlen commented 3 months ago

@ayttop maybe someone else knows better, but for integrated graphics, compiling for the Vulkan backend may be your only option, though it may not be faster than a CPU installation.
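
Whichever backend you end up building, a quick way to check from Python whether the installed wheel can offload layers to a GPU at all is the low-level llama_supports_gpu_offload binding; a minimal sketch, assuming a recent llama-cpp-python that exposes it:

import llama_cpp

# Minimal check: was this build of llama-cpp-python compiled with a GPU
# backend (Vulkan, SYCL, CUDA, ...) that can offload layers at all?
print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())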

dnoliver commented 1 week ago

Hi @abetlen!

I am experimenting with llama.cpp and llama-cpp-python on my Windows laptop with an Intel iGPU, and I'm running into a build issue that I can't sort out yet.

I can get llama.cpp built for the Intel iGPU and working with it, but when I try llama-cpp-python, it (I think) recompiles llama.cpp for CPU by default, and if I try to make it compile for the GPU, I get a CMake error.

For context:

  1. I got all my dependencies sorted out for CPU following: https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md
  2. I got all my dependencies sorted out for GPU following: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md#windows
  3. I tried to compile a GPU-enabled llama-cpp-python following the Windows SYCL instructions at https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends, updating the CMake arguments to match the content of https://github.com/ggerganov/llama.cpp/blob/master/examples/sycl/win-build-sycl.bat

When I do:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
set CMAKE_ARGS="-DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON"
pip install -e . --verbose

I get the following error when the SYCL test is executed:

 Using pip 24.3.1 from C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\Lib\site-packages\pip (python 3.11)
Obtaining file:///C:/Users/dnoliver/GitHub/dnoliver/llama-cpp-python
  Running command pip subprocess to install build dependencies
  Using pip 24.3.1 from C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\Lib\site-packages\pip (python 3.11)
  Collecting scikit-build-core>=0.9.2 (from scikit-build-core[pyproject]>=0.9.2)
    Obtaining dependency information for scikit-build-core>=0.9.2 from https://files.pythonhosted.org/packages/88/fe/90476c4f6a1b2f922efa00d26e876dd40c7279e28ec18f08f0851ad21ba6/scikit_build_core-0.10.7-py3-none-any.whl.metadata
    Using cached scikit_build_core-0.10.7-py3-none-any.whl.metadata (21 kB)
  Collecting packaging>=21.3 (from scikit-build-core>=0.9.2->scikit-build-core[pyproject]>=0.9.2)
    Obtaining dependency information for packaging>=21.3 from https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl.metadata
    Using cached packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
  Collecting pathspec>=0.10.1 (from scikit-build-core>=0.9.2->scikit-build-core[pyproject]>=0.9.2)
    Obtaining dependency information for pathspec>=0.10.1 from https://files.pythonhosted.org/packages/cc/20/ff623b09d963f88bfde16306a54e12ee5ea43e9b597108672ff3a408aad6/pathspec-0.12.1-py3-none-any.whl.metadata
    Using cached pathspec-0.12.1-py3-none-any.whl.metadata (21 kB)
  Using cached scikit_build_core-0.10.7-py3-none-any.whl (165 kB)
  Using cached packaging-24.2-py3-none-any.whl (65 kB)
  Using cached pathspec-0.12.1-py3-none-any.whl (31 kB)
  Installing collected packages: pathspec, packaging, scikit-build-core
  Successfully installed packaging-24.2 pathspec-0.12.1 scikit-build-core-0.10.7
  Installing build dependencies ... done
  Running command Checking if build backend supports build_editable
  Checking if build backend supports build_editable ... done
  Running command Getting requirements to build editable
  Getting requirements to build editable ... done
  Running command Preparing editable metadata (pyproject.toml)
  *** scikit-build-core 0.10.7 using CMake 3.31.0 (metadata_editable)
  Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (4.12.2)
Requirement already satisfied: numpy>=1.20.0 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (2.1.3)
Requirement already satisfied: diskcache>=5.6.1 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (5.6.3)
Requirement already satisfied: jinja2>=2.11.3 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from llama_cpp_python==0.3.1) (3.1.4)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\dnoliver\appdata\local\miniconda3\envs\poc\lib\site-packages (from jinja2>=2.11.3->llama_cpp_python==0.3.1) (3.0.2)
Building wheels for collected packages: llama_cpp_python
  Running command Building editable for llama_cpp_python (pyproject.toml)
  *** scikit-build-core 0.10.7 using CMake 3.31.0 (editable)
  *** Configuring CMake...
  2024-11-18 14:30:52,414 - scikit_build_core - WARNING - Can't find a Python library, got libdir=None, ldlibrary=None, multiarch=None, masd=None
  loading initial cache file C:\Users\dnoliver\AppData\Local\Temp\tmpe4am77ao\build\CMakeInit.txt
  -- Building for: Visual Studio 17 2022
  -- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
  -- The C compiler identification is MSVC 19.42.34433.0
  -- The CXX compiler identification is MSVC 19.42.34433.0
  -- Detecting C compiler ABI info
  -- Detecting C compiler ABI info - done
  -- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe - skipped
  -- Detecting C compile features
  -- Detecting C compile features - done
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.0.windows.2")
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
  -- Looking for pthread_create in pthreads
  -- Looking for pthread_create in pthreads - not found
  -- Looking for pthread_create in pthread
  -- Looking for pthread_create in pthread - not found
  -- Found Threads: TRUE
  -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
  -- CMAKE_SYSTEM_PROCESSOR: AMD64
  -- CMAKE_GENERATOR_PLATFORM: x64
  -- Found OpenMP_C: -openmp (found version "2.0")
  -- Found OpenMP_CXX: -openmp (found version "2.0")
  -- Found OpenMP: TRUE (found version "2.0")
  -- OpenMP found
  -- Using llamafile
  -- x86 detected
  -- Performing Test HAS_AVX_1
  -- Performing Test HAS_AVX_1 - Success
  -- Performing Test HAS_AVX2_1
  -- Performing Test HAS_AVX2_1 - Success
  -- Performing Test HAS_FMA_1
  -- Performing Test HAS_FMA_1 - Success
  -- Performing Test HAS_AVX512_1
  -- Performing Test HAS_AVX512_1 - Failed
  -- Performing Test HAS_AVX512_2
  -- Performing Test HAS_AVX512_2 - Failed
  -- Including CPU backend
  -- Including CPU backend
  -- Using AMX
  -- Including AMX backend
  -- Performing Test SUPPORTS_SYCL
  -- Performing Test SUPPORTS_SYCL - Failed
  -- Using oneAPI Release SYCL compiler (icpx).
  -- SYCL found
  -- DNNL found:1
  CMake Error at vendor/llama.cpp/ggml/src/ggml-sycl/CMakeLists.txt:65 (find_package):
    Found package configuration file:

      C:/Program Files (x86)/Intel/oneAPI/compiler/latest/lib/cmake/IntelSYCL/IntelSYCLConfig.cmake

    but it set IntelSYCL_FOUND to FALSE so package "IntelSYCL" is considered to
    be NOT FOUND.  Reason given by package:

    Unsupported compiler family MSVC and compiler C:/Program Files/Microsoft
    Visual
    Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe!!

  -- Configuring incomplete, errors occurred!

  *** CMake configuration failed
  error: subprocess-exited-with-error

  × Building editable for llama_cpp_python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: 'C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\python.exe' 'C:\Users\dnoliver\AppData\Local\miniconda3\envs\poc\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py' build_editable 'C:\Users\dnoliver\AppData\Local\Temp\tmp6xaj06kh'
  cwd: C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python
  Building editable for llama_cpp_python (pyproject.toml) ... error
  ERROR: Failed building editable for llama_cpp_python
Failed to build llama_cpp_python
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama_cpp_python)

I have run what I think is the equivalent command directly in llama.cpp, and all the CMake tests pass there (same terminal, so the same environment variables). For the command

cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON

I get this successful output:

-- The C compiler identification is MSVC 19.42.34433.0
-- The CXX compiler identification is IntelLLVM 2025.0.0 with MSVC-like command-line
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.42.34433/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Intel/oneAPI/compiler/latest/bin/icx.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.0.windows.2")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -Qiopenmp (found version "5.1")
-- Found OpenMP: TRUE (found version "2.0")
-- OpenMP found
-- Using llamafile
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
-- Including CPU backend
CMake Warning at ggml/src/ggml-amx/CMakeLists.txt:106 (message):
  AMX requires x86 and gcc version > 11.0.  Turning off GGML_AMX.

-- Performing Test SUPPORTS_SYCL
-- Performing Test SUPPORTS_SYCL - Success
-- Using oneAPI Release SYCL compiler (icpx).
-- SYCL found
-- DNNL found:1
-- Found IntelSYCL: C:/Program Files (x86)/Intel/oneAPI/compiler/latest/include (found version "202001")
-- MKL_VERSION: 2025.0.0
-- MKL_ROOT: C:/Program Files (x86)/Intel/oneAPI/mkl/latest
-- MKL_ARCH: intel64
-- MKL_SYCL_LINK: None, set to ` dynamic` by default
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_SYCL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_SYCL_THREADING: None, set to ` tbb_thread` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_scalapack_ilp64_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_scalapack_ilp64.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_cdft_core_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_cdft_core.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_intel_ilp64_dll.lib
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_intel_thread_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_intel_thread.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_core_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_core.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_blacs_ilp64_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_blacs_ilp64.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_blas_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_blas.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_lapack_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_lapack.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_dft_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_dft.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_sparse_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_sparse.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_data_fitting_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_data_fitting.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_rng_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_rng.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_stats_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_stats.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_vm_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_vm.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_tbb_thread_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_tbb_thread.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/compiler/latest/lib/libiomp5md.lib
-- Including SYCL backend
-- Configuring done (37.1s)
-- Generating done (1.0s)
-- Build files have been written to: C:/Users/dnoliver/GitHub/dnoliver/llama-cpp-python/vendor/llama.cpp/build

Can you help me sort out this build problem please?

dnoliver commented 1 week ago

Found the problem. It is caused by LLAVA_BUILD=ON. Disabling it makes it work.

To summarize:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
set CMAKE_ARGS="-DLLAVA_BUILD=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON"
pip install -e . --verbose

gets you a version that works with the iGPU.

Then, with the following code, the model uses the iGPU to produce a completion:

from llama_cpp import Llama

llm = Llama(
      model_path="C:/Users/dnoliver/Downloads/Phi-3.5-mini-instruct.Q4_0.gguf",
      n_gpu_layers=-1, # Offload all layers to the GPU
      seed=1337, # Set a specific seed for reproducibility
      n_ctx=2048, # Increase the context window to 2048 tokens
)
output = llm(
      "<|system|>You are a helpful digital assistant.<|end|><|user|>Name the planets in the solar system.<|end|><|assistant|>", # Prompt
      max_tokens=256, # Generate up to 256 tokens, set to None to generate up to the end of the context window
      echo=True # Echo the prompt back in the output
) # Generate a completion; can also call create_completion
print(output)
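
As an aside, since the GGUF ships its own chat template (it shows up in the metadata dump below), the same prompt can also be issued through the chat completion API instead of hand-writing the <|system|>/<|user|>/<|assistant|> markers; a rough sketch, reusing the llm object from above:

# Sketch: let llama-cpp-python apply the model's bundled chat template
# instead of writing the special tokens by hand.
chat_output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful digital assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ],
    max_tokens=256,
)
print(chat_output["choices"][0]["message"]["content"])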

And this is the output of the completion call above:

llama_model_loader: loaded meta data with 36 key-value pairs and 197 tensors from C:/Users/dnoliver/Downloads/Phi-3.5-mini-instruct.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 3.5 Mini Instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = Phi-3.5
llama_model_loader: - kv   5:                         general.size_label str              = mini
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv   8:                               general.tags arr[str,3]       = ["nlp", "code", "text-generation"]
llama_model_loader: - kv   9:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv  10:                        phi3.context_length u32              = 131072
llama_model_loader: - kv  11:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  12:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv  13:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv  14:                           phi3.block_count u32              = 32
llama_model_loader: - kv  15:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv  16:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv  17:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  19:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  20:                          general.file_type u32              = 2
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
llama_model_loader: - kv  22:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: control token:      0 '<unk>' is not marked as EOG
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: control token:  32010 '<|user|>' is not marked as EOG
llm_load_vocab: control token:  32006 '<|system|>' is not marked as EOG
llm_load_vocab: control token:  32008 '<|placeholder5|>' is not marked as EOG
llm_load_vocab: control token:  32009 '<|placeholder6|>' is not marked as EOG
llm_load_vocab: control token:  32003 '<|placeholder2|>' is not marked as EOG
llm_load_vocab: control token:  32005 '<|placeholder4|>' is not marked as EOG
llm_load_vocab: control token:  32004 '<|placeholder3|>' is not marked as EOG
llm_load_vocab: control token:  32002 '<|placeholder1|>' is not marked as EOG
llm_load_vocab: control token:  32001 '<|assistant|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.1685 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 262144
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW)
llm_load_print_meta: general.name     = Phi 3.5 Mini Instruct
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 32000 '<|endoftext|>'
llm_load_print_meta: EOG token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors:   CPU_Mapped model buffer size =  2074.66 MiB
....................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   168.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'general.name': 'Phi 3.5 Mini Instruct', 'general.architecture': 'phi3', 'general.type': 'model', 'general.basename': 'Phi-3.5', 'general.finetune': 'instruct', 'general.size_label': 'mini', 'general.license': 'mit', 'general.license.link': 'https://huggingface.co/microsoft/Phi-3.5-mini-instruct/resolve/main/LICENSE', 'phi3.attention.head_count_kv': '32', 'phi3.context_length': '131072', 'phi3.rope.scaling.original_context_length': '4096', 'phi3.embedding_length': '3072', 'tokenizer.ggml.model': 'llama', 'phi3.feed_forward_length': '8192', 'phi3.block_count': '32', 'phi3.attention.head_count': '32', 'phi3.attention.layer_norm_rms_epsilon': '0.000010', 'phi3.rope.dimension_count': '96', 'tokenizer.chat_template': "{% for message in messages %}{% if message['role'] == 'system' and message['content'] %}{{'<|system|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'user' %}{{'<|user|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>\n' + message['content'] + '<|end|>\n'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% else %}{{ eos_token }}{% endif %}", 'phi3.rope.freq_base': '10000.000000', 'tokenizer.ggml.eos_token_id': '32000', 'general.file_type': '2', 'tokenizer.ggml.add_eos_token': 'false', 'phi3.attention.sliding_window': '262144', 'phi3.rope.scaling.attn_factor': '1.190238', 'tokenizer.ggml.pre': 'default', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '32000', 'tokenizer.ggml.add_bos_token': 'false'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}{% if message['role'] == 'system' and message['content'] %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
Using chat eos_token: <|endoftext|>
Using chat bos_token: <s>
llama_perf_context_print:        load time =    4044.24 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    21 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   211 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   60743.25 ms /   232 tokens
{'id': 'cmpl-9fb1ec80-f028-42e5-9633-6a1d030a757a', 'object': 'text_completion', 'created': 1732060442, 'model': 'C:/Users/dnoliver/Downloads/Phi-3.5-mini-instruct.Q4_0.gguf', 'choices': [{'text': '<|system|>You are a helpful digital assistant.<|end|><|user|>Name the planets in the solar system.<|end|><|assistant|> Here are the eight planets in our solar system, listed in order from the sun:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune\n\nIn addition, Pluto used to be considered the ninth planet, but it was reclassified as a "dwarf planet" by the International Astronomical Union in 2 extraterrestrial bodies that orbit the sun:\n\n9. Eris (and its moon Dactyl)\n10. Haumea\n11. Makemake\n12. Ceres\n\nThese are located in a region beyond Neptune called the Kuiper Belt, where many other dwarf planets and small objects can be found. Some astronomers also consider the hypothetical Planet Nine, which is proposed to exist beyond Neptune, but it has yet to be conclusively observed.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 21, 'completion_tokens': 211, 'total_tokens': 232}}

The response is quite interesting :), but the performance information is reported as inf:

llama_perf_context_print:        load time =    4044.24 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    21 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   211 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   60743.25 ms /   232 tokens
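
Until those counters report something useful, throughput can be estimated from Python with wall-clock timing and the token counts in the returned usage field; a rough sketch, reusing the llm object from above:

import time

# Sketch: estimate tokens/second from wall-clock time and the token counts
# in the "usage" field of the returned completion, as a workaround for the
# 0.00 ms perf counters above.
start = time.perf_counter()
output = llm(
    "<|system|>You are a helpful digital assistant.<|end|><|user|>Name the planets in the solar system.<|end|><|assistant|>",
    max_tokens=256,
)
elapsed = time.perf_counter() - start
total_tokens = output["usage"]["total_tokens"]
print(f"{total_tokens} tokens in {elapsed:.2f} s -> {total_tokens / elapsed:.2f} tokens/s")
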
mschuettlerTNG commented 1 hour ago

Hi @dnoliver, I tried to compile with -DLLAVA_BUILD=OFF, but somehow the build ended up without SYCL support. After some more trial and error, I managed to get a working wheel with SYCL support using the following commands in the Intel oneAPI command prompt:

set CMAKE_GENERATOR=Ninja
set CMAKE_ARGS=-DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
"C:\Python311\python.exe" -m build --wheel