abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Trying to load an LLM model using llama-cpp-python with GPU support fails with an OSError: exception: access violation reading 0x0000000000000000 #1581

Open Sanjit0910 opened 2 months ago

Sanjit0910 commented 2 months ago

Description

When attempting to set up llama-cpp-python for GPU support using the CUDA Toolkit, following the documented steps, initialization of the llama-cpp model fails with an access violation error.

Steps to Reproduce

  1. Install CUDA Toolkit v12.4

  2. Set up environment variables:

set CMAKE_ARGS="-DGGML_CUDA=on"
set FORCE_CMAKE=1

  3. Uninstall and upgrade llama-cpp-python (with numpy==1.26.4 to avoid other dependency issues):

poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python "numpy==1.26.4"

  4. The installation is successful and the GPU is detected, but model loading fails with the error:

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\stormy101\Documents\private-gpt\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
Traceback (most recent call last):

File "C:\Users\stormy101\Documents\private-gpt\private_gpt\components\llm\llm_component.py", line 57, in init self.llm = LlamaCPP( ^^^^^^^^^ File "C:\Users\stormy101\AppData\Local\anaconda3\envs\privategpt\Lib\site-packages\llama_index\llms\llama_cpp\base.py", line 109, in init self._model = Llama(model_path=model_path, **model_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Users\stormy101\AppData\Local\anaconda3\envs\privategpt\Lib\site-packages\llama_cpp\llama.py", line 349, in init self._model = _LlamaModel( ^^^^^^^^^^^^ File "C:\Users\stormy101\AppData\Local\anaconda3\envs\privategpt\Lib\site-packages\llama_cpp_internals.py", line 52, in init self.model = llama_cpp.llama_load_model_from_file( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ OSError: exception: access violation reading 0x0000000000000000

Environment Details:

  1. Python: 3.11.9
  2. CUDA Toolkit Version: CUDA 12.4
  3. OS: Windows 11

I am able to load the model on the CPU. I tried downgrading to 0.2.78, but the error persists. I need your help to resolve the issue. Thank you.
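For anyone reproducing this, a minimal load that takes private-gpt and llama_index out of the picture might look like the following (a sketch: the path is the one from this report, and n_gpu_layers=-1 simply requests full offload):

from llama_cpp import Llama

# The crash reported above happens during construction, so loading alone
# should be enough to reproduce it; verbose=True keeps the llama.cpp log visible.
llm = Llama(
    model_path=r"C:\Users\stormy101\Documents\private-gpt\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers
    verbose=True,
)
print("model loaded")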

Asgir commented 1 month ago

I have the same issue.

Description

When attempting to set up llama-cpp-python for GPU support using the CUDA Toolkit, following the documented steps, initialization of the llama-cpp model fails with an access violation error.

Steps to Reproduce

Install CUDA Toolkit v12.5

Create a conda environment and install llama_cpp:

conda create -n llama_clean
conda activate llama_clean
conda install pip
set CMAKE_ARGS=-DGGML_CUDA=on
set FORCE_CMAKE=1
cd C:\Users\User\anaconda3\envs\llama_clean
Scripts\pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Start Python and try to load the model:

from llama_cpp import Llama
model = Llama(model_path="models/generator/Mistral-7B-Instruct-v0.3.Q6_K.gguf", n_ctx=2048, n_gpu_layers=999, embedding=False)

llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from models/generator/Mistral-7B-Instruct-v0.3.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models--mistralai--Mistral-7B-Instruc...
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 32768
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 18
llama_model_loader: - kv 11: llama.vocab_size u32 = 32768
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32768] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32768] = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: quantize.imatrix.file str = ./imatrix.dat
llama_model_loader: - kv 26: quantize.imatrix.dataset str = group_40.txt
llama_model_loader: - kv 27: quantize.imatrix.entries_count i32 = 224
llama_model_loader: - kv 28: quantize.imatrix.chunks_count i32 = 74
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q6_K: 226 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 7.25 B
llm_load_print_meta: model size = 5.54 GiB (6.56 BPW)
llm_load_print_meta: general.name = models--mistralai--Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: max token length = 48

Exception ignored in: <function Llama.__del__ at 0x00000176EE09DC60>
Traceback (most recent call last):
  File "C:\Users\User\anaconda3\envs\llama_clean\Lib\site-packages\llama_cpp\llama.py", line 2089, in __del__
    if self._lora_adapter is not None:
       ^^^^^^^^^^^^^^^^^^
AttributeError: 'Llama' object has no attribute '_lora_adapter'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\anaconda3\envs\llama_clean\Lib\site-packages\llama_cpp\llama.py", line 372, in __init__
    _LlamaModel(
  File "C:\Users\User\anaconda3\envs\llama_clean\Lib\site-packages\llama_cpp\_internals.py", line 50, in __init__
    self.model = llama_cpp.llama_load_model_from_file(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: exception: access violation reading 0x0000000000000000

Environment Details:

Python: 3.12.4
CUDA Toolkit Version: CUDA 12.5
OS: Windows 10

Additional

When installing without CUDA there is no problem. Using n_gpu_layers=0 but with the CUDA installation does not solve the issue either (see the sketch below). The issue is independent of the model used.
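That last observation can be isolated in one line (a sketch; the model path is a placeholder): a CUDA-enabled build reportedly crashes on load even when nothing is offloaded.

from llama_cpp import Llama

# n_gpu_layers=0 keeps every layer on the CPU; with a CUDA-enabled build
# this reportedly still crashes on load, which points at the compiled DLL
# itself rather than the offload path.
llm = Llama(model_path="model.gguf", n_gpu_layers=0)
print("loaded OK")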

Asgir commented 1 month ago

An update that may help in narrowing this down:

Under Windows 11:

So I guess the problem is either in the Python bindings or in llama.dll, but in principle it should be able to work. Does someone maybe have a minimal Python script, using just the bindings, for loading the model? That would be useful for debugging.
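In the absence of an official minimal loader, something like the following might serve (a sketch, assuming the low-level C-API functions exposed by the llama_cpp module in the 0.2.x releases; the model path is a placeholder):

import llama_cpp

# Initialize the backend, then call the raw C entry point that the
# tracebacks above die in, with no high-level wrapper in between.
llama_cpp.llama_backend_init()
params = llama_cpp.llama_model_default_params()
params.n_gpu_layers = 999  # request full offload, as in the report above
model = llama_cpp.llama_load_model_from_file(b"model.gguf", params)  # bytes path
print("model handle:", model)
llama_cpp.llama_free_model(model)
llama_cpp.llama_backend_free()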

Under WSL:

So under WSL everything works fine. Maybe a (rather cumbersome) workaround if Windows does not work.

Devnant commented 1 month ago

Yep, same here. Same Python version, CUDA version, and OS as the original report.

stduhpf commented 1 month ago

Same issue here, with Windows 10 and Vulkan backend.

kot197 commented 1 month ago

Same here 🤣 I'd been stuck on the install for a week, was so happy when I finally got it done, then got crushed by this error a minute later. Has anyone actually gotten this working on Windows? With GPU, of course 🤣

Windows 10, CUDA 12.4, Visual Studio

wiktorwysockig5 commented 1 month ago

I had the same problem; for me it was caused by the pandas library that I had imported beforehand. For some reason, if you are doing both imports, import llama_cpp first and pandas second, as in the sketch below. Hope it helps!
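For illustration, the import order described above (a sketch; the root cause is not established in this thread, but DLL load order on Windows is a plausible suspect):

import llama_cpp    # first: let the bindings load llama.dll and its CUDA dependencies
import pandas as pd  # second: pandas and its compiled dependencies afterwards

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # placeholder path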

MatKollar commented 1 month ago

Any updates here? I have the same issue with CUDA 12.4 and llama-cpp-python version 0.2.85.

tdiz commented 4 weeks ago

Same situation: CUDA 12.5 and llama-cpp-python version 0.2.85. I downgraded numpy to 1.26.4 for dependency reasons, but it doesn't help. Running llama.cpp directly from cmd works properly, but in Jupyter I get the error:

OSError: exception: access violation reading 0x0000000000000000

UPDATE. The solution for me was based on:

  1. Reinstalling the NVIDIA CUDA Toolkit to a short path (compared to the original one with many special characters and spaces)
  2. Using plain cmd for compiling instead of the Developer Shell or Developer Command Prompt of Visual Studio

Use the x64 compiler by default (change your version of VC in the path string) or use the set command in cmd:

  1. Put the compiler on the path:

set path="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\bin\HostX64\x64";%path%

  2. Downgrade numpy:

pip install --upgrade --force-reinstall numpy==1.26.4

  3. And after that follow the README:

set FORCE_CMAKE=1
set CMAKE_ARGS=-DGGML_CUDA=ON
pip install -e .

Now I see the right status in Jupyter after the model is loaded:

Device 0: NVIDIA GeForce RTX 4080 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 532.31 MiB
llm_load_tensors: CUDA0 buffer size = 7605.34 MiB
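As a quick sanity check after any of these rebuilds, the bindings can report whether the compiled library supports GPU offload at all (a sketch; llama_supports_gpu_offload is the Python binding of the llama.cpp function of the same name):

import llama_cpp

# False means the installed wheel was built without a GPU backend,
# in which case n_gpu_layers has no effect.
print(llama_cpp.llama_supports_gpu_offload())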