LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

A model quantized with the latest version of llama.cpp cannot be used in koboldcpp. The Q4_0 model is identified as IQ3_XXS - 3.0625 bpw. #758

Closed: win10ogod closed this issue 3 months ago

win10ogod commented 6 months ago

A model quantized with the latest version of llama.cpp cannot be used in koboldcpp. The Q4_0 model is identified as IQ3_XXS - 3.0625 bpw. After turning mmap off, the error generated is different.
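
For reference, here is a minimal sketch (not part of the original report) for checking what file-type value is actually stored in the GGUF metadata of the affected file. It uses only the Python standard library and assumes the public GGUF v2/v3 key-value layout; in llama.cpp's LLAMA_FTYPE enum, Q4_0 is stored as 2, so a different value here would explain the IQ3_XXS label.

```python
# Minimal GGUF metadata dump (sketch, standard library only).
# Reads the header and key-value section of a GGUF v2/v3 file and prints
# general.file_type, which llama.cpp maps onto its LLAMA_FTYPE enum.
import struct
import sys

def read_str(f):
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8", errors="replace")

def read_value(f, vtype):
    # Fixed-size scalar types per the GGUF spec.
    simple = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
              6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
    if vtype in simple:
        fmt = simple[vtype]
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]
    if vtype == 8:   # string
        return read_str(f)
    if vtype == 9:   # array: element type (uint32), count (uint64), elements
        (etype,) = struct.unpack("<I", f.read(4))
        (count,) = struct.unpack("<Q", f.read(8))
        return [read_value(f, etype) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")

with open(sys.argv[1], "rb") as f:
    assert f.read(4) == b"GGUF", "not a GGUF file"
    (version,) = struct.unpack("<I", f.read(4))
    (n_tensors,) = struct.unpack("<Q", f.read(8))
    (n_kv,) = struct.unpack("<Q", f.read(8))
    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata keys")
    for _ in range(n_kv):
        key = read_str(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        value = read_value(f, vtype)   # always read, to stay aligned in the stream
        if key in ("general.architecture", "general.name", "general.file_type"):
            print(f"{key} = {value}")
```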

***
Welcome to KoboldCpp - Version 1.61.2
For command line arguments, please refer to --help
***
Attempting to library without OpenBLAS.
Initializing dynamic library: koboldcpp_default.dll
==========
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=5, config=None, contextsize=8192, debugmode=0, forceversion=0, foreground=False, gpulayers=14, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='D:/mergekit-main/BigQwen1.5-20B-V3/BigLiberated-20B-V2-Q4-0.gguf', multiuser=1, noavx2=False, noblas=True, nocertify=False, nommap=True, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=None)
==========
Loading model: D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberated-20B-V2-Q4-0.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_loader: loaded meta data with 21 key-value pairs and 711 tensors from D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberated-20B-V2-Q4-0.gguf (version GGUF V3 (latest))
llm_load_vocab: SPM vocabulary, but newline token not found: unordered_map::at! Using special_pad_id instead.
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 59
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_XXS - 3.0625 bpw
llm_load_print_meta: model params     = 20.16 B
llm_load_print_meta: model size       = 10.75 GiB (4.58 BPW)
llm_load_print_meta: general.name     = d:\mergekit-main
llm_load_print_meta: BOS token        = 1 '"'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '!'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_tensors: ggml ctx size =    0.32 MiB
llm_load_tensors:        CPU buffer size = 11009.51 MiB
.............................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  9532.19 MiB
llama_new_context_with_model: KV self size  = 9532.19 MiB, K (f16): 4766.09 MiB, V (f16): 4766.09 MiB
llama_new_context_with_model:        CPU  output buffer size =   297.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   702.41 MiB
llama_new_context_with_model: graph splits: 1
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 2048, "max_length": 80, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "memory": "", "genkey": "KCPP3364", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "presence_penalty": 0, "logit_bias": {}, "prompt": "[The following is an interesting chat message log between You and \u8389\u8389.]\n\nYou: Hi.\n\u8389\u8389: Hello.\nYou: hi.\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: HI\nYou: hi.\n\u8389\u8389: hi\nYou: hi\nYou: hi\nYou: hi\u8389\u8389\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: gghf\nYou: hi\n\u8389\u8389:", "quiet": true, "stop_sequence": ["You:", "\nYou ", "\n\u8389\u8389: "], "use_default_badwordsids": false}

(Note: Sub-optimal sampler_order detected. You may have reduced quality. Recommended sampler values are [6,0,1,3,4,2,5]. This message will only show once per session.)
[WinError -1073741569] Windows Error 0xc00000ff

If mmap is not turned off, a different error is generated. The model cannot run either with or without BLAS.

***
Welcome to KoboldCpp - Version 1.61.2
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=11, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='D:/mergekit-main/BigQwen1.5-20B-V3/BigLiberated-20B-V2-Q4-0.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=True, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None)
==========
Loading model: D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberated-20B-V2-Q4-0.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 711 tensors from D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberate\?*j
llm_load_vocab: SPM vocabulary, but newline token not found: invalid unordered_map<K, T> key! Using special_pad_id instead.
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 59
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 20.16 B
llm_load_print_meta: model size       = 10.75 GiB (4.58 BPW)
llm_load_print_meta: general.name     = d:\mergekit-main
llm_load_print_meta: BOS token        = 1 '"'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '!'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_tensors: ggml ctx size =    0.63 MiB
llm_load_tensors: offloading 11 repeating layers to GPU
llm_load_tensors: offloaded 11/60 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  9148.32 MiB
llm_load_tensors:      CUDA0 buffer size =  1861.19 MiB
.............................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 16464
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size = 15435.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  3537.19 MiB
llama_new_context_with_model: KV self size  = 18972.19 MiB, K (f16): 9486.09 MiB, V (f16): 9486.09 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =   297.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1358.41 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =  1368.41 MiB
llama_new_context_with_model: graph splits: 3
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 2048, "max_length": 80, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "memory": "", "genkey": "KCPP8525", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "presence_penalty": 0, "logit_bias": {}, "prompt": "[The following is an interesting chat message log between You and \u8389\u8389.]\n\nYou: Hi.\n\u8389\u8389: Hello.\nYou: hi.\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: HI\nYou: hi.\n\u8389\u8389: hi\nYou: hi\nYou: hi\nYou: hi\u8389\u8389\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: gghf\n\u8389\u8389:", "quiet": true, "stop_sequence": ["You:", "\nYou ", "\n\u8389\u8389: "], "use_default_badwordsids": false}

(Note: Sub-optimal sampler_order detected. You may have reduced quality. Recommended sampler values are [6,0,1,3,4,2,5]. This message will only show once per session.)
[WinError -529697949] Windows Error 0xe06d7363
win10ogod commented 6 months ago

@LostRuins Can you please update llama.cpp? The strange thing is that toggling mmap makes no difference to detection, and the ftype can't even be recognized.
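
As an aside, the declared general.file_type is only a summary label; what llama.cpp actually dequantizes are the per-tensor ggml types, so counting those is a quick way to confirm the file really is Q4_0 despite the mislabeled ftype. A sketch is below, assuming the gguf Python package that ships with llama.cpp (pip install gguf); attribute names may differ between package versions.

```python
# Sketch (not from the thread): cross-check the declared ftype against the
# per-tensor quantization types, which are what llama.cpp actually loads.
# Assumes the `gguf` Python package bundled with llama.cpp (`pip install gguf`).
from collections import Counter
import sys

from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])

# Count how many tensors use each ggml quantization type.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype:10s} {n} tensors")

# A genuine Q4_0 file should be dominated by Q4_0 tensors (plus F32 norms and
# possibly a higher-precision output tensor), regardless of what
# general.file_type claims.
```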

LostRuins commented 5 months ago

Hello, can you please try the latest release and see if it works for you now?

Spacellary commented 3 months ago

We can probably check this one off. Support should be stable for it now.