LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
GNU Affero General Public License v3.0

A model quantized with the latest version of llama.cpp cannot be used in koboldcpp. The Q4_0 model is identified as IQ3_XXS - 3.0625 bpw. #758

Closed: win10ogod closed this issue 3 months ago

win10ogod commented 6 months ago

A model quantized with the latest version of llama.cpp cannot be used in koboldcpp. The Q4_0 model is identified as IQ3_XXS - 3.0625 bpw. The error differs depending on whether mmap is turned off. With mmap turned off:

Welcome to KoboldCpp - Version 1.61.2
For command line arguments, please refer to --help
Attempting to library without OpenBLAS.
Initializing dynamic library: koboldcpp_default.dll
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=5, config=None, contextsize=8192, debugmode=0, forceversion=0, foreground=False, gpulayers=14, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='D:/mergekit-main/BigQwen1.5-20B-V3/BigLiberated-20B-V2-Q4-0.gguf', multiuser=1, noavx2=False, noblas=True, nocertify=False, nommap=True, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=None)
Loading model: D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberated-20B-V2-Q4-0.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

Identified as GGUF model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_loader: loaded meta data with 21 key-value pairs and 711 tensors from D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberated-20B-V2-Q4-0.gguf (version GGUF V3 (latest))
llm_load_vocab: SPM vocabulary, but newline token not found: unordered_map::at! Using special_pad_id instead.llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 59
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_XXS - 3.0625 bpw
llm_load_print_meta: model params     = 20.16 B
llm_load_print_meta: model size       = 10.75 GiB (4.58 BPW)
llm_load_print_meta: general.name     = d:\mergekit-main
llm_load_print_meta: BOS token        = 1 '"'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '!'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_tensors: ggml ctx size =    0.32 MiB
llm_load_tensors:        CPU buffer size = 11009.51 MiB
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  9532.19 MiB
llama_new_context_with_model: KV self size  = 9532.19 MiB, K (f16): 4766.09 MiB, V (f16): 4766.09 MiB
llama_new_context_with_model:        CPU  output buffer size =   297.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   702.41 MiB
llama_new_context_with_model: graph splits: 1
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 2048, "max_length": 80, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "memory": "", "genkey": "KCPP3364", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "presence_penalty": 0, "logit_bias": {}, "prompt": "[The following is an interesting chat message log between You and \u8389\u8389.]\n\nYou: Hi.\n\u8389\u8389: Hello.\nYou: hi.\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: HI\nYou: hi.\n\u8389\u8389: hi\nYou: hi\nYou: hi\nYou: hi\u8389\u8389\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: gghf\nYou: hi\n\u8389\u8389:", "quiet": true, "stop_sequence": ["You:", "\nYou ", "\n\u8389\u8389: "], "use_default_badwordsids": false}

(Note: Sub-optimal sampler_order detected. You may have reduced quality. Recommended sampler values are [6,0,1,3,4,2,5]. This message will only show once per session.)
[WinError -1073741569] Windows Error 0xc00000ff
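
As a sanity check on the memory figures in the log above, the reported KV cache size can be reproduced from the printed hyperparameters. This is only a back-of-the-envelope sketch: it assumes the cache holds one f16 value (2 bytes) per context position, per layer, per K/V embedding element, and the helper name `kv_cache_mib` is made up for illustration. It does not by itself explain the crash.

```python
# Values copied from the log above (run with contextsize=8192).
n_ctx = 8272          # llama_new_context_with_model: n_ctx
n_layer = 59          # llm_load_print_meta: n_layer
n_embd_k_gqa = 5120   # llm_load_print_meta: n_embd_k_gqa
n_embd_v_gqa = 5120   # llm_load_print_meta: n_embd_v_gqa
BYTES_F16 = 2         # assumption: K and V are stored as f16, as the log states
MIB = 1024 * 1024


def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_gqa: int, bytes_per_elem: int) -> float:
    """Size in MiB of one half (K or V) of the cache. Hypothetical helper."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / MIB


k_mib = kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, BYTES_F16)
v_mib = kv_cache_mib(n_ctx, n_layer, n_embd_v_gqa, BYTES_F16)
print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
# prints: K: 4766.09 MiB, V: 4766.09 MiB, total: 9532.19 MiB
```

This matches the `KV self size  = 9532.19 MiB` line. Added to the 11009.51 MiB CPU buffer for the weights, this run needs roughly 20 GiB of RAM before compute buffers are counted.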

With mmap left on, the error is different. The model cannot run with or without BLAS.

Welcome to KoboldCpp - Version 1.61.2
For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=11, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='D:/mergekit-main/BigQwen1.5-20B-V3/BigLiberated-20B-V2-Q4-0.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=True, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None)
Loading model: D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberated-20B-V2-Q4-0.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

Identified as GGUF model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 711 tensors from D:\mergekit-main\BigQwen1.5-20B-V3\BigLiberate\?*jllm_load_vocab: SPM vocabulary, but newline token not found: invalid unordered_map<K, T> key! Using special_pad_id instead.llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 59
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 20.16 B
llm_load_print_meta: model size       = 10.75 GiB (4.58 BPW)
llm_load_print_meta: general.name     = d:\mergekit-main
llm_load_print_meta: BOS token        = 1 '"'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '!'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_tensors: ggml ctx size =    0.63 MiB
llm_load_tensors: offloading 11 repeating layers to GPU
llm_load_tensors: offloaded 11/60 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  9148.32 MiB
llm_load_tensors:      CUDA0 buffer size =  1861.19 MiB
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 16464
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size = 15435.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  3537.19 MiB
llama_new_context_with_model: KV self size  = 18972.19 MiB, K (f16): 9486.09 MiB, V (f16): 9486.09 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =   297.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1358.41 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =  1368.41 MiB
llama_new_context_with_model: graph splits: 3
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 2048, "max_length": 80, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "memory": "", "genkey": "KCPP8525", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "presence_penalty": 0, "logit_bias": {}, "prompt": "[The following is an interesting chat message log between You and \u8389\u8389.]\n\nYou: Hi.\n\u8389\u8389: Hello.\nYou: hi.\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: HI\nYou: hi.\n\u8389\u8389: hi\nYou: hi\nYou: hi\nYou: hi\u8389\u8389\nYou: hi\nYou: hi\nYou: hi\nYou: hi\nYou: gghf\n\u8389\u8389:", "quiet": true, "stop_sequence": ["You:", "\nYou ", "\n\u8389\u8389: "], "use_default_badwordsids": false}

(Note: Sub-optimal sampler_order detected. You may have reduced quality. Recommended sampler values are [6,0,1,3,4,2,5]. This message will only show once per session.)
[WinError -529697949] Windows Error 0xe06d7363
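
The two `WinError` numbers in the logs above are just the signed 32-bit view of the hex codes printed next to them. A small sketch of the conversion (the function name `to_unsigned_hex` is made up for illustration):

```python
def to_unsigned_hex(winerror: int) -> str:
    """Reinterpret a signed 32-bit error number as an unsigned hex code."""
    return f"0x{winerror & 0xFFFFFFFF:08x}"


# Error numbers copied from the two logs above.
for code in (-1073741569, -529697949):
    print(code, "->", to_unsigned_hex(code))
# prints:
# -1073741569 -> 0xc00000ff
# -529697949 -> 0xe06d7363

# The low three bytes of 0xe06d7363 are the ASCII characters "msc".
tag = bytes.fromhex("6d7363").decode("ascii")
print(tag)  # prints: msc
```

0xe06d7363 is the exception code the Microsoft Visual C++ runtime uses for a thrown C++ exception, which suggests an uncaught exception inside the native library rather than a plain access violation. I believe 0xc00000ff corresponds to `STATUS_BAD_FUNCTION_TABLE`, but that is worth verifying against `ntstatus.h`.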

win10ogod commented 6 months ago

@LostRuins Can you please update llama.cpp? The strange thing is that the model isn't detected correctly even with mmap on: the ftype can't even be recognized.
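
To see which ftype the file itself declares, independent of what the loader prints, the GGUF metadata can be read directly. The following is a minimal sketch, not koboldcpp's or llama.cpp's loader: it assumes a little-endian GGUF v2/v3 header (64-bit counts), and all function names are made up for illustration. The demo parses a tiny synthetic header built in memory, not the real model file.

```python
import io
import struct

# GGUF metadata value type id -> struct format for fixed-size scalars.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return (version, tensor_count, metadata dict) from an open binary file."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        meta[key] = _read_value(f, _read(f, "<I"))
    return version, tensor_count, meta


# Synthetic demo: a v3 header with zero tensors and one uint32 key.
key = b"general.file_type"
blob = (b"GGUF" + struct.pack("<IQQ", 3, 0, 1)
        + struct.pack("<Q", len(key)) + key + struct.pack("<II", 4, 2))
version, n_tensors, meta = read_gguf_metadata(io.BytesIO(blob))
print(version, n_tensors, meta)  # prints: 3 0 {'general.file_type': 2}
```

On a real file, pass `open(path, "rb")` instead of the `BytesIO` object and look at `meta.get("general.file_type")`. The integer can then be compared with the `llama_ftype` enum in the llama.cpp version that produced the file and in the one bundled with koboldcpp. As far as I know 2 is Q4_0 in that enum, but check `llama.h` to be sure.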

LostRuins commented 5 months ago

Hello, can you please try the latest release and see if it works for you now?

Spacellary commented 3 months ago

We can probably check this one off. Support should be stable for it now.