LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Context Length Issue #788

Closed Alihkhawaher closed 2 months ago

Alihkhawaher commented 2 months ago

Dears,

I tried a few Mistral models with a 32k context, but once I go past 8k, koboldcpp starts returning gibberish. At first I thought it was an issue with the model, but then I tried LM Studio and easily reached 11k without the same issue.

I tried disabling/enabling mmq and contextshift, but the issue is still the same.

Maybe the following log could help:

***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=17, chatcompletionsadapter=None, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=200, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj='C:\\AI\\Text\\koboldcpp\\mistral-7b-mmproj-v1.5-Q4_1.gguf', model=None, model_param='C:/AI/Weights/ML Models (LM Studio)/macadeliccc/laser-dolphin-mixtral-2x7b-dpo-GGUF/laser-dolphin-mixtral-2x7b-dpo.q6_k.gguf', multiuser=0, noavx2=False, noblas=False, nocertify=False, nommap=False, noshift=False, onready='', password=None, port=5001, port_param=5000, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=17, useclblast=None, usecublas=['normal', 'mmq'], usemlock=True, usevulkan=None)
==========
Loading model: C:\AI\Weights\ML Models (LM Studio)\macadeliccc\laser-dolphin-mixtral-2x7b-dpo-GGUF\laser-dolphin-mixtral-2x7b-dpo.q6_k.gguf
[Threads: 17, BlasThreads: 17, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_loader: loaded meta data with 25 key-value pairs and 419 tensors from C:\AI\Weights\ML Models (LM Studio)\macadeliccc\laser-dolphin-mixtral-2x7b-dpo-GGUF\laser-dolphin-mixtral-2x7b-dpo.q6_k.gguf
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 2
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 12.88 B
llm_load_print_meta: model size       = 9.84 GiB (6.56 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 1 '<s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: no
  Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.64 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   102.54 MiB
llm_load_tensors:      CUDA0 buffer size =  7712.11 MiB
llm_load_tensors:      CUDA1 buffer size =  2261.95 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 32864
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3209.38 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   898.62 MiB
llama_new_context_with_model: KV self size  = 4108.00 MiB, K (f16): 2054.00 MiB, V (f16): 2054.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =  2394.77 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2394.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   264.77 MiB
llama_new_context_with_model: graph nodes  = 1638
llama_new_context_with_model: graph splits = 3

Attempting to apply Multimodal Projector: C:\AI\Text\koboldcpp\mistral-7b-mmproj-v1.5-Q4_1.gguf
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         19
clip_model_load: ftype:        q4_1

clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from C:\AI\Text\koboldcpp\mistral-7b-mmproj-v1.5-Q4_1.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 3
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  17:                              clip.use_gelu bool             = false
clip_model_load: - kv  18:               general.quantization_version u32              = 2
clip_model_load: - type  f32:  235 tensors
clip_model_load: - type  f16:    1 tensors
clip_model_load: - type q4_1:  141 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     187.71 MB
clip_model_load: metadata size:  0.17 MB
clip_model_load: params backend buffer size =  187.71 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5000 at http://localhost:5000/api/
Starting OpenAI Compatible API on port 5000 at http://localhost:5000/v1/
======
Please connect to custom endpoint at http://localhost:5000
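
(As a side note, the KV cache figures in that log are consistent with the requested 32k context. A quick sanity check using only numbers reported above; this is an illustrative sketch, not koboldcpp code:)

# Back-of-the-envelope check of the KV cache size reported in the log.
# Values taken from the log: n_layer=32, n_embd_k_gqa=n_embd_v_gqa=1024,
# n_ctx=32864, f16 cache (2 bytes per element).
n_layer, n_kv_embd, n_ctx, bytes_per_elem = 32, 1024, 32864, 2

k_bytes = n_layer * n_kv_embd * n_ctx * bytes_per_elem
v_bytes = k_bytes  # V has the same shape as K here

print(f"K:   {k_bytes / 2**20:.2f} MiB")              # 2054.00 MiB, matching the log
print(f"K+V: {(k_bytes + v_bytes) / 2**20:.2f} MiB")  # 4108.00 MiB, matching the log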
aleksusklim commented 2 months ago

Is this version 1.62.2? Have you tried 1.61.2?

I played with Mixtrals up to 64k several times without any problems. …Oh wait, you said Mistral, not Mixtral. Hmm, is it a RoPE issue? Anyway, try the older version!

I see you use mmproj. Will the problem persist without it?

Alihkhawaher commented 2 months ago

Thanks. I tried both versions and got the same issue; also, even LM Studio has now stopped working, so I am not sure how it worked previously. The issue appears mostly with MoE models; the ones I am using are 2x7b. I suspected my P40 might have an issue, so I tested the VRAM with OCCT and found no errors. I will update you if I have more information.

Removing mmproj does not change the result.

LostRuins commented 2 months ago

If LM Studio is also not working, then that points to the model being the culprit. Try changing your RoPE scaling settings: how about --ropeconfig 1 32000, and see if that works for you. Or try --ropeconfig 0.5 10000.
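
(For context on what those two suggestions do: --ropeconfig takes the frequency scale followed by the frequency base; the log above shows the automatic defaults as scale 1.000, base 10000.0. Below is a minimal sketch of the usual linear RoPE-scaling formula, with head_dim taken from n_rot in the log; the position and dimension index are illustrative only, not part of this thread.)

# Illustrative sketch (not koboldcpp code): how the two suggested --ropeconfig
# values change the rotary-embedding angles, assuming linear scaling:
#   angle(pos, i) = pos * freq_scale * freq_base**(-2*i / head_dim)
def rope_angle(pos, i, freq_scale=1.0, freq_base=10000.0, head_dim=128):
    """Rotation angle for token position `pos` and dimension pair `i`."""
    return pos * freq_scale * freq_base ** (-2.0 * i / head_dim)

pos, i = 16384, 32  # a token well past the 8k mark, an arbitrary dimension pair

print(rope_angle(pos, i))                     # defaults: scale 1.0, base 10000
print(rope_angle(pos, i, freq_base=32000.0))  # --ropeconfig 1 32000 (larger base, longer wavelengths)
print(rope_angle(pos, i, freq_scale=0.5))     # --ropeconfig 0.5 10000 (positions compressed by half)

(Either a smaller scale or a larger base shrinks the angles at large positions, which is generally why these settings can keep a model coherent past the window where it would otherwise break down.)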

Alihkhawaher commented 2 months ago

Great, I tried your custom ropeconfig, 1 32000, and I was able to go beyond 8k; I reached 11k without an issue.

I needed to fix this because macadeliccc/laser-dolphin-mixtral-2x7b-dpo-GGUF has been the smartest 7b model I have tried.

Thanks,