ggerganov / llama.cpp

LLM inference in C/C++

Bug: Infinite repetitive text generation with Qwen2.5-Coder-32B GGUF model in llama.cpp server #10312

Closed: e1ijah1 closed this issue 1 hour ago

e1ijah1 commented 3 hours ago

What happened?

Hi, I'm encountering an issue with repetitive/looping text generation when running the Qwen2.5-Coder-32B-Instruct GGUF model on llama.cpp server.

Environment

Issue

The model keeps generating the same text repeatedly in an infinite loop (see attached screenshot). The output just repeats the same sentence about RoundRobinLoadBalancer over and over without stopping.

[screenshot: bug]

Questions

  1. Is this a known issue with GGUF models or specifically with Qwen models?
  2. Are there any configuration parameters I should adjust to prevent this behavior?
  3. Could this be related to the context size (-c) or parallel processing settings?

Any help or guidance would be appreciated!

Name and Version

docker run --rm -p 8090:8080 \
        --ipc=host \
        --privileged \
        --shm-size=16g \
        -v /root/workdir:/workdir \
        -v /data/weights:/models \
        --gpus '"device=2"' \
        --env=NCCL_P2P_DISABLE=1 \
        --env=CUDA_VISIBLE_DEVICES="2" \
        ghcr.io/ggerganov/llama.cpp:server-cuda \
        -m models/Qwen2.5-Coder-32B-Instruct-GGUF/qwen2.5-coder-32b-instruct-q2_k.gguf \
        -a 'Qwen/Qwen2.5-Coder-32B-Instruct' \
        -c 30720 \
        --host 0.0.0.0 \
        --port 8080 \
        -ngl 99 \
        --parallel 64

What operating system are you seeing the problem on?

No response

Relevant log output

No response

ngxson commented 3 hours ago

This may be due to context shifting (but we can't be sure because you didn't post the server log).

Try adding --no-context-shift to disable it.
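
For illustration (only a sketch; everything before the image name stays exactly as in your command), the flag is appended to the server arguments at the end:

        ghcr.io/ggerganov/llama.cpp:server-cuda \
        -m models/Qwen2.5-Coder-32B-Instruct-GGUF/qwen2.5-coder-32b-instruct-q2_k.gguf \
        -a 'Qwen/Qwen2.5-Coder-32B-Instruct' \
        -c 30720 \
        --host 0.0.0.0 \
        --port 8080 \
        -ngl 99 \
        --parallel 64 \
        --no-context-shift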

e1ijah1 commented 2 hours ago

Thanks for the suggestion! Sorry I didn't include the server logs initially. I tried adding the --no-context-shift flag, but now the generation stops midway with an error. Here's the complete server log:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 4077 (af148c93) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 64, n_threads_batch = 64, total_threads = 128

system_info: n_threads = 64 (n_threads_batch = 64) / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 127
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23717 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 771 tensors from models/Qwen2.5-Coder-32B-Instruct-GGUF/qwen2.5-coder-32b-instruct-q2_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 32B Instruct AWQ
llama_model_loader: - kv   3:                           general.finetune str              = Instruct-AWQ
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 10
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 771
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q2_K:  257 tensors
llama_model_loader: - type q3_K:  128 tensors
llama_model_loader: - type q4_K:   64 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 27648
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 11.46 GiB (3.01 BPW)
llm_load_print_meta: general.name     = Qwen2.5 Coder 32B Instruct AWQ
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   243.63 MiB
llm_load_tensors:        CUDA0 model buffer size = 11493.35 MiB
...............................................................................................
llama_new_context_with_model: n_seq_max     = 64
llama_new_context_with_model: n_ctx         = 30720
llama_new_context_with_model: n_ctx_per_seq = 480
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (480) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:      CUDA0 KV buffer size =  7680.00 MiB
llama_new_context_with_model: KV self size  = 7680.00 MiB, K (f16): 3840.00 MiB, V (f16): 3840.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    37.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2500.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    70.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 64
slot         init: id  0 | task -1 | new slot n_ctx_slot = 480
slot         init: id  1 | task -1 | new slot n_ctx_slot = 480
slot         init: id  2 | task -1 | new slot n_ctx_slot = 480
slot         init: id  3 | task -1 | new slot n_ctx_slot = 480
slot         init: id  4 | task -1 | new slot n_ctx_slot = 480
slot         init: id  5 | task -1 | new slot n_ctx_slot = 480
slot         init: id  6 | task -1 | new slot n_ctx_slot = 480
slot         init: id  7 | task -1 | new slot n_ctx_slot = 480
slot         init: id  8 | task -1 | new slot n_ctx_slot = 480
slot         init: id  9 | task -1 | new slot n_ctx_slot = 480
slot         init: id 10 | task -1 | new slot n_ctx_slot = 480
slot         init: id 11 | task -1 | new slot n_ctx_slot = 480
slot         init: id 12 | task -1 | new slot n_ctx_slot = 480
slot         init: id 13 | task -1 | new slot n_ctx_slot = 480
slot         init: id 14 | task -1 | new slot n_ctx_slot = 480
slot         init: id 15 | task -1 | new slot n_ctx_slot = 480
slot         init: id 16 | task -1 | new slot n_ctx_slot = 480
slot         init: id 17 | task -1 | new slot n_ctx_slot = 480
slot         init: id 18 | task -1 | new slot n_ctx_slot = 480
slot         init: id 19 | task -1 | new slot n_ctx_slot = 480
slot         init: id 20 | task -1 | new slot n_ctx_slot = 480
slot         init: id 21 | task -1 | new slot n_ctx_slot = 480
slot         init: id 22 | task -1 | new slot n_ctx_slot = 480
slot         init: id 23 | task -1 | new slot n_ctx_slot = 480
slot         init: id 24 | task -1 | new slot n_ctx_slot = 480
slot         init: id 25 | task -1 | new slot n_ctx_slot = 480
slot         init: id 26 | task -1 | new slot n_ctx_slot = 480
slot         init: id 27 | task -1 | new slot n_ctx_slot = 480
slot         init: id 28 | task -1 | new slot n_ctx_slot = 480
slot         init: id 29 | task -1 | new slot n_ctx_slot = 480
slot         init: id 30 | task -1 | new slot n_ctx_slot = 480
slot         init: id 31 | task -1 | new slot n_ctx_slot = 480
slot         init: id 32 | task -1 | new slot n_ctx_slot = 480
slot         init: id 33 | task -1 | new slot n_ctx_slot = 480
slot         init: id 34 | task -1 | new slot n_ctx_slot = 480
slot         init: id 35 | task -1 | new slot n_ctx_slot = 480
slot         init: id 36 | task -1 | new slot n_ctx_slot = 480
slot         init: id 37 | task -1 | new slot n_ctx_slot = 480
slot         init: id 38 | task -1 | new slot n_ctx_slot = 480
slot         init: id 39 | task -1 | new slot n_ctx_slot = 480
slot         init: id 40 | task -1 | new slot n_ctx_slot = 480
slot         init: id 41 | task -1 | new slot n_ctx_slot = 480
slot         init: id 42 | task -1 | new slot n_ctx_slot = 480
slot         init: id 43 | task -1 | new slot n_ctx_slot = 480
slot         init: id 44 | task -1 | new slot n_ctx_slot = 480
slot         init: id 45 | task -1 | new slot n_ctx_slot = 480
slot         init: id 46 | task -1 | new slot n_ctx_slot = 480
slot         init: id 47 | task -1 | new slot n_ctx_slot = 480
slot         init: id 48 | task -1 | new slot n_ctx_slot = 480
slot         init: id 49 | task -1 | new slot n_ctx_slot = 480
slot         init: id 50 | task -1 | new slot n_ctx_slot = 480
slot         init: id 51 | task -1 | new slot n_ctx_slot = 480
slot         init: id 52 | task -1 | new slot n_ctx_slot = 480
slot         init: id 53 | task -1 | new slot n_ctx_slot = 480
slot         init: id 54 | task -1 | new slot n_ctx_slot = 480
slot         init: id 55 | task -1 | new slot n_ctx_slot = 480
slot         init: id 56 | task -1 | new slot n_ctx_slot = 480
slot         init: id 57 | task -1 | new slot n_ctx_slot = 480
slot         init: id 58 | task -1 | new slot n_ctx_slot = 480
slot         init: id 59 | task -1 | new slot n_ctx_slot = 480
slot         init: id 60 | task -1 | new slot n_ctx_slot = 480
slot         init: id 61 | task -1 | new slot n_ctx_slot = 480
slot         init: id 62 | task -1 | new slot n_ctx_slot = 480
slot         init: id 63 | task -1 | new slot n_ctx_slot = 480
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
request: GET /health 127.0.0.1 200
request: GET /health 127.0.0.1 200
request: GET /health 127.0.0.1 200
request: GET /health 127.0.0.1 200
request: GET /health 127.0.0.1 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 480, n_keep = 0, n_prompt_tokens = 28
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 28, n_tokens = 28, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 28, n_tokens = 28
request: GET /health 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 479, truncated = 0
srv    send_error: task id = 0, error: context shift is disabled
srv  update_slots: no tokens to decode
srv  update_slots: all slots are idle
srv  cancel_tasks: cancel task, id_task = 0
srv  update_slots: all slots are idle
request: POST /v1/chat/completions 117.50.218.103 200
request: GET /health 127.0.0.1 200

Could you help take a look at what might be causing this? The model seems to either get stuck in a loop with context shifting enabled or fail to complete generation when it's disabled.

Really appreciate your time and help on this!

wooooyeahhhh commented 2 hours ago

Probably a lack of context, right? Try increasing the context size if you can (it's only 480 tokens per slot), or try reducing the number of slots.
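
Spelled out with the numbers from the command and the log above, the per-slot budget is:

n_ctx_per_seq = n_ctx / n_parallel
              = 30720 / 64
              = 480 tokens per slot (prompt and response combined)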

ngxson commented 2 hours ago

"Could you help take a look at what might be causing this? The model seems to either get stuck in a loop with context shifting enabled or fail to complete generation when it's disabled."

Your context size (-c) is small, so you should either increase it or decrease the number of slots, as @wooooyeahhhh explained.

The model generates repeated responses with context shift enabled because context shifting discards old tokens once the context is full. This makes the model forget what it said and repeat the same phrase.
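
As an illustrative sketch (the value 8 below is only an example, not a tested recommendation), keeping -c 30720 but lowering --parallel raises the per-slot budget to 30720 / 8 = 3840 tokens:

docker run --rm -p 8090:8080 \
        --ipc=host \
        --privileged \
        --shm-size=16g \
        -v /root/workdir:/workdir \
        -v /data/weights:/models \
        --gpus '"device=2"' \
        --env=NCCL_P2P_DISABLE=1 \
        --env=CUDA_VISIBLE_DEVICES="2" \
        ghcr.io/ggerganov/llama.cpp:server-cuda \
        -m models/Qwen2.5-Coder-32B-Instruct-GGUF/qwen2.5-coder-32b-instruct-q2_k.gguf \
        -a 'Qwen/Qwen2.5-Coder-32B-Instruct' \
        -c 30720 \
        --host 0.0.0.0 \
        --port 8080 \
        -ngl 99 \
        --parallel 8

Any setting that keeps n_ctx / n_parallel comfortably above your longest expected prompt plus response should avoid both the looping and the "context shift is disabled" error.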

e1ijah1 commented 1 hour ago

Thanks for your help! I tried reducing the number of slots and now it's working properly. Really appreciate your responses!