ggerganov / llama.cpp

LLM inference in C/C++
MIT License

update_slots : failed to decode the batch #4185

Closed: rvandernoort closed this issue 6 months ago

rvandernoort commented 10 months ago

Expected Behavior

Run 1000 prompts over an hour using the server.

Current Behavior

Around request 800 the KV cache seems to fill up, and the whole server application hangs indefinitely and no longer accepts any requests.

Environment and Context

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  24
  On-line CPU(s) list:   0-23
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 9 7900X 12-Core Processor
    CPU family:          25
    Model:               97
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU max MHz:         5732,7139
    CPU min MHz:         3000,0000
    BogoMIPS:            9381.89
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   384 KiB (12 instances)
  L1i:                   384 KiB (12 instances)
  L2:                    12 MiB (12 instances)
  L3:                    64 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-23

$ uname -a

Linux GreenServer 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 13:12:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Failure Information (for bugs)

I'm running a large number of inferences via requests to the server. The server accepts many of them, but eventually fails to find free space in the KV cache. Based on this comment I increased the context size of the model to 2048, which significantly increased the number of requests it resolves before hanging again.

Steps to Reproduce

  1. ./server -m /models/TinyLLama/original/ggml-model-f32.gguf -t 12 -ngl 99 -c 2048
  2. send data using python requests with this data format (a minimal client sketch follows this list)
    data = {
        "system_prompt": {
            "prompt": system_prompt,
            "anti_prompt": "User:",
            "assistant_name": "Assistant:"
        },
        "prompt": prompt,
        "n_predict": n_predict,
    }
  3. send 1000 requests
  4. fail (the server hangs as shown in the logs below)
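
For reference, a minimal sketch of the client loop from steps 2-3. The host/port, helper name, and dataset variable are illustrative assumptions, not part of the original report; the payload shape matches step 2 and the /completion endpoint seen in the logs below.

    import requests

    COMPLETION_URL = "http://127.0.0.1:8080/completion"  # assumed default host/port

    def send_request(system_prompt: str, prompt: str, n_predict: int = 128) -> str:
        # Same payload shape as in step 2 above.
        data = {
            "system_prompt": {
                "prompt": system_prompt,
                "anti_prompt": "User:",
                "assistant_name": "Assistant:",
            },
            "prompt": prompt,
            "n_predict": n_predict,
        }
        resp = requests.post(COMPLETION_URL, json=data)
        resp.raise_for_status()
        return resp.json().get("content", "")

    # Step 3: iterate over the 1000-entry dataset (hypothetical variable name).
    # for system_prompt, prompt in dataset:
    #     send_request(system_prompt, prompt)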

Failure Logs

llama.cpp-gpu  | system prompt updated
llama.cpp-gpu  | slot 0 is processing [task id: 812]
llama.cpp-gpu  | slot 0 : kv cache rm - [24, end)
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
llama.cpp-gpu  | update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
llama.cpp-gpu  | update_slots : failed to decode the batch, n_batch = 1, ret = 1

while everything before that works fine, for example:

llama.cpp-gpu  | system prompt updated
llama.cpp-gpu  | slot 0 is processing [task id: 810]
llama.cpp-gpu  | slot 0 : kv cache rm - [24, end)
llama.cpp-gpu  | 
llama.cpp-gpu  | print_timings: prompt eval time =      16.25 ms /    99 tokens (    0.16 ms per token,  6091.56 tokens per second)
llama.cpp-gpu  | print_timings:        eval time =     774.51 ms /   128 runs   (    6.05 ms per token,   165.27 tokens per second)
llama.cpp-gpu  | print_timings:       total time =     790.76 ms
llama.cpp-gpu  | slot 0 released (228 tokens in cache)
llama.cpp-gpu  | updating system prompt
llama.cpp-gpu  | {"timestamp":1700746397,"level":"INFO","function":"log_server_request","line":2215,"message":"request","remote_addr":"127.0.0.1","remote_port":35098,"status":200,"method":"POST","path":"/completion","params":{}}
llama.cpp-gpu  | system prompt updated
llama.cpp-gpu  | slot 0 is processing [task id: 811]
llama.cpp-gpu  | slot 0 : kv cache rm - [24, end)
llama.cpp-gpu  | 
llama.cpp-gpu  | print_timings: prompt eval time =      61.90 ms /   881 tokens (    0.07 ms per token, 14232.63 tokens per second)
llama.cpp-gpu  | print_timings:        eval time =     807.97 ms /   128 runs   (    6.31 ms per token,   158.42 tokens per second)
llama.cpp-gpu  | print_timings:       total time =     869.87 ms
llama.cpp-gpu  | slot 0 released (1010 tokens in cache)
llama.cpp-gpu  | updating system prompt
llama.cpp-gpu  | {"timestamp":1700746402,"level":"INFO","function":"log_server_request","line":2215,"message":"request","remote_addr":"127.0.0.1","remote_port":35114,"status":200,"method":"POST","path":"/completion","params":{}}

Is there something I'm not doing correctly, or is this expected behaviour? Could it be related to the system prompts (my dataset has a variety of system prompts, which get set for every request)? Should the server not flush the cache if a request no longer fits? If you need more info, let me know.

KerfuffleV2 commented 10 months ago

Is there something I'm not doing correctly, or is this expected behaviour?

It would be expected behavior if the KV cache is actually full. Since you set the context size to 2,048, that means that once generated tokens + prompt tokens add up to 2,048, the cache is full and it won't be possible to find a slot.
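
To make that concrete, here is an illustrative check (hypothetical helper, example numbers only): with -c 2048, a request only fits if its prompt tokens plus n_predict (plus any retained system prompt tokens) stay within 2048.

    def fits_in_context(prompt_tokens: int, n_predict: int, n_ctx: int = 2048) -> bool:
        # The slot needs room for the whole prompt plus every token it may generate.
        return prompt_tokens + n_predict <= n_ctx

    print(fits_in_context(1882, 128))  # True:  2010 <= 2048
    print(fits_in_context(1950, 128))  # False: 2078 >  2048 -> no free KV cache slot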

rvandernoort commented 10 months ago

Since you set the context size to 2,048, that means that once generated tokens + prompt tokens add up to 2,048, the cache is full and it won't be possible to find a slot.

Hey, thanks for your response! Okay, it does seem to happen only for a single entry in my dataset, which means that entry overflows the cache. But upping the context size to 4096 does not seem to allow this entry either. Also, the logs show slot 0 released (1010 tokens in cache) and I've set n_predict to 128; that does not sum to 2048, right?

EDIT: Okay, never mind, it turns out the problem is the next entry, which didn't get logged and is too big. Nonetheless, should the server not return an error instead of hanging on this error?

KerfuffleV2 commented 10 months ago

Nonetheless, should the server not return an error instead of hanging on this error?

Can't argue with that. :)

FSSRepo commented 10 months ago

@rvandernoort To avoid such errors, you should refrain from sending the system prompt every time you make a request. It is only required once, and subsequent requests will maintain the same system prompt without the need to resend it repeatedly.
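
As a sketch of that suggestion (same payload shape as in the reproduction steps; prompts and system_prompt are assumed to be defined), the system prompt would be sent only once and omitted from later requests:

    # First request sets the system prompt; the server keeps it for later requests.
    first = {
        "system_prompt": {
            "prompt": system_prompt,
            "anti_prompt": "User:",
            "assistant_name": "Assistant:",
        },
        "prompt": prompts[0],
        "n_predict": 128,
    }
    # Subsequent requests omit "system_prompt" entirely.
    later = {"prompt": prompts[1], "n_predict": 128}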

rvandernoort commented 10 months ago

@rvandernoort To avoid such errors, you should refrain from sending the system prompt every time you make a request. It is only required once, and subsequent requests will maintain the same system prompt without the need to resend it repeatedly.

Thanks for your suggestion. However, I'm trying to run a dataset with high variability in system prompts (a different one per prompt). Does this have any impact on the KV cache problem, or is it just a matter of better performance if I don't resend the system prompt? Generally, the system prompts are very small, though, and I think the issue lies with large normal prompts.

the-crypt-keeper commented 7 months ago

Using latest commit as of this writing: e00d2a62dd1441e3b089570ec06d05c18800d368

After 366 successful generations, of which 5 involved successful context shifts, the bug strikes:

[1707487639] slot 0 released (1871 tokens in cache)
[1707487640] slot 0 is processing [task id: 30069]
[1707487640] slot 0 : kv cache rm - [0, end)
[1707487647] sampled token:   243: '<F0>'
[1707487647] sampled token:   162: '<9F>'
[1707487647] sampled token:   144: '<8D>'
[1707487647] sampled token:   194: '<BF>'
[1707487647] sampled token: 27067: ' Sounds'
[1707487647] sampled token:  3659: ' perfect'
[1707487647] sampled token: 28808: '!'
[1707487647] sampled token:  3169: ' Let'
[1707487647] sampled token: 28742: '''
[1707487647] sampled token: 28713: 's'
[1707487647] sampled token:  1300: ' find'
[1707487647] sampled token:   264: ' a'
[1707487647] sampled token:  1179: ' good'
[1707487647] sampled token:  5994: ' movie'
[1707487647] sampled token:   298: ' to'
[1707487647] sampled token:  3054: ' watch'
[1707487647] sampled token:   304: ' and'
[1707487647] sampled token: 28306: ' curl'
[1707487648] sampled token:   582: ' up'
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
[1707487648] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
[1707487648] update_slots : failed to decode the batch, n_batch = 1, ret = 1

The update_slots messages repeat infinitely as others have reported.

Server was launched with:

/server -c 3500 -m ~/models/openhermes-2.5-mistral-7b.Q5_0.gguf -ngl 99

I unfortunately did not log the failing request, so I cannot confirm this, but I have a hunch this problem occurs when the context shift needs to happen during generation rather than during prompt processing. I do not use system prompts at all!
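
A small, hypothetical client-side helper (not part of llama.cpp) for capturing the failing request next time: log every payload before sending it, so the last logged entry is the one that triggered the hang.

    import json
    import logging

    import requests

    logging.basicConfig(filename="requests.log", level=logging.INFO)

    def post_logged(url: str, payload: dict) -> requests.Response:
        logging.info(json.dumps(payload))  # written before sending, so a hang still leaves a record
        return requests.post(url, json=payload, timeout=300)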

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

enn-nafnlaus commented 5 months ago

I'm getting this same issue, though I'm not yet on the latest code (no internet access for my server at the moment, so I can't test whether more recent versions fix it). Very annoying: this is not appropriate behavior. The appropriate behavior would be to fail the query instead of getting stuck in an infinite loop.
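
A possible client-side mitigation in the meantime (a sketch; payload stands for the request body and the default host/port is assumed): bound each request with a timeout so a hung slot surfaces as an exception on the client instead of blocking forever. This does not unstick the server itself.

    import requests

    try:
        resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=60)
        resp.raise_for_status()
    except requests.Timeout:
        print("request timed out; server is likely stuck in the update_slots retry loop")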