ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Assert fail during KV cache slot restore when cache file is empty #9173

Open kaetemi opened 3 weeks ago

kaetemi commented 3 weeks ago

What happened?

Tripping on this line: https://github.com/ggerganov/llama.cpp/blob/a07c32ea54850c989f0ef6989da5b955b77b7172/src/llama.cpp#L18773C1-L18774C1

That is, the allocated slot space does not match what was requested, even though the request did not return an error. Alternatively, the request contains an invalid case that is not being handled.

Name and Version

a07c32ea54850c989f0ef6989da5b955b77b7172

What operating system are you seeing the problem on?

No response

Relevant log output

0|pv_scheduler  | Saved slot 2753074044803 to pv_slot_2753074044803.ggsq, took 32.709 ms
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618632
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875617570816" timestamp=1724618632 remote_addr="127.0.0.1" remote_port=57838 status=200 method="POST" path="/slots/3" params={"action":"restore"}
0|pv_scheduler  | Restored slot 2753074044611 from pv_slot_2753074044611.ggsq, took 12.713 ms
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [   launch_slot_with_task] slot is processing task | tid="139875768709120" timestamp=1724618632 id_slot=3 id_task=658
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] kv cache rm [p0, end) | tid="139875768709120" timestamp=1724618632 id_slot=3 id_task=658 p0=1787
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [           print_timings] prompt eval time     =     158.19 ms /     9 tokens (   17.58 ms per token,    56.89 tokens per second) | tid="139875768709120" timestamp=1724618633 id_slot=3 id_task=658 t_prompt_processing=158.187 n_prompt_tokens_processed=9 t_token=17.576333333333334 n_tokens_second=56.894687932636685
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [           print_timings] generation eval time =      34.39 ms /     2 runs   (   17.20 ms per token,    58.15 tokens per second) | tid="139875768709120" timestamp=1724618633 id_slot=3 id_task=658 t_token_generation=34.393 n_decoded=2 t_token=17.1965 n_tokens_second=58.15136801093246
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [           print_timings]           total time =     192.58 ms | tid="139875768709120" timestamp=1724618633 id_slot=3 id_task=658 t_prompt_processing=158.187 t_token_generation=34.393 t_total=192.58
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] slot released | tid="139875768709120" timestamp=1724618633 id_slot=3 id_task=658 n_ctx=12288 n_past=1797 n_system_tokens=0 n_cache_tokens=1797 truncated=false
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618633
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875617570816" timestamp=1724618633 remote_addr="127.0.0.1" remote_port=57838 status=200 method="POST" path="/completion" params={}
0|pv_scheduler  | Request to /completion took 158.187 + 34.393 = 192.58 ms, 9 input tokens, 2 output tokens
0|pv_scheduler  | METRICS [falcon:7b-text] tok/s: { save: 77869.81, restore: 123323.00, prompt: 372.99, predicted: 35.67 }, req. success: 34/34, busy: 90.03%
0|pv_scheduler  | Process request for model: falcon-7b, cached, priority tier: 6, sequence number: 34, remaining queue: 3, prompt length: 2003 (tokens)
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618633
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875617570816" timestamp=1724618633 remote_addr="127.0.0.1" remote_port=57838 status=200 method="POST" path="/slots/3" params={"action":"save"}
0|pv_scheduler  | Saved slot 2753074044611 to pv_slot_2753074044611.ggsq, took 21.022 ms
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: read_kv_cache_meta: failed to find available cells in kv cache
0|pv_scheduler  | No cache cells available for falcon-7b:2048:6, setting flag
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: llama_state_seq_load_file: error loading sequence state file: failed to restore kv cache
0|pv_scheduler  | llama-server [falcon-7b:2048:6]:  ERR [              send_error] task error | tid="139875768709120" timestamp=1724618633 id_multi=-1 id_task=662 error="Unable to restore slot, no available space in KV cache or invalid slot save file"
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618633
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875617570816" timestamp=1724618633 remote_addr="127.0.0.1" remote_port=57838 status=400 method="POST" path="/slots/3" params={"action":"restore"}
0|pv_scheduler  | Retrying getPhysicalSlotId after no available space in KV cache
0|pv_scheduler  | Removed physical slot 3, evicted virtual slot 2753074046083. New physical slot count: 3
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618633
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875617570816" timestamp=1724618633 remote_addr="127.0.0.1" remote_port=57838 status=200 method="POST" path="/slots/3" params={"action":"save"}
0|pv_scheduler  | Saved slot 2753074046083 to pv_slot_2753074046083.ggsq, took 2.191 ms
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618633
0|pv_scheduler  | Erased KV cache for physical slot 3
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875609178112" timestamp=1724618633 remote_addr="127.0.0.1" remote_port=57848 status=200 method="POST" path="/slots/3" params={"action":"erase"}
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [            update_slots] all slots are idle | tid="139875768709120" timestamp=1724618633
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: INFO [      log_server_request] request | tid="139875609178112" timestamp=1724618633 remote_addr="127.0.0.1" remote_port=57848 status=200 method="POST" path="/slots/1" params={"action":"save"}
0|pv_scheduler  | Saved slot 2753074040867 to pv_slot_2753074040867.ggsq, took 32.368 ms
0|pv_scheduler  | llama-server [falcon-7b:2048:6]: /root/llama.cpp/src/llama.cpp:18773: GGML_ASSERT(kv_self.cells[kv_self.head].pos == batch.pos[0]) failed
0|pv_scheduler  | Error in request to /completion: read ECONNRESET
kaetemi commented 3 weeks ago

A bogus save on the slot that failed to restore appears to successfully write a blank cache file for that slot.

Loading that blank cache then trips this check; the restore path should probably skip everything else and only perform the erase step.