Suppose text1 is the prompt (context + question) of request 1 and text2 is the output message of request 1. Then text1 + text2 becomes the context of request 2. Following this logic, I ran five-turn tests; the number of tokens grows from under 100 to about 30,000 across the turns.
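For reference, the test loop looks roughly like the sketch below. This is a minimal sketch: the OpenAI-compatible client, endpoint, model name, and placeholder questions are assumptions for illustration, not the exact harness I used.

```python
# Minimal sketch of the five-turn loop described above. The endpoint,
# model name, and questions are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
QUESTIONS = [f"Question {i}?\n" for i in range(1, 6)]  # placeholder questions

context = ""
for turn, question in enumerate(QUESTIONS, start=1):
    prompt = context + question                 # text1 of this request
    start = time.perf_counter()
    stream = client.completions.create(
        model="my-model",                       # hypothetical model name
        prompt=prompt,
        max_tokens=512,
        stream=True,
    )
    ttft, output = None, ""
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        output += chunk.choices[0].text or ""
    print(f"turn {turn}: TTFT = {ttft:.3f}s")
    context = prompt + output                   # text1 + text2 -> next context
```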
Performance:
Enabling save_decode_cache did not reduce TTFT:
| Turn | TTFT (s), save_decode_cache: True | TTFT (s), save_decode_cache: False |
|------|-----------------------------------|------------------------------------|
| 1    | 0.335                             | 0.107                              |
| 2    | 0.275                             | 0.282                              |
| 3    | 0.849                             | 0.689                              |
| 4    | 1.497                             | 1.554                              |
| 5    | 3.759                             | 3.792                              |
Observations from the logs:
- From stderr.log, the retrieved number of chunks across the five turns is 0 0 2 2 2. With save_decode_cache = True, more chunks should be retrieved; with save_decode_cache = False, no chunks should be retrieved at all.
- From stdout.log, the last four token ids (after 28705) in prompt_token_ids are missing in the next turn's request.
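One way to confirm the missing ids is to diff consecutive prompt_token_ids entries parsed out of stdout.log. The sketch below assumes log lines of the form `prompt_token_ids: [1, 2, 3]`; the actual log format may differ, so the regex is an assumption.

```python
# Sketch: verify that each turn's prompt_token_ids starts with the previous
# turn's ids. Per the report above, the last four ids (after 28705) are
# dropped, so the prefix check should fail for the affected turns.
import ast
import re

def parse_prompt_ids(path: str) -> list[list[int]]:
    ids = []
    with open(path) as f:
        for line in f:
            m = re.search(r"prompt_token_ids:\s*(\[.*\])", line)
            if m:
                ids.append(ast.literal_eval(m.group(1)))
    return ids

turns = parse_prompt_ids("stdout.log")
for i, (prev, curr) in enumerate(zip(turns, turns[1:]), start=1):
    # The next turn's prompt should extend the previous turn's prompt.
    ok = curr[: len(prev)] == prev
    print(f"turn {i} -> {i + 1}: prefix preserved = {ok}")
```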