NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to make sure enable_kv_cache_reuse is working correctly? #2462

Open chwma0 opened 3 days ago

chwma0 commented 3 days ago

System Info

Who can help?

@kaiyux Is there a simple way to prove that enable_kv_cache_reuse is working correctly?
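One possible sanity check, offered only as a hedged sketch: Triton serves Prometheus metrics on the --metrics_port used in step 4 of the reproduction steps below (2402), and recent tensorrtllm_backend versions report KV-cache statistics there. The exact metric names vary between versions, so the snippet below simply filters the /metrics output for KV-cache related lines; dumping them before and after repeating an identical request should show the block/reuse counters moving if reuse is active.

    # check_kv_cache_metrics.py -- hedged sketch, not from the repo.
    # Assumes Triton's Prometheus endpoint is reachable at localhost:2402 (step 4)
    # and that the TRT-LLM backend publishes metrics whose names contain "kv_cache";
    # the exact metric names depend on the tensorrtllm_backend version.
    import urllib.request

    METRICS_URL = "http://localhost:2402/metrics"

    def kv_cache_metrics(url: str = METRICS_URL) -> list[str]:
        """Return the Prometheus metric lines that mention the KV cache."""
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode("utf-8")
        return [line for line in text.splitlines()
                if "kv_cache" in line and not line.startswith("#")]

    if __name__ == "__main__":
        # Dump once before and once after sending the same request a few times;
        # with enable_kv_cache_reuse working, the counters should change.
        for line in kv_cache_metrics():
            print(line)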

Information

Tasks

Reproduction

Steps

  1. convert model: CUDA_VISIBLE_DEVICES=1 python3 ./examples/qwen/convert_checkpoint.py --model_dir Qwen2_1.5B_Instruct --output_dir qwen/gpu1_fp16/ckp --dtype float16 --use_parallel_embedding
  2. trtllm build: CUDA_VISIBLE_DEVICES=1 trtllm-build --checkpoint_dir qwen/gpu1_fp16/ckp --output_dir qwen/gpu1_fp16/engine_reuse32 --gemm_plugin float16 --gpt_attention_plugin float16 --remove_input_padding enable --max_input_len 4096 --max_seq_len 4096 --max_beam_width 1 --max_batch_size 4 --gather_generation_logits --use_paged_context_fmha enable --tokens_per_block 32
  3. tritonserver: set in the model config
         parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } }
     and start the container:
         docker run -it --name ${name} --runtime nvidia --gpus all \
             -v deploy/dependence:/app/dependence \
             --shm-size=6g \
             --ipc=host \
             --privileged \
             --net host \
             --workdir /app/dependence/tensorrtllm_backend \
             --entrypoint /bin/bash \
             nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
  4. runtime: CUDA_VISIBLE_DEVICES=3 nohup python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/app/dependence/tensorrtllm_backend/tritonserver_config/qwen --grpc_port 2400 --http_port 2401 --metrics_port 2402 1>log.txt 2>&1 &
  5. client: python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --input-tokens-csv input_tokens.csv --url localhost:2400 --request-output-len 2000
  6. Measure the time consumption in inflight_batcher_llm/client/inflight_batcher_llm_client.py (timing snippet below; a standalone repeat-request sketch follows it)

    Send request (timing added around the existing client code):

            try:
                beg = int(round(time.time() * 1000))
                # ... send the request and consume the responses
                #     (client code elided in the original snippet) ...
                processed_count = processed_count + 1
                end = int(round(time.time() * 1000))
                dura = end - beg
                print("client infer cost", dura)
            except Exception as e:
                print("request failed:", e)
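For reference, a self-contained way to collect these numbers without editing the client is to re-run the exact command from step 5 several times and time each run from the outside. This is only a sketch that reuses the paths and ports from steps 4-5 (the file name time_repeated_requests.py is made up); it measures end-to-end wall-clock time, so it also includes client start-up overhead.

    # time_repeated_requests.py -- hedged sketch, not part of the repo.
    # Re-runs the client command from step 5 and prints the wall-clock time of
    # each run. If enable_kv_cache_reuse works, runs after the first one can at
    # least skip recomputing the context (prefill) phase for the identical prompt.
    import subprocess
    import time

    CLIENT_CMD = [
        "python3", "inflight_batcher_llm/client/inflight_batcher_llm_client.py",
        "--input-tokens-csv", "input_tokens.csv",
        "--url", "localhost:2400",
        "--request-output-len", "2000",
    ]

    def main(runs: int = 5) -> None:
        for i in range(runs):
            beg = time.perf_counter()
            # capture_output keeps the client's verbose output out of the way
            subprocess.run(CLIENT_CMD, check=True, capture_output=True)
            dura_ms = (time.perf_counter() - beg) * 1000
            print(f"run {i}: client infer cost {dura_ms:.0f} ms")

    if __name__ == "__main__":
        main()

Note that with --request-output-len 2000 the total latency is dominated by generating the output tokens, so a saving in the context phase only shows up as a small relative change; timing inside the client, as in the snippet above, or using a shorter output length makes any effect easier to see.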

Expected behavior

When the client sends the same request to the server multiple times, the later requests should take less and less time, because their KV cache can be reused.

actual behavior

There is almost no change in the time cost:

    4:  client infer cost 10279
    9:  client infer cost 10318
    14: client infer cost 10339
    19: client infer cost 10330
    24: client infer cost 10329
    29: client infer cost 10367
    34: client infer cost 10362
    39: client infer cost 10343
    44: client infer cost 10389
    49: client infer cost 10367
    54: client infer cost 10367
    59: client infer cost 10366

additional notes

None