NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to make sure enable_kv_cache_reuse is working correctly? #2462

Open chwma0 opened 3 days ago

chwma0 commented 3 days ago

System Info

Who can help?

@kaiyux Is there a simple way to prove that enable_kv_cache_reuse is working correctly?
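One possible sanity check, offered only as a hedged sketch: Triton serves Prometheus metrics on the --metrics_port used in step 4 of the reproduction steps below (2402), and recent tensorrtllm_backend versions report KV-cache statistics there. The exact metric names vary between versions, so the snippet below simply filters the /metrics output for KV-cache related lines; dumping them before and after repeating an identical request should show the block/reuse counters moving if reuse is active.

    # check_kv_cache_metrics.py -- hedged sketch, not from the repo.
    # Assumes Triton's Prometheus endpoint is reachable at localhost:2402 (step 4)
    # and that the TRT-LLM backend publishes metrics whose names contain "kv_cache";
    # the exact metric names depend on the tensorrtllm_backend version.
    import urllib.request

    METRICS_URL = "http://localhost:2402/metrics"

    def kv_cache_metrics(url: str = METRICS_URL) -> list[str]:
        """Return the Prometheus metric lines that mention the KV cache."""
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode("utf-8")
        return [line for line in text.splitlines()
                if "kv_cache" in line and not line.startswith("#")]

    if __name__ == "__main__":
        # Dump once before and once after sending the same request a few times;
        # with enable_kv_cache_reuse working, the counters should change.
        for line in kv_cache_metrics():
            print(line)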

Information

Tasks

Reproduction

Steps

  1. convert model: CUDA_VISIBLE_DEVICES=1 python3 ./examples/qwen/convert_checkpoint.py --model_dir Qwen2_1.5B_Instruct --output_dir qwen/gpu1_fp16/ckp --dtype float16 --use_parallel_embedding
  2. trtllm build: CUDA_VISIBLE_DEVICES=1 trtllm-build --checkpoint_dir qwen/gpu1_fp16/ckp --output_dir qwen/gpu1_fp16/engine_reuse32 --gemm_plugin float16 --gpt_attention_plugin float16 --remove_input_padding enable --max_input_len 4096 --max_seq_len 4096 --max_beam_width 1 --max_batch_size 4 --gather_generation_logits --use_paged_context_fmha enable --tokens_per_block 32
  3. tritonserver: set in the model config
         parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } }
     and start the container:
         docker run -it --name ${name} --runtime nvidia --gpus all \
             -v deploy/dependence:/app/dependence \
             --shm-size=6g \
             --ipc=host \
             --privileged \
             --net host \
             --workdir /app/dependence/tensorrtllm_backend \
             --entrypoint /bin/bash \
             nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
  4. runtime: CUDA_VISIBLE_DEVICES=3 nohup python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/app/dependence/tensorrtllm_backend/tritonserver_config/qwen --grpc_port 2400 --http_port 2401 --metrics_port 2402 1>log.txt 2>&1 &
  5. client: python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --input-tokens-csv input_tokens.csv --url localhost:2400 --request-output-len 2000
  6. Measure the time consumption in inflight_batcher_llm/client/inflight_batcher_llm_client.py (timing snippet below; a standalone repeat-request sketch follows it)

    Send request (timing added around the existing client code):

            try:
                beg = int(round(time.time() * 1000))
                # ... send the request and consume the responses
                #     (client code elided in the original snippet) ...
                processed_count = processed_count + 1
                end = int(round(time.time() * 1000))
                dura = end - beg
                print("client infer cost", dura)
            except Exception as e:
                print("request failed:", e)
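For reference, a self-contained way to collect these numbers without editing the client is to re-run the exact command from step 5 several times and time each run from the outside. This is only a sketch that reuses the paths and ports from steps 4-5 (the file name time_repeated_requests.py is made up); it measures end-to-end wall-clock time, so it also includes client start-up overhead.

    # time_repeated_requests.py -- hedged sketch, not part of the repo.
    # Re-runs the client command from step 5 and prints the wall-clock time of
    # each run. If enable_kv_cache_reuse works, runs after the first one can at
    # least skip recomputing the context (prefill) phase for the identical prompt.
    import subprocess
    import time

    CLIENT_CMD = [
        "python3", "inflight_batcher_llm/client/inflight_batcher_llm_client.py",
        "--input-tokens-csv", "input_tokens.csv",
        "--url", "localhost:2400",
        "--request-output-len", "2000",
    ]

    def main(runs: int = 5) -> None:
        for i in range(runs):
            beg = time.perf_counter()
            # capture_output keeps the client's verbose output out of the way
            subprocess.run(CLIENT_CMD, check=True, capture_output=True)
            dura_ms = (time.perf_counter() - beg) * 1000
            print(f"run {i}: client infer cost {dura_ms:.0f} ms")

    if __name__ == "__main__":
        main()

Note that with --request-output-len 2000 the total latency is dominated by generating the output tokens, so a saving in the context phase only shows up as a small relative change; timing inside the client, as in the snippet above, or using a shorter output length makes any effect easier to see.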

Expected behavior

When the client sends the same request to the server multiple times, the later requests should take less and less time, because their KV cache can be reused.

actual behavior

There is almost no change in the time cost:

    4:  client infer cost 10279
    9:  client infer cost 10318
    14: client infer cost 10339
    19: client infer cost 10330
    24: client infer cost 10329
    29: client infer cost 10367
    34: client infer cost 10362
    39: client infer cost 10343
    44: client infer cost 10389
    49: client infer cost 10367
    54: client infer cost 10367
    59: client infer cost 10366

additional notes

None