NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

kv_cache_reuse breaking on awq quantized model #1885

Open Bhuvanesh09 opened 2 weeks ago

Bhuvanesh09 commented 2 weeks ago

System Info

Who can help?

@Tracin , @kaiyux , @byshiue

Information

Tasks

Reproduction

Quantized with the command:

python ../quantization/quantize.py --model_dir <model_fp16> \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir <model_repo> \
    --calib_size 32

trtllm-build --checkpoint_dir <model_repo> \
    --output_dir <engine_repo> \
    --gemm_plugin float16 \
    --use_paged_context_fmha enable \
    --max_input_len 4000 \
    --max_output_len 400 \
    --max_batch_size 12
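
A sanity check one might add at this point (not part of the original report): trtllm-build writes a config.json next to the engine, and inspecting it shows whether the paged KV cache and paged context FMHA options actually made it into the build. The exact key layout varies between TensorRT-LLM versions, so treat the field names below as assumptions to adjust.

# Hypothetical engine-config inspection; key names vary by TensorRT-LLM version.
import json
from pathlib import Path

engine_dir = Path("<engine_repo>")  # same path passed to trtllm-build
config = json.loads((engine_dir / "config.json").read_text())

# Print whichever top-level sections exist rather than guessing exact keys.
for key in ("build_config", "plugin_config", "pretrained_config"):
    if key in config:
        print(key, json.dumps(config[key], indent=2))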

Started the model with the following arguments:

python3 tools/fill_template.py -i test-model/preprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,preprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/postprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,postprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:12,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i test-model/ensemble/config.pbtxt triton_max_batch_size:12
python3 tools/fill_template.py -i test-model/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:12,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:52000,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

How to get the error:

When tested with a semaphore of 10 (ensuring 10 requests are always pending at the server), we get the following error after a few successful predictions:

[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)
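
For reference, a minimal sketch of the kind of load test described above, keeping 10 requests pending with a semaphore. It assumes Triton's HTTP generate endpoint, the ensemble model name, and text_input/max_tokens payload fields; none of these are confirmed by the report, and the actual client used may differ.

# Hypothetical load-test client; endpoint, model name, and payload fields are assumptions.
import asyncio
import aiohttp

URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed endpoint
CONCURRENCY = 10  # keep 10 requests pending at all times

async def one_request(session, sem, i):
    payload = {"text_input": f"prompt {i}", "max_tokens": 400}
    async with sem:  # at most CONCURRENCY requests in flight
        async with session.post(URL, json=payload) as resp:
            body = await resp.text()
            print(i, resp.status, body[:80])

async def main(total=200):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(one_request(session, sem, i) for i in range(total)))

if __name__ == "__main__":
    asyncio.run(main())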

Information which might help in debugging:

The requests get dropped and the server stops working only once the initial KV cache is entirely full. The server is unable to evict the LRU KV cache blocks in paged attention, as it is supposed to. This can be confirmed by the fact that the server runs without any issues when enable_kv_cache_reuse is disabled.
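
To illustrate the eviction behavior being described, here is a purely conceptual toy model of a paged block pool with reuse and LRU eviction. It sketches only the expected behavior; it is not the actual C++ kvCacheManager logic, and every name in it is made up.

# Toy illustration only; not the real TensorRT-LLM kvCacheManager.
from collections import OrderedDict

class ToyBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # blocks never used or already evicted
        self.in_use = set()                  # blocks held by running requests
        self.reusable = OrderedDict()        # finished blocks kept for reuse: id -> prefix hash

    def allocate(self, prefix_hash):
        # Reuse a cached block if its prefix matches the incoming request.
        for block_id, cached_hash in self.reusable.items():
            if cached_hash == prefix_hash:
                self.reusable.pop(block_id)
                self.in_use.add(block_id)
                return block_id
        # Otherwise take a free block, evicting the least-recently-used reusable block if needed.
        if not self.free:
            if not self.reusable:
                raise RuntimeError("pool exhausted: nothing left to evict")
            evicted, _ = self.reusable.popitem(last=False)  # LRU eviction
            self.free.append(evicted)
        block_id = self.free.pop()
        self.in_use.add(block_id)
        return block_id

    def release(self, block_id, prefix_hash):
        # Finished requests hand their blocks back for potential reuse.
        self.in_use.discard(block_id)
        self.reusable[block_id] = prefix_hash

if __name__ == "__main__":
    pool = ToyBlockPool(num_blocks=2)
    a = pool.allocate(prefix_hash=1)
    pool.release(a, prefix_hash=1)
    b = pool.allocate(prefix_hash=2)  # takes the remaining free block
    c = pool.allocate(prefix_hash=3)  # pool full: evicts the LRU reusable block (a) instead of failing
    print(a, b, c)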

Expected behavior

The model should continue to serve requests without any issues.

Actual behavior

We get the following error in the Triton server:

[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)

Additional notes

Information which might help in debugging:

The requests get dropped and the server stops working only once the initial KV cache is entirely full. The server is unable to evict the LRU KV cache blocks in paged attention, as it is supposed to. This can be confirmed by the fact that the server runs without any issues when enable_kv_cache_reuse is disabled.

QiJune commented 1 week ago

@Tracin Could you please take a look? Thanks.

BTW, @Bhuvanesh09 Could you please try the main branch? Or the 0.10.0 release branch?

Tracin commented 1 week ago

@Bhuvanesh09 I think kv_cache_reuse is orthogonal to AWQ quantization. To narrow down the issue, could you try with a full-precision model?