NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

kv_cache_reuse breaking on awq quantized model #1885

Open Bhuvanesh09 opened 2 weeks ago

Bhuvanesh09 commented 2 weeks ago

System Info

Who can help?

@Tracin , @kaiyux , @byshiue

Information

Tasks

Reproduction

Quantized with the command:

python ../quantization/quantize.py --model_dir <model_fp16> \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir <model_repo> \
    --calib_size 32

trtllm-build --checkpoint_dir <model_repo> \
    --output_dir <engine_repo> \
    --gemm_plugin float16 \
    --use_paged_context_fmha enable \
    --max_input_len 4000 \
    --max_output_len 400 \
    --max_batch_size 12
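
A sanity check one might add at this point (not part of the original report): trtllm-build writes a config.json next to the engine, and inspecting it shows whether the paged KV cache and paged context FMHA options actually made it into the build. The exact key layout varies between TensorRT-LLM versions, so treat the field names below as assumptions to adjust.

# Hypothetical engine-config inspection; key names vary by TensorRT-LLM version.
import json
from pathlib import Path

engine_dir = Path("<engine_repo>")  # same path passed to trtllm-build
config = json.loads((engine_dir / "config.json").read_text())

# Print whichever top-level sections exist rather than guessing exact keys.
for key in ("build_config", "plugin_config", "pretrained_config"):
    if key in config:
        print(key, json.dumps(config[key], indent=2))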

Started the model with the following arguments:

python3 tools/fill_template.py -i test-model/preprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,preprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/postprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,postprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:12,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i test-model/ensemble/config.pbtxt triton_max_batch_size:12
python3 tools/fill_template.py -i test-model/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:12,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:52000,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

How to get the error:

When tested with a semaphore of 10 (ensuring 10 requests are always pending at the server), we get the following error after a few successful predictions:

[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)
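
For reference, a minimal sketch of the kind of load test described above, keeping 10 requests pending with a semaphore. It assumes Triton's HTTP generate endpoint, the ensemble model name, and text_input/max_tokens payload fields; none of these are confirmed by the report, and the actual client used may differ.

# Hypothetical load-test client; endpoint, model name, and payload fields are assumptions.
import asyncio
import aiohttp

URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed endpoint
CONCURRENCY = 10  # keep 10 requests pending at all times

async def one_request(session, sem, i):
    payload = {"text_input": f"prompt {i}", "max_tokens": 400}
    async with sem:  # at most CONCURRENCY requests in flight
        async with session.post(URL, json=payload) as resp:
            body = await resp.text()
            print(i, resp.status, body[:80])

async def main(total=200):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(one_request(session, sem, i) for i in range(total)))

if __name__ == "__main__":
    asyncio.run(main())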

Information which might help in debugging:

The requests get dropped and the server stops working only once the initial KV cache is entirely full. The server is unable to evict the LRU KV cache blocks in paged attention, as it is supposed to. This can be confirmed by the fact that the server runs without any issues when enable_kv_cache_reuse is disabled.
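
To illustrate the eviction behavior being described, here is a purely conceptual toy model of a paged block pool with reuse and LRU eviction. It sketches only the expected behavior; it is not the actual C++ kvCacheManager logic, and every name in it is made up.

# Toy illustration only; not the real TensorRT-LLM kvCacheManager.
from collections import OrderedDict

class ToyBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # blocks never used or already evicted
        self.in_use = set()                  # blocks held by running requests
        self.reusable = OrderedDict()        # finished blocks kept for reuse: id -> prefix hash

    def allocate(self, prefix_hash):
        # Reuse a cached block if its prefix matches the incoming request.
        for block_id, cached_hash in self.reusable.items():
            if cached_hash == prefix_hash:
                self.reusable.pop(block_id)
                self.in_use.add(block_id)
                return block_id
        # Otherwise take a free block, evicting the least-recently-used reusable block if needed.
        if not self.free:
            if not self.reusable:
                raise RuntimeError("pool exhausted: nothing left to evict")
            evicted, _ = self.reusable.popitem(last=False)  # LRU eviction
            self.free.append(evicted)
        block_id = self.free.pop()
        self.in_use.add(block_id)
        return block_id

    def release(self, block_id, prefix_hash):
        # Finished requests hand their blocks back for potential reuse.
        self.in_use.discard(block_id)
        self.reusable[block_id] = prefix_hash

if __name__ == "__main__":
    pool = ToyBlockPool(num_blocks=2)
    a = pool.allocate(prefix_hash=1)
    pool.release(a, prefix_hash=1)
    b = pool.allocate(prefix_hash=2)  # takes the remaining free block
    c = pool.allocate(prefix_hash=3)  # pool full: evicts the LRU reusable block (a) instead of failing
    print(a, b, c)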

Expected behavior

The model should continue to serve requests without any issues.

Actual behavior

We get the following error in the Triton server:

[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)

Additional notes

Information which might help in debugging:

The requests get dropped and the server stops working only once the initial KV cache is entirely full. The server is unable to evict the LRU KV cache blocks in paged attention, as it is supposed to. This can be confirmed by the fact that the server runs without any issues when enable_kv_cache_reuse is disabled.

QiJune commented 1 week ago

@Tracin Could you please take a look? Thanks.

BTW, @Bhuvanesh09 Could you please try the main branch? Or the 0.10.0 release branch?

Tracin commented 1 week ago

@Bhuvanesh09 I think kv_cache_reuse is orthogonal to AWQ quantization. To narrow down the issue, could you try with a full-precision model?