MahmoudAshraf97 commented 1 month ago

System Info

Build the engine following the official example instructions, but set remove_input_padding and paged_kv_cache to enable:

trtllm-build  --checkpoint_dir ${checkpoint_dir}/decoder \
              --output_dir ${output_dir}/decoder \
              --paged_kv_cache enable \
              --moe_plugin disable \
              --enable_xqa disable \
              --use_custom_all_reduce disable \
              --max_beam_width ${MAX_BEAM_WIDTH} \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --max_seq_len 114 \
              --max_input_len 14 \
              --max_encoder_input_len 1500 \
              --gemm_plugin ${INFERENCE_PRECISION} \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --gpt_attention_plugin ${INFERENCE_PRECISION} \
              --remove_input_padding enable

Then load the model using the class in run.py.

Expected behavior

The model should load fine.

Actual behavior

[07/10/2024-20:02:34] [TRT-LLM] [E] The following expected tensors are not found: {'past_key_value_1', 'cross_past_key_value_1', 'past_key_value_0', 'present_key_value_0', 'present_key_value_1', 'cross_past_key_value_0', 'cross_present_key_value_0', 'cross_present_key_value_1'}
[07/10/2024-20:02:34] [TRT-LLM] [E] Those tensors in engine are not expected: {'host_kv_cache_block_offsets', 'kv_cache_block_offsets', 'host_kv_cache_pool_pointers', 'host_cross_kv_cache_pool_pointers', 'host_cross_kv_cache_block_offsets', 'cross_kv_cache_block_offsets'}
[07/10/2024-20:02:34] [TRT-LLM] [E] Expected tensor names: ['input_ids', 'logits', 'last_token_ids', 'position_ids', 'cache_indirection', 'past_key_value_0', 'present_key_value_0', 'past_key_value_1', 'present_key_value_1', 'cross_present_key_value_0', 'cross_past_key_value_0', 'cross_present_key_value_1', 'cross_past_key_value_1', 'sequence_length', 'context_lengths', 'host_request_types', 'host_past_key_value_lengths', 'host_sink_token_length', 'host_max_attention_window_sizes', 'host_context_lengths', 'encoder_output', 'encoder_input_lengths', 'encoder_max_input_length', 'cross_kv_cache_gen']
[07/10/2024-20:02:34] [TRT-LLM] [E] Found tensor names: ['input_ids', 'position_ids', 'encoder_input_lengths', 'encoder_max_input_length', 'encoder_output', 'host_past_key_value_lengths', 'host_context_lengths', 'sequence_length', 'context_lengths', 'host_request_types', 'last_token_ids', 'cache_indirection', 'host_max_attention_window_sizes', 'host_sink_token_length', 'kv_cache_block_offsets', 'host_kv_cache_block_offsets', 'host_kv_cache_pool_pointers', 'cross_kv_cache_block_offsets', 'host_cross_kv_cache_block_offsets', 'host_cross_kv_cache_pool_pointers', 'cross_kv_cache_gen', 'logits']
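The two long lists in the log are easier to read as a set difference. The snippet below is a minimal sketch that copies the expected and found tensor names from the log above and recomputes the mismatch; the variable names are illustrative and nothing here calls into TensorRT-LLM. The comments describing the two layouts are an interpretation of the tensor names, not something the log states.

```python
# Tensor names copied verbatim from the "Expected tensor names" log line.
expected = [
    'input_ids', 'logits', 'last_token_ids', 'position_ids', 'cache_indirection',
    'past_key_value_0', 'present_key_value_0', 'past_key_value_1', 'present_key_value_1',
    'cross_present_key_value_0', 'cross_past_key_value_0',
    'cross_present_key_value_1', 'cross_past_key_value_1',
    'sequence_length', 'context_lengths', 'host_request_types',
    'host_past_key_value_lengths', 'host_sink_token_length',
    'host_max_attention_window_sizes', 'host_context_lengths',
    'encoder_output', 'encoder_input_lengths', 'encoder_max_input_length',
    'cross_kv_cache_gen',
]

# Tensor names copied verbatim from the "Found tensor names" log line.
found = [
    'input_ids', 'position_ids', 'encoder_input_lengths', 'encoder_max_input_length',
    'encoder_output', 'host_past_key_value_lengths', 'host_context_lengths',
    'sequence_length', 'context_lengths', 'host_request_types', 'last_token_ids',
    'cache_indirection', 'host_max_attention_window_sizes', 'host_sink_token_length',
    'kv_cache_block_offsets', 'host_kv_cache_block_offsets', 'host_kv_cache_pool_pointers',
    'cross_kv_cache_block_offsets', 'host_cross_kv_cache_block_offsets',
    'host_cross_kv_cache_pool_pointers', 'cross_kv_cache_gen', 'logits',
]

# Names the runtime asked for but the engine does not expose.
# These look like per-layer (non-paged) KV cache tensors.
missing = sorted(set(expected) - set(found))

# Names the engine exposes but the runtime did not ask for.
# These look like block-offset / pool-pointer (paged) KV cache tensors.
unexpected = sorted(set(found) - set(expected))

print("missing from engine:", missing)
print("unexpected in engine:", unexpected)
```

Every mismatched name relates to the KV cache, which suggests the engine was built with the paged layout while the Python session in run.py still requests the per-layer layout.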
Additional notes

I built with the paged KV cache enabled in order to use in-flight batching. That is not in a usable state for now, but it is a separate issue; see #1909.

yuekaizhang commented 1 month ago

@MahmoudAshraf97 I am investigating the issue now and will post an update here later.

MahmoudAshraf97 commented 1 month ago

Reproduced with 0.12.0.dev2024071600

yuekaizhang commented 1 month ago

> Reproduced with 0.12.0.dev2024071600

@MahmoudAshraf97 Yeah, the fix has not been merged into main yet. I will let you know here once it is merged.