NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

failed to load whisper decoder engine with paged kv cache #1930

Open MahmoudAshraf97 opened 1 month ago

MahmoudAshraf97 commented 1 month ago

System Info

Who can help?

@byshiue

Information

Tasks

Reproduction

Build the engines using the official example instructions, switching remove_input_padding and paged_kv_cache to enable:

INFERENCE_PRECISION=float16
WEIGHT_ONLY_PRECISION=int8
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=8
checkpoint_dir=distil_whisper_medium_en_weights_${WEIGHT_ONLY_PRECISION}
output_dir=distil_whisper_medium_en_${WEIGHT_ONLY_PRECISION}
trtllm-build  --checkpoint_dir ${checkpoint_dir}/decoder \
              --output_dir ${output_dir}/decoder \
              --paged_kv_cache enable \
              --moe_plugin disable \
              --enable_xqa disable \
              --use_custom_all_reduce disable \
              --max_beam_width ${MAX_BEAM_WIDTH} \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --max_seq_len 114 \
              --max_input_len 14 \
              --max_encoder_input_len 1500 \
              --gemm_plugin ${INFERENCE_PRECISION} \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --gpt_attention_plugin ${INFERENCE_PRECISION} \
              --remove_input_padding enable

Then load the model using the WhisperTRTLLM class in run.py.

Expected behavior

The model should load without errors.

Actual behavior

[07/10/2024-20:02:34] [TRT-LLM] [E] The following expected tensors are not found: {'past_key_value_1', 'cross_past_key_value_1', 'past_key_value_0', 'present_key_value_0', 'present_key_value_1', 'cross_past_key_value_0', 'cross_present_key_value_0', 'cross_present_key_value_1'}
[07/10/2024-20:02:34] [TRT-LLM] [E] Those tensors in engine are not expected: {'host_kv_cache_block_offsets', 'kv_cache_block_offsets', 'host_kv_cache_pool_pointers', 'host_cross_kv_cache_pool_pointers', 'host_cross_kv_cache_block_offsets', 'cross_kv_cache_block_offsets'}
[07/10/2024-20:02:34] [TRT-LLM] [E] Expected tensor names: ['input_ids', 'logits', 'last_token_ids', 'position_ids', 'cache_indirection', 'past_key_value_0', 'present_key_value_0', 'past_key_value_1', 'present_key_value_1', 'cross_present_key_value_0', 'cross_past_key_value_0', 'cross_present_key_value_1', 'cross_past_key_value_1', 'sequence_length', 'context_lengths', 'host_request_types', 'host_past_key_value_lengths', 'host_sink_token_length', 'host_max_attention_window_sizes', 'host_context_lengths', 'encoder_output', 'encoder_input_lengths', 'encoder_max_input_length', 'cross_kv_cache_gen']
[07/10/2024-20:02:34] [TRT-LLM] [E] Found tensor names: ['input_ids', 'position_ids', 'encoder_input_lengths', 'encoder_max_input_length', 'encoder_output', 'host_past_key_value_lengths', 'host_context_lengths', 'sequence_length', 'context_lengths', 'host_request_types', 'last_token_ids', 'cache_indirection', 'host_max_attention_window_sizes', 'host_sink_token_length', 'kv_cache_block_offsets', 'host_kv_cache_block_offsets', 'host_kv_cache_pool_pointers', 'cross_kv_cache_block_offsets', 'host_cross_kv_cache_block_offsets', 'host_cross_kv_cache_pool_pointers', 'cross_kv_cache_gen', 'logits']
{
    "name": "RuntimeError",
    "message": "Tensor names in engine are not the same as expected, to use this GenerationSession, you need to use PretrainedModel.prepare_inputs to create TRT Network inputs.",
    "stack": "---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 15
     12 accuracy_check = False  # Change to True for CI test accuracy check
     14 tensorrt_llm.logger.set_level(log_level)
---> 15 model = WhisperTRTLLM(engine_dir, debug, assets_dir)
     16 normalizer = EnglishTextNormalizer()

Cell In[2], line 172, in WhisperTRTLLM.__init__(self, engine_dir, debug_mode, assets_dir)
    169 engine_dir = Path(engine_dir)
    171 self.encoder = WhisperEncoding(engine_dir)
--> 172 self.decoder = WhisperDecoding(engine_dir, runtime_mapping, debug_mode=False)
    173 is_multilingual = self.decoder.decoder_config[\"vocab_size\"] >= 51865
    174 if is_multilingual:

Cell In[2], line 57, in WhisperDecoding.__init__(self, engine_dir, runtime_mapping, debug_mode)
     54 def __init__(self, engine_dir, runtime_mapping, debug_mode=False):
     56     self.decoder_config = self.get_config(engine_dir)
---> 57     self.decoder_generation_session = self.get_session(
     58         engine_dir, runtime_mapping, debug_mode
     59     )

Cell In[2], line 93, in WhisperDecoding.get_session(self, engine_dir, runtime_mapping, debug_mode)
     73     decoder_engine_buffer = f.read()
     75 decoder_model_config = ModelConfig(
     76     max_batch_size=self.decoder_config[\"max_batch_size\"],
     77     max_beam_width=self.decoder_config[\"max_beam_width\"],
   (...)
     91     has_token_type_embedding=False,
     92 )
---> 93 decoder_generation_session = tensorrt_llm.runtime.GenerationSession(
     94     decoder_model_config,
     95     decoder_engine_buffer,
     96     runtime_mapping,
     97     debug_mode=debug_mode,
     98 )
    100 return decoder_generation_session

File ~/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:863, in GenerationSession.__init__(self, model_config, engine_buffer, mapping, debug_mode, debug_tensors_to_save, cuda_graph_mode, stream)
    861     logger.error(f\"Expected tensor names: {expected_tensor_names}\")
    862     logger.error(f\"Found tensor names: {found_tensor_names}\")
--> 863     raise RuntimeError(
    864         \"Tensor names in engine are not the same as expected, to use this GenerationSession, \"
    865         \"you need to use PretrainedModel.prepare_inputs to create TRT Network inputs.\"
    866     )
    867 if self.debug_mode:
    868     self.debug_tensors = list(
    869         set(found_tensor_names) - set(expected_tensor_names))

RuntimeError: Tensor names in engine are not the same as expected, to use this GenerationSession, you need to use PretrainedModel.prepare_inputs to create TRT Network inputs."
}
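
The mismatch in the log follows directly from the engine's I/O tensor names: with paged_kv_cache enabled, the engine exposes block-offset and pool-pointer tensors instead of the linear past_key_value_* / present_key_value_* tensors that GenerationSession expects. A minimal sketch in plain Python (tensor names copied verbatim from the log above; the real check lives in GenerationSession.__init__) that reproduces the two error sets with the same set difference:

```python
# Names copied from the "Expected tensor names" line in the log above
# (what GenerationSession asks for: linear KV cache tensors).
expected = {
    "input_ids", "logits", "last_token_ids", "position_ids", "cache_indirection",
    "past_key_value_0", "present_key_value_0", "past_key_value_1",
    "present_key_value_1", "cross_present_key_value_0", "cross_past_key_value_0",
    "cross_present_key_value_1", "cross_past_key_value_1", "sequence_length",
    "context_lengths", "host_request_types", "host_past_key_value_lengths",
    "host_sink_token_length", "host_max_attention_window_sizes",
    "host_context_lengths", "encoder_output", "encoder_input_lengths",
    "encoder_max_input_length", "cross_kv_cache_gen",
}

# Names copied from the "Found tensor names" line (what the paged-KV engine
# actually exposes: block offsets and pool pointers instead of linear tensors).
found = {
    "input_ids", "position_ids", "encoder_input_lengths",
    "encoder_max_input_length", "encoder_output", "host_past_key_value_lengths",
    "host_context_lengths", "sequence_length", "context_lengths",
    "host_request_types", "last_token_ids", "cache_indirection",
    "host_max_attention_window_sizes", "host_sink_token_length",
    "kv_cache_block_offsets", "host_kv_cache_block_offsets",
    "host_kv_cache_pool_pointers", "cross_kv_cache_block_offsets",
    "host_cross_kv_cache_block_offsets", "host_cross_kv_cache_pool_pointers",
    "cross_kv_cache_gen", "logits",
}

# The two set differences are exactly the two error lists in the log.
missing = expected - found      # linear KV tensors absent from the engine
unexpected = found - expected   # paged KV cache tensors the session rejects

print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```

So any engine built with --paged_kv_cache enable will fail this name check, regardless of the other build flags.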

Additional notes

I built with paged KV cache enabled in order to use in-flight batching. It's not in a usable state for now, but that is tracked in another issue, see #1909.

yuekaizhang commented 1 month ago

@MahmoudAshraf97 I am investigating the issue now and will update here later.

MahmoudAshraf97 commented 1 month ago

Reproduced with 0.12.0.dev2024071600

yuekaizhang commented 1 month ago

Reproduced with 0.12.0.dev2024071600

@MahmoudAshraf97 Yeah, the fix has not been merged into main yet. I'll let you know here once it is merged.