NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
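For reference, a minimal sketch of the high-level Python API mentioned above, assuming a recent release that exports LLM and SamplingParams at the package root (the model name and prompt are placeholders):

    # Minimal sketch of the high-level LLM API; assumes a recent tensorrt_llm
    # release that exports LLM and SamplingParams at the package root.
    # The model name and the prompt are placeholders.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.8, top_p=0.95))
    for output in outputs:
        print(output.outputs[0].text)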

T5 model: the inference results are wrong when the batch size > 1 #1847

Open 0xd8b opened 2 days ago

0xd8b commented 2 days ago

System Info

A100, TensorRT-LLM 0.7.0

Who can help?

@byshiue @sy

Information

Tasks

Reproduction

  1. Converted the T5-large model according to the official example, using GPT and BERT plugins with float16 precision. Inference works correctly when batch size is 1.

  2. When batch size > 1, e.g., batch size = 4, we observed that the self-attention results are correct for odd batch indices, but the output of self-attention is all zeros for even batch indices.

  3. We debugged the decoder separately (using the T5 encoder output from HF as the input for the decoder) and found that self-attention in the decoder works correctly. However, in the cross-attention, the results are correct for odd batch indices, but the output is all zeros for even batch indices.

The above phenomenon only occurs when using the BERT and GPT plugins; it does not occur in plain TensorRT mode.
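For reference, here is a minimal sketch of the batch-invariance property we expect to hold: row i of a batched generation should decode to the same text as running prompt i on its own. The prompts are placeholders, and the sketch uses the Hugging Face reference model rather than the TensorRT-LLM engine.

    # Sketch of the batch-invariance check behind the report above: with a
    # correct attention implementation, row i of a batched generate() should
    # decode to the same text as running prompt i alone. Model name and
    # prompts are placeholders; this runs the Hugging Face reference,
    # not TensorRT-LLM.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-large")
    model = T5ForConditionalGeneration.from_pretrained("t5-large").half().cuda().eval()

    prompts = [
        "translate English to German: The house is small.",
        "summarize: TensorRT-LLM builds optimized engines for NVIDIA GPUs.",
        "translate English to French: Good morning.",
        "summarize: Batch size four is enough to trigger the problem.",
    ]

    with torch.no_grad():
        batch = tok(prompts, return_tensors="pt", padding=True).to("cuda")
        batched = model.generate(**batch, max_new_tokens=32)
        for i, prompt in enumerate(prompts):
            single = tok(prompt, return_tensors="pt").to("cuda")
            ref = model.generate(**single, max_new_tokens=32)
            same = (tok.decode(batched[i], skip_special_tokens=True)
                    == tok.decode(ref[0], skip_special_tokens=True))
            # With the plugin-built engine, even batch indices fail this
            # check (all-zero attention output); the reference passes.
            print(i, "match" if same else "MISMATCH")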

Expected behavior

When the batch size is greater than 1, the T5 model built with the BERT and GPT plugins should produce correct attention outputs for every batch index, matching the batch size 1 results.

Actual behavior

When the batch size is greater than 1, using the BERT and GPT plugins in the T5 model produces significant abnormalities: the attention output is entirely zeros for certain batch indices.

Additional notes

When the batch size is greater than 1, inference in the T5 family models behaves abnormally.

nv-guomingz commented 2 days ago

It seems that you're using a very outdated version (0.7.0). Could you please try the latest main branch code?

0xd8b commented 2 days ago

@nv-guomingz Yes, for historical reasons we developed on version 0.7.0. We have not seen anyone report this issue in the issues section, and the problem may still exist in newer versions, so we would like to fix it in 0.7.0. Have you ever encountered a similar issue?

nv-guomingz commented 2 days ago

I can't recall such an issue for T5 on version 0.7.0.

Would you please provide step-by-step instructions for reproducing the issue?

I still suggest you try the latest release wheel instead of 0.7.0 to see whether the issue still exists.

If it does, we'll file a bug for internal tracking and investigation.

0xd8b commented 2 days ago

@nv-guomingz We just tried TensorRT-LLM version 0.9.0 and converted T5-large using the official example (example/enc_dec/).

First, we followed the official example to convert the weights to float16. Then we used build.py to build the engine with batch_size=4, the GPT plugin, the BERT plugin, and float16 precision, keeping everything else consistent with the official example. Finally, we used run.py for inference. The inference results are abnormal for the even batch indices and correct for the odd ones. We have narrowed it down to the attention plugins: in the BERT plugin (encoder), the self-attention output is abnormal; in the GPT plugin (decoder), the self-attention output is normal, but the cross-attention output is abnormal, with the values for the even batch indices being all zeros. This might be a bug in the plugins.
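For reference, a sketch of the kind of per-batch zero check that flags the bad rows. How the attention output is dumped from the engine (e.g. the build-time --debug_mode tensor outputs) and its exact shape are assumptions here; only torch is required.

    # Sketch: given an attention output dumped from the engine (shape assumed
    # to be [batch, seq_len, hidden]; the dump mechanism, e.g. --debug_mode
    # tensor outputs, is an assumption), list the batch rows that are
    # entirely zero.
    import torch

    def zero_batch_rows(attn_out: torch.Tensor) -> list[int]:
        flat = attn_out.reshape(attn_out.shape[0], -1)
        return [i for i, row in enumerate(flat) if torch.count_nonzero(row) == 0]

    # Synthetic example reproducing the observed pattern (even rows zeroed).
    dump = torch.randn(4, 8, 16)
    dump[0::2] = 0.0
    print(zero_batch_rows(dump))  # -> [0, 2]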

hijkzzz commented 2 days ago

Could you try the latest version, TensorRT-LLM 0.11+? See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
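Once a newer wheel is installed, a trivial sanity check that the intended build is the one actually being imported:

    # Print the tensorrt_llm version visible to the current environment.
    import tensorrt_llm
    print(tensorrt_llm.__version__)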

1096125073 commented 22 hours ago

I have the same issue when using the gpt_attention plugin on 6 A10 GPUs with TP=2, PP=3 (screenshot attached).

1096125073 commented 22 hours ago

tensorrt-llm 0.10.0

symphonylyh commented 14 hours ago

Hi @0xd8b @1096125073, can you please provide your TRT-LLM version, runtime type (Python or the C++ pybind), model name, TP/PP setup, beam search setting, and reproducible input examples (English preferred)? On our end we weren't seeing any issue with BS > 1. Example: 0.10.0, C++ pybind, google/t5-large, TP=1 PP=1, no beam search, ["xxx", "yyy", "zzz"]. And if you can reproduce your issue with TP=1 PP=1, please provide an example under that config -- it's easier to debug.

0xd8b commented 12 hours ago

Sorry, I did not provide detailed information earlier. Here is the relevant information:

  1. First, we fine-tuned the t5-large network without changing the decoder's architecture.
  2. We used float32 precision during training and float16 precision during engine conversion.
  3. The engine conversion configurations are as follows:

    tensorrt_llm versions: 0.7.0 and 0.9.0

    python build.py \
        --world_size=1 --tp_size=1 --pp_size=1 --gpus_per_node=8 --parallel_build=False \
        --weight_from_pytorch_ckpt=False --engine_name="t5-small" --debug_mode=False \
        --timing_cache="model.cache" --model_type="t5" --dtype="float16" --logits_dtype="float16" \
        --log_level="info" --max_batch_size=4 --max_encoder_input_len=1500 \
        --max_decoder_input_len=1 --max_output_len=200 --max_beam_width=1 \
        --use_bert_attention_plugin="float16" --use_gpt_attention_plugin="float16" \
        --use_gemm_plugin="float16" --use_layernorm_plugin=False --use_rmsnorm_plugin=False \
        --use_lookup_plugin=False --enable_qk_half_accum=False --builder_opt=None \
        --remove_input_padding=False --random_seed=None --use_parallel_embedding=False \
        --embedding_sharding_dim=0 --use_custom_all_reduce=False --strongly_typed=True \
        --gather_all_token_logits=False
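For completeness, a small sketch for double-checking what was actually baked into the engine. The build step writes a config.json next to the engine; since key names differ between releases, the snippet searches the JSON rather than assuming exact keys, and the engine directory path is a placeholder.

    # Sketch: search the engine's config.json for batch-size and plugin
    # settings instead of assuming exact key names, which differ between
    # 0.7.0 and 0.9.0. The engine_dir path is a placeholder.
    import json
    from pathlib import Path

    engine_dir = Path("trt_engines/t5/float16/tp1")  # placeholder path

    with open(engine_dir / "config.json") as f:
        cfg = json.load(f)

    def find(node, needle, prefix=""):
        """Recursively print every key whose name contains `needle`."""
        for key, value in node.items():
            path = prefix + key
            if isinstance(value, dict):
                find(value, needle, path + ".")
            elif needle in key:
                print(path, "=", value)

    find(cfg, "batch")
    find(cfg, "plugin")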