0xd8b opened 2 days ago
It seems that you're using a very outdated version (0.7.0). Could you please try the latest main branch code?
@nv-guomingz Yes, due to historical reasons, we developed on version 0.7.0. We have not seen anyone report this issue in the issues section, and it is possible that this problem still exists in the newer versions. Therefore, we hope to address this problem in version 0.7.0. Have you ever encountered a similar issue?
For my part, I can't recall any such issue with T5 on version 0.7.0.
Could you please provide step-by-step instructions for reproducing the issue?
I still suggest you try the latest release wheel instead of 0.7.0 to see whether the issue still exists.
If so, we'll file a bug for internal tracking and investigation.
@nv-guomingz Just now, we used TensorRT-LLM version 0.9.0 and converted T5-large using the official example (examples/enc_dec/).
First, we followed the official example to convert the weights to float16. Then we used build.py to build the engine with batch_size=4, enabling the GPT and BERT attention plugins with float16 precision and keeping everything else consistent with the official example. We ran inference with run.py.

The inference results are abnormal at even batch indices and correct at odd batch indices. We traced the problem to the self-attention output of the BERT plugin being abnormal. In the GPT plugin, the self-attention output is normal, but the cross-attention output is abnormal: the values at even batch indices are all zeros. This might be a bug in the plugins.
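For anyone trying to reproduce this, the all-zero pattern at even batch indices is easy to confirm mechanically. Below is a minimal sketch (hypothetical helper, NumPy only) that scans a batched tensor, such as an attention output dumped from the engine in debug mode, and reports which batch indices are entirely zero:

```python
import numpy as np

def find_zero_batches(output: np.ndarray, atol: float = 0.0) -> list:
    """Return the batch indices whose entire slice is (near-)zero.

    `output` is assumed to have shape [batch, ...], e.g. an attention
    output captured from a debug run.
    """
    flat = output.reshape(output.shape[0], -1)
    return [i for i in range(flat.shape[0])
            if np.all(np.abs(flat[i]) <= atol)]

# Synthetic demo: batch of 4, with even indices zeroed to mimic the bug.
out = np.random.rand(4, 8, 16).astype(np.float16) + 0.1  # keep values away from 0
out[0] = 0.0
out[2] = 0.0
print(find_zero_batches(out))  # → [0, 2]
```

Running this on the real self-attention / cross-attention outputs should make it immediately clear which batch indices are affected.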
Could you try the latest version, TensorRT-LLM 0.11+? See the installation tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
I have the same issue when using the gpt_attention plugin.
6x A10 GPUs
TP=2, PP=3
tensorrt-llm 0.10.0
Hi @0xd8b @1096125073, can you please provide your trt-llm version, runtime type (Python or pybind of C++), model name, TP/PP setup, beam search setting, and reproducible input examples (English preferred)? On our end we weren't seeing any issue with BS>1. Example: 0.10.0, pybind of C++, google/t5-large, TP=1, PP=1, no beam search, ["xxx", "yyy", "zzz"]. And if you can reproduce your issue with TP=1 PP=1, please provide an example under this config -- it's easier to debug.
Sorry, I did not provide detailed information earlier. Here is the relevant information.
The engine build configuration is as follows:
tensorrt_llm versions: 0.7.0 and 0.9.0
```
--world_size=1 --tp_size=1 --pp_size=1 --gpus_per_node=8 --parallel_build=False
--weight_from_pytorch_ckpt=False --engine_name="t5-small" --debug_mode=False
--timing_cache="model.cache" --model_type="t5" --dtype="float16" --logits_dtype="float16"
--log_level="info" --max_batch_size=4 --max_encoder_input_len=1500 --max_decoder_input_len=1
--max_output_len=200 --max_beam_width=1 --use_bert_attention_plugin="float16"
--use_gpt_attention_plugin="float16" --use_gemm_plugin="float16" --use_layernorm_plugin=False
--use_rmsnorm_plugin=False --use_lookup_plugin=False --enable_qk_half_accum=False
--builder_opt=None --remove_input_padding=False --random_seed=None
--use_parallel_embedding=False --embedding_sharding_dim=0 --use_custom_all_reduce=False
--strongly_typed=True --gather_all_token_logits=False
```
System Info
A100, TensorRT-LLM 0.7.0

Who can help?
@byshiue @sy

Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
1. Converted the T5-large model according to the official example, using the GPT and BERT attention plugins with float16 precision. Inference works correctly when batch size is 1.
2. When batch size > 1 (e.g., batch size = 4), we observed that the self-attention results are correct at odd batch indices, but the self-attention output is all zeros at even batch indices.
3. We debugged the decoder separately (using the T5 encoder output from HF as the input to the decoder) and found that self-attention in the decoder works correctly. However, in the cross-attention, the results are correct at odd batch indices, but the output is all zeros at even batch indices.
4. The above phenomenon only occurs when using the BERT and GPT plugins; it does not occur in plain TensorRT mode.
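Since the plain-TensorRT path is correct, one way to pin down the faulty batch indices is to diff the plugin run against a reference run (plain TensorRT, or the HF model) per batch index. A minimal sketch (hypothetical helper, NumPy only, demonstrated on synthetic data):

```python
import numpy as np

def per_batch_max_abs_diff(ref: np.ndarray, test: np.ndarray) -> list:
    """Max absolute difference per batch index between a reference run
    (e.g. plain TensorRT or HF) and the plugin run.

    Both tensors are assumed to have shape [batch, ...].
    """
    assert ref.shape == test.shape
    diff = np.abs(ref.astype(np.float32) - test.astype(np.float32))
    return diff.reshape(diff.shape[0], -1).max(axis=1).tolist()

# Synthetic demo: plugin output zeroed at even batch indices, as reported.
ref = np.full((4, 3, 5), 0.5, dtype=np.float16)
test = ref.copy()
test[0::2] = 0.0  # mimic the reported bug
print(per_batch_max_abs_diff(ref, test))  # → [0.5, 0.0, 0.5, 0.0]
```

In a healthy run all entries should be within float16 tolerance; with this bug, the even indices jump to the magnitude of the reference values.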
Expected behavior
When the batch size is greater than 1, inference with the BERT and GPT plugins in the T5 model should produce correct attention outputs at every batch index.
actual behavior
When the batch size is greater than 1, using the BERT and GPT plugins in the T5 model shows significant abnormalities: at certain batch indices the attention output is entirely zeros.
additional notes
When the batch size is greater than 1, inference in the T5 family models behaves abnormally.