NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[0.11.0] T5 model running issue #1978

Open lanking520 opened 1 month ago

lanking520 commented 1 month ago

System Info

L4 GPUs (AWS g6.12xlarge) with TensorRT-LLM 0.11.0, running with the Triton backend

Who can help?

No response

Reproduction

Convert checkpoint

python convert_checkpoint.py --model_dir google/flan-t5-xl --dtype float16 \
    --output_dir /tmp/trtllm_t5_ckpt/ \
    --tp_size 4 --pp_size 1 --workers 4 --model_type t5

Build Encoder

trtllm-build --tp_size 4 --pp_size 1 \
    --checkpoint_dir /tmp/trtllm_t5_ckpt/encoder/ \
    --output_dir /tmp/.djl.ai/google-flan-t5-xl/1/encoder \
    --log_level info --workers 4 \
    --gemm_plugin float16 --gpt_attention_plugin float16 \
    --paged_kv_cache disable --context_fmha disable \
    --use_paged_context_fmha disable --use_fp8_context_fmha disable \
    --enable_xqa disable --moe_plugin disable --use_custom_all_reduce disable \
    --remove_input_padding enable --max_beam_width 1 \
    --max_batch_size 256 --max_input_len 1024 --max_num_tokens 16384

Build Decoder

trtllm-build --tp_size 4 --pp_size 1 \
    --checkpoint_dir /tmp/trtllm_t5_ckpt/decoder/ \
    --output_dir /tmp/.djl.ai/google-flan-t5-xl/1/decoder \
    --log_level info --workers 4 \
    --gemm_plugin float16 --gpt_attention_plugin float16 \
    --paged_kv_cache enable --context_fmha disable \
    --use_paged_context_fmha disable --use_fp8_context_fmha disable \
    --enable_xqa disable --moe_plugin disable --use_custom_all_reduce disable \
    --remove_input_padding enable --max_beam_width 1 \
    --max_batch_size 256 --max_input_len 1 --max_num_tokens 16384 \
    --max_encoder_input_len 1024 --max_seq_len 1024

Expected behavior

Both engines should build cleanly and inference should run without errors.

Actual behavior

During the decoder engine build, the following warning is printed several times:

[07/18/2024-02:10:56] [TRT-LLM] [W] Provided but not expected tensors: {'decoder_layers.0.self_attention.rel_attn_table', 'decoder_layers.1.self_attention.rel_attn_table', 'decoder_layers.6.self_attention.rel_attn_table', 'decoder_layers.23.self_attention.rel_attn_table', 'decoder_layers.20.self_attention.rel_attn_table', 'decoder_layers.2.self_attention.rel_attn_table', 'decoder_layers.16.self_attention.rel_attn_table', 'decoder_layers.19.self_attention.rel_attn_table', 'decoder_layers.10.self_attention.rel_attn_table', 'decoder_layers.7.self_attention.rel_attn_table', 'decoder_layers.11.self_attention.rel_attn_table', 'decoder_layers.12.self_attention.rel_attn_table', 'decoder_layers.17.self_attention.rel_attn_table', 'decoder_layers.4.self_attention.rel_attn_table', 'decoder_layers.5.self_attention.rel_attn_table', 'decoder_layers.18.self_attention.rel_attn_table', 'decoder_layers.15.self_attention.rel_attn_table', 'decoder_layers.9.self_attention.rel_attn_table', 'decoder_layers.13.self_attention.rel_attn_table', 'decoder_layers.21.self_attention.rel_attn_table', 'decoder_layers.8.self_attention.rel_attn_table', 'decoder_layers.22.self_attention.rel_attn_table', 'decoder_layers.14.self_attention.rel_attn_table', 'decoder_layers.3.self_attention.rel_attn_table'}

During inference, the server crashes with:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasGemmStridedBatchedEx(getCublasHandle(), transa, transb, m, n, k, alpha, A, AType, lda, strideA, B, BType, ldb, strideB, beta, C, CType, ldc, strideC, batchCount, computeType, mAType == CUDA_R_32F ? CUBLAS_GEMM_DEFAULT : CUBLAS_GEMM_DEFAULT_TENSOR_OP): CUBLAS_STATUS_EXECUTION_FAILED (/tmp/tensorrtllm_backend/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:206)
1       0x7f5000535e5f void tensorrt_llm::common::check<cublasStatus_t>(cublasStatus_t, char const*, char const*, int) + 175
2       0x7f50005340ff tensorrt_llm::common::CublasMMWrapper::stridedBatchedGemm(cublasOperation_t, cublasOperation_t, int, int, int, float, void const*, cudaDataType_t, int, long, void const*, cudaDataType_t, int, long, float, void*, cudaDataType_t, int, long, int, cudaDataType_t) + 367
3       0x7f4fc8d4456b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xf056b) [0x7f4fc8d4456b]
4       0x7f4fc8d422ad tensorrt_llm::plugins::BertAttentionPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 77
5       0x7f5106728bec /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10d5bec) [0x7f5106728bec]
6       0x7f51066cfd77 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x107cd77) [0x7f51066cfd77]
7       0x7f51066d1851 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x107e851) [0x7f51066d1851]
8       0x7f50023f0434 tensorrt_llm::batch_manager::TrtEncoderModel::executeContext(int) + 52
9       0x7f50023f04f3 tensorrt_llm::batch_manager::TrtEncoderModel::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 147
10      0x7f50023f3d72 tensorrt_llm::batch_manager::TrtEncoderModel::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1474
11      0x7f5002425e21 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 113
12      0x7f500242978d tensorrt_llm::executor::Executor::Impl::executionLoop() + 301
13      0x7f51d5eb0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f51d5eb0253]
14      0x7f52510e8ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f52510e8ac3]
15      0x7f5251179a04 clone + 68
[ef4fd6a5f5e7:83566] *** Process received signal ***
[ef4fd6a5f5e7:83566] Signal: Aborted (6)
[ef4fd6a5f5e7:83566] Signal code:  (-6)

Additional notes

N/A

symphonylyh commented 1 month ago

The rel_attn_table warning can be ignored. We're trying to reproduce the error; in theory this shouldn't happen, because we run Flan-T5 tests every day and they pass. Your commands also look good to me.

lanking520 commented 1 month ago

It turns out the error occurs because the following parameters are left unset in the tritonserver model config:

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "${max_attention_window_size}"
  }
}

After setting both of them to 4096, the model started to work. But I wonder why leaving them unset doesn't work.
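
For reference, this is what the two parameter blocks look like after substituting the workaround values (the rest of config.pbtxt is unchanged):

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "4096"
  }
}

Normally these ${...} placeholders are substituted with the tools/fill_template.py script shipped in the tensorrtllm_backend repository rather than edited by hand.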

jtchen0528 commented 1 month ago

Hi @lanking520,

I'm not able to reproduce your error with the latest main (2fa86dc). I built the model with your commands (dropping --use_custom_all_reduce disable, --tp_size, and --pp_size, since those options have since been deprecated) and ran tritonserver following the encoder-decoder tutorial. There is no problem with or without max_tokens_in_paged_kv_cache and max_attention_window_size set; a sketch of the build I ran is below.
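
Concretely, the decoder build looked roughly like this (a sketch reconstructed from your command with the deprecated flags removed, not a copy-paste of my exact invocation; on recent main the parallelism is read from the converted checkpoint):

trtllm-build --checkpoint_dir /tmp/trtllm_t5_ckpt/decoder/ \
    --output_dir /tmp/.djl.ai/google-flan-t5-xl/1/decoder \
    --log_level info --workers 4 \
    --gemm_plugin float16 --gpt_attention_plugin float16 \
    --paged_kv_cache enable --context_fmha disable \
    --use_paged_context_fmha disable --use_fp8_context_fmha disable \
    --enable_xqa disable --moe_plugin disable \
    --remove_input_padding enable --max_beam_width 1 \
    --max_batch_size 256 --max_input_len 1 --max_num_tokens 16384 \
    --max_encoder_input_len 1024 --max_seq_len 1024

The encoder build is adjusted the same way, keeping --paged_kv_cache disable and --max_input_len 1024 from your original command and omitting --max_encoder_input_len and --max_seq_len.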

By default, max_attention_window_size is set to min(max_attention_window_size, max_seq_len), and max_tokens_in_paged_kv_cache is derived from the free memory currently available on the device when kv_cache_free_gpu_mem_fraction is set. Leaving both parameters unset should therefore not be a problem.
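
In rough pseudocode, the defaulting behaves like this (an illustrative Python sketch of the behavior described above, not the actual TensorRT-LLM implementation; all names are invented):

# Illustrative sketch only -- not TensorRT-LLM source code.
def resolve_kv_cache_defaults(max_attention_window_size,
                              max_seq_len,
                              max_tokens_in_paged_kv_cache,
                              kv_cache_free_gpu_mem_fraction,
                              free_gpu_mem_bytes,
                              bytes_per_kv_token):
    # The attention window can never exceed the engine's max sequence length.
    if max_attention_window_size is None:
        max_attention_window_size = max_seq_len
    window = min(max_attention_window_size, max_seq_len)

    # With no explicit token budget, size the paged KV cache from a fraction
    # of the currently free device memory.
    if max_tokens_in_paged_kv_cache is None and kv_cache_free_gpu_mem_fraction is not None:
        budget_bytes = free_gpu_mem_bytes * kv_cache_free_gpu_mem_fraction
        max_tokens_in_paged_kv_cache = int(budget_bytes // bytes_per_kv_token)

    return window, max_tokens_in_paged_kv_cache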

Please share your tritonserver config so that we can further investigate this problem, thanks!!

Note: if using the latest main, one parameter needs to be added to tensorrt_llm/config.pbtxt: max_queue_size: 0