NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Flan-T5 models with Tensor Parallelism #1286

Open · hademircii opened this issue 8 months ago

hademircii commented 8 months ago

System Info

I am experimenting with TRT-LLM and flan-t5 models. My goal is simple: build engines with different configurations and tensor parallelism, then review performance. I have a DGX system and an AWS P4de instance (A100s) to work on. I did a full stack upgrade on each to see whether it would fix the problem, with no luck.

Who can help?

@byshiue @ncom


Reproduction

Follow the README for encoder-decoder models here (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#download-weights-from-huggingface-transformers), focusing on flan-t5-small (or use large), and go for example #3 (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#build-tensorrt-engines). The weight-download step is sketched below.
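For reference, the README's "download weights from HuggingFace Transformers" step boils down to something like the following; the local output directory is an illustrative choice here, not a path the example requires.

```python
# Fetch the flan-t5-small checkpoint and tokenizer from the Hugging Face Hub
# and save them locally for the enc_dec example's conversion step.
# "tmp/hf_models/flan-t5-small" is an arbitrary local path for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-small"
local_dir = "tmp/hf_models/flan-t5-small"
AutoTokenizer.from_pretrained(model_id).save_pretrained(local_dir)
AutoModelForSeq2SeqLM.from_pretrained(model_id).save_pretrained(local_dir)
```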

Expected behavior

The build command exits successfully, with engine artifacts exported in the target directory.
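A quick way to confirm that outcome after a build is to list the produced engines per component; the directory layout and file naming below are assumptions for illustration, not the example's documented contract.

```python
# Sanity check after a (successful) build: with tp=2 the expectation is one
# engine file per rank for both the encoder and the decoder. The engine_dir
# path and the "*.engine" naming are assumed here for illustration only.
from pathlib import Path

engine_dir = Path("tmp/trt_engines/flan-t5-small/2-gpu")  # hypothetical path
for component in ("encoder", "decoder"):
    found = sorted((engine_dir / component).glob("*.engine"))
    print(component, [p.name for p in found])  # expect one file per TP rank
```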

Actual behavior

I have tried this on a DGX system and on an AWS P4de instance, with different TP arrangements, small/large flan-t5 models, and with plugin flags added/removed; regardless of the configuration, the engine build errors out while building the decoder (the encoder engine does appear under the trt_engine directory). One way or another, all failure modes appear to be at layer DecoderModel/decoder_layers/0/cross_attention, with this error log:

[03/12/2024-16:38:05] [TRT] [E] 4: (Unnamed Layer* 95) [Output]: IIfConditionalOutputLayer inputs must have the same shape. Shapes are [-1,576] and [-1,1152].
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/PLUGIN_V2_GPTAttention_0: output shape can not be computed)
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/dense/PLUGIN_V2_AllReduce_0: output shape can not be computed)
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 605, in <module>
    run_build(component='decoder')
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 596, in run_build
    build(0, args)
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 540, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 469, in build_rank_engine
    tllm_model(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1097, in forward
    hidden_states = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 477, in forward
    hidden_states = residual + attention_output
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 322, in __add__
    return add(self, b)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2362, in elementwise_binary
    left, right = broadcast_helper(left, right)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2307, in broadcast_helper
    if left.rank() == right.rank():
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 465, in rank
    return len(self.trt_tensor.shape)
ValueError: __len__() should return >= 0
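As an aside, 576 is exactly half of 1152, which looks consistent with a tp=2 sharding mismatch: one branch of the if-conditional seems to carry a per-rank shard of the hidden dimension while the other carries the full width. A minimal numpy sketch of that shape bookkeeping, with the widths taken from the log above (mapping them onto the actual TensorRT graph is an assumption):

```python
# Illustrative only (not TensorRT-LLM code): how a column-parallel split over
# tp=2 ranks yields a [-1, 576] per-rank shard of a [-1, 1152] activation.
import numpy as np

tokens, hidden_size, tp_size = 4, 1152, 2   # widths from the error log
x = np.random.randn(tokens, hidden_size)
w = np.random.randn(hidden_size, hidden_size)

# Column-parallel split: each rank holds a (hidden, hidden/tp) weight shard.
w_shards = np.split(w, tp_size, axis=1)
local_out = x @ w_shards[0]
print(local_out.shape)   # (4, 576)  -> per-rank shard, half the hidden width

# Only after gathering the shards does the output match the residual input;
# if one conditional branch is sharded and the other gathered, their shapes
# disagree exactly as [-1,576] vs [-1,1152] in the log.
full_out = np.concatenate([x @ s for s in w_shards], axis=1)
print(full_out.shape)    # (4, 1152) -> full width after the gather
```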

Additional notes

Without tensor parallelism (tp=1), following the README works out fine for small/large T5s. I wonder if anyone has had success with flan-t5 models with tensor parallelism?

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

symphonylyh commented 5 months ago

Hi @hademircii, were you able to retest on the latest main or the recent 0.10 release branch? There have been many changes since March, and we believe this issue has been fixed for a while. With your confirmation, I will close the issue. Thanks!
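For anyone retesting, a quick way to confirm which build is installed before re-running the steps above (tensorrt_llm exposes the standard __version__ attribute):

```python
# Print the installed TensorRT-LLM version; a retest should show the 0.10
# release (or a recent build from main) rather than the March-era release.
import tensorrt_llm
print(tensorrt_llm.__version__)
```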