NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

support remove_input_padding for BertForSequenceClassification models #1834

Open Altair-Alpha opened 3 days ago

Altair-Alpha commented 3 days ago

Related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1755

Content:

  1. Support remove_input_padding for BertForSequenceClassification models (implementation details are given in code comments).
  2. Refinement of the build.py script: for example, the original script doesn't have an input model directory parameter and initializes the model with random weights, which is not intuitive.
  3. Since the number of model inputs changes from 3 to 5 (input_ids, input_lengths, token_type_ids, position_ids, max_input_length), I added a standalone run_remove_input_padding.py demo script showing how to build them from only input_ids and token_type_ids (see the sketch after this list).
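For reference, here is a minimal sketch of how those five inputs could be assembled from only input_ids and token_type_ids in the packed (remove_input_padding) layout. The tensor names follow the list above, but the exact shapes/dtypes the engine expects and the use of max_input_length as a shape-only dummy tensor are assumptions here; the authoritative version is the run_remove_input_padding.py script in this change.

```python
# Hedged sketch: packing un-padded per-sequence ids into the five engine inputs.
# Shapes/dtypes are assumptions; see run_remove_input_padding.py for the real logic.
import torch

def pack_inputs(input_ids_list, token_type_ids_list):
    """Pack per-sequence id lists (no padding) into flat packed-layout inputs."""
    input_lengths = torch.tensor([len(ids) for ids in input_ids_list],
                                 dtype=torch.int32)
    # Concatenate all sequences into one flat 1-D tensor (packed layout).
    input_ids = torch.cat([torch.tensor(ids, dtype=torch.int32)
                           for ids in input_ids_list])
    token_type_ids = torch.cat([torch.tensor(t, dtype=torch.int32)
                                for t in token_type_ids_list])
    # Per-sequence position ids 0..len-1, concatenated in the same order.
    position_ids = torch.cat([torch.arange(l, dtype=torch.int32)
                              for l in input_lengths.tolist()])
    # Assumption: max_input_length is only consumed for its shape, so a dummy
    # tensor whose length equals the longest sequence is passed.
    max_input_length = torch.zeros(int(input_lengths.max()), dtype=torch.int32)
    return input_ids, input_lengths, token_type_ids, position_ids, max_input_length

# Example: two sequences of different lengths, no padding tokens anywhere.
ids = [[101, 2023, 2003, 102], [101, 7592, 102]]
tt  = [[0, 0, 0, 0], [0, 0, 0]]
packed = pack_inputs(ids, tt)
```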

I have only implemented and tested this for BertForSequenceClassification, not for the other BERT models yet; please feel free to do further work on this :)

Altair-Alpha commented 3 days ago

For our data, there is roughly a 35% latency improvement and a 28% throughput improvement when batch_size is 16, and the difference in output logits between the PyTorch model and the TRT engine with remove_input_padding is below 0.01.
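
For context, the logits comparison behind the sub-0.01 figure can be reproduced with a check along these lines; hf_logits and trt_logits are hypothetical placeholders for the outputs of the HuggingFace PyTorch model and the TensorRT-LLM engine on the same batch:

```python
# Hedged sketch of a max absolute logits-difference check between two backends.
import torch

def max_logit_diff(hf_logits: torch.Tensor, trt_logits: torch.Tensor) -> float:
    """Return the largest element-wise absolute difference between two logit tensors."""
    return (hf_logits.float() - trt_logits.float()).abs().max().item()
```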

nv-guomingz commented 3 days ago

Thanks @Altair-Alpha, we'll merge your changes and upstream them to GitHub next week.