NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

support remove_input_padding for BertForSequenceClassification models #1834

Open Altair-Alpha opened 3 days ago

Altair-Alpha commented 3 days ago

Related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1755

Content:

  1. Support remove_input_padding for BertForSequenceClassification models (implementation details are given in code comments).
  2. Refinement of the build.py script: for example, the original script doesn't have an input model directory parameter and initializes the model with random weights, which is not intuitive.
  3. Since the number of model inputs changes from 3 to 5 (input_ids, input_lengths, token_type_ids, position_ids, max_input_length), I added a standalone run_remove_input_padding.py demo script showing how to build them from only input_ids and token_type_ids (see the sketch after this list).
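For reference, here is a minimal sketch of how those five inputs could be assembled from only input_ids and token_type_ids in the packed (remove_input_padding) layout. The tensor names follow the list above, but the exact shapes/dtypes the engine expects and the use of max_input_length as a shape-only dummy tensor are assumptions here; the authoritative version is the run_remove_input_padding.py script in this change.

```python
# Hedged sketch: packing un-padded per-sequence ids into the five engine inputs.
# Shapes/dtypes are assumptions; see run_remove_input_padding.py for the real logic.
import torch

def pack_inputs(input_ids_list, token_type_ids_list):
    """Pack per-sequence id lists (no padding) into flat packed-layout inputs."""
    input_lengths = torch.tensor([len(ids) for ids in input_ids_list],
                                 dtype=torch.int32)
    # Concatenate all sequences into one flat 1-D tensor (packed layout).
    input_ids = torch.cat([torch.tensor(ids, dtype=torch.int32)
                           for ids in input_ids_list])
    token_type_ids = torch.cat([torch.tensor(t, dtype=torch.int32)
                                for t in token_type_ids_list])
    # Per-sequence position ids 0..len-1, concatenated in the same order.
    position_ids = torch.cat([torch.arange(l, dtype=torch.int32)
                              for l in input_lengths.tolist()])
    # Assumption: max_input_length is only consumed for its shape, so a dummy
    # tensor whose length equals the longest sequence is passed.
    max_input_length = torch.zeros(int(input_lengths.max()), dtype=torch.int32)
    return input_ids, input_lengths, token_type_ids, position_ids, max_input_length

# Example: two sequences of different lengths, no padding tokens anywhere.
ids = [[101, 2023, 2003, 102], [101, 7592, 102]]
tt  = [[0, 0, 0, 0], [0, 0, 0]]
packed = pack_inputs(ids, tt)
```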

I have only implemented and tested this for BertForSequenceClassification, not for the other BERT models yet; please feel free to do further work on this :)

Altair-Alpha commented 3 days ago

For our data, there is roughly a 35% latency improvement and a 28% throughput improvement when batch_size is 16, and the difference in output logits between the PyTorch model and the TRT engine with remove_input_padding is below 0.01.
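
For context, the logits comparison behind the sub-0.01 figure can be reproduced with a check along these lines; hf_logits and trt_logits are hypothetical placeholders for the outputs of the HuggingFace PyTorch model and the TensorRT-LLM engine on the same batch:

```python
# Hedged sketch of a max absolute logits-difference check between two backends.
import torch

def max_logit_diff(hf_logits: torch.Tensor, trt_logits: torch.Tensor) -> float:
    """Return the largest element-wise absolute difference between two logit tensors."""
    return (hf_logits.float() - trt_logits.float()).abs().max().item()
```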

nv-guomingz commented 3 days ago

Thanks @Altair-Alpha, we'll merge your changes and upstream them to GitHub next week.