Altair-Alpha opened 3 days ago
For our data, there's about a 35% latency and 28% throughput improvement when batch_size is 16, and the difference in output logits between the PyTorch model and the TRT engine with remove_input_padding is below 0.01.
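The accuracy check described above can be sketched as a simple max-absolute-difference comparison between the two models' logits. The numbers here are hypothetical placeholders, not values from the actual benchmark:

```python
# Hypothetical output logits from the PyTorch model and the TRT engine
# (illustrative values only; the real check would use the actual tensors).
pt_logits = [2.31, -1.05, 0.42]
trt_logits = [2.305, -1.048, 0.424]

# Element-wise max absolute difference, the usual parity metric.
max_abs_diff = max(abs(a - b) for a, b in zip(pt_logits, trt_logits))
print(max_abs_diff < 0.01)  # True for these sample values
```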
Thanks @Altair-Alpha, we'll merge your changes and upstream them to GitHub next week.
Related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1755

Content: changes to the `build.py` script. For example, the original script doesn't have an input model dir parameter and will init the model with random weights, which is not intuitive. For the full remove_input_padding input set (`input_ids`, `input_lengths`, `token_type_ids`, `position_ids`, `max_input_length`), I add a standalone `run_remove_input_padding.py` demo script and show how to build them with only `input_ids` and `token_type_ids`. I only implemented and tested this for `BertForSequenceClassification` but not other BERT models yet, please feel free to do further work on this :)
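For readers unfamiliar with remove_input_padding: the idea is to concatenate the non-padding tokens of every sequence in the batch into one flat tensor, keeping per-sequence lengths so boundaries can be recovered. The sketch below illustrates that packing step in plain Python; it is a conceptual example, not the actual TensorRT-LLM preprocessing code:

```python
def remove_padding(padded_ids, lengths):
    """Pack a right-padded batch (rows of equal length) into one flat list.

    Returns the flat token ids plus the per-sequence lengths, which together
    carry the same information as the padded [batch, max_len] layout but
    without wasting compute on padding tokens.
    """
    flat = []
    for row, n in zip(padded_ids, lengths):
        flat.extend(row[:n])  # keep only the first n real tokens of each row
    return flat, lengths

# Two sequences of lengths 3 and 5, right-padded with 0 to max_len = 5.
padded = [[101, 7592, 102, 0, 0],
          [101, 2054, 2003, 2023, 102]]
flat, lens = remove_padding(padded, [3, 5])
print(flat)  # [101, 7592, 102, 101, 2054, 2003, 2023, 102]
```

With this layout, the engine only needs `input_ids` (the flat tensor) and `input_lengths` to process the batch, which is why padding-related inputs can be dropped.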