The PR implements a smart padding strategy for the BERT model.
When serving a batch of requests with different sequence lengths, padding is normally required to make all requests the same length. In fact, only part of the BERT model needs padded input; the rest can run on the unpadded tokens directly.
For background on Smart Padding, please refer to https://github.com/bytedance/effective_transformer.
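The core idea can be sketched as a pack/unpack pair: token-wise layers (e.g. the feed-forward blocks) only need the valid tokens, so padding can be stripped before them and restored before the layers that require a rectangular batch (e.g. attention). The helper names below are illustrative, not the PR's actual API; this is a minimal NumPy sketch, assuming a boolean mask where `True` marks real tokens.

```python
import numpy as np

def pack_tokens(x, mask):
    """Drop padded positions: (batch, seq_len, hidden) -> (num_valid, hidden)."""
    return x[mask]

def unpack_tokens(packed, mask, pad_value=0.0):
    """Scatter packed tokens back into the padded (batch, seq_len, hidden) shape."""
    batch, seq_len = mask.shape
    hidden = packed.shape[-1]
    out = np.full((batch, seq_len, hidden), pad_value, dtype=packed.dtype)
    out[mask] = packed
    return out

# Example: two requests of lengths 2 and 4, padded to seq_len = 4.
mask = np.array([[True, True, False, False],
                 [True, True, True,  True]])
x = np.random.rand(2, 4, 3).astype(np.float32)

packed = pack_tokens(x, mask)        # shape (6, 3): padding removed before token-wise layers
restored = unpack_tokens(packed, mask)  # padded shape restored before attention
```

With this split, the per-token compute scales with the number of real tokens rather than `batch * max_seq_len`, which is where the savings come from for batches with uneven lengths.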