Tencent / TurboTransformers

A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc.) on CPU and GPU.

Smart Batching for Bert. #217

Closed feifeibear closed 3 years ago

feifeibear commented 3 years ago

This PR implements a smart padding strategy for the BERT model. When serving a batch of requests with different sequence lengths, padding is required to make all requests the same length. In fact, only part of the BERT model needs padded input; the rest can run on the unpadded tokens. For details on smart padding, please refer to https://github.com/bytedance/effective_transformer. A sketch of the idea is shown below.
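The following is a minimal sketch (not the TurboTransformers API) of the pack/unpack trick behind smart padding: position-wise layers (FFN, LayerNorm) operate on each token independently, so padding tokens can be dropped there and restored only where a rectangular batch is needed, e.g. for self-attention. The helper names `build_padding_mask`, `pack`, and `unpack` are illustrative assumptions.

```python
import numpy as np

def build_padding_mask(seq_lens, max_len):
    # True where a real token exists, False where padding sits.
    return np.arange(max_len)[None, :] < np.asarray(seq_lens)[:, None]

def pack(padded, mask):
    # Drop padded positions: (batch, max_len, hidden) -> (total_tokens, hidden).
    return padded[mask]

def unpack(packed, mask, pad_value=0.0):
    # Restore the padded layout so attention can run on a rectangular batch.
    batch, max_len = mask.shape
    hidden = packed.shape[-1]
    padded = np.full((batch, max_len, hidden), pad_value, dtype=packed.dtype)
    padded[mask] = packed
    return padded

# Example: a batch of 3 requests with sequence lengths 5, 2, 4 padded to length 5.
seq_lens = [5, 2, 4]
max_len = max(seq_lens)
mask = build_padding_mask(seq_lens, max_len)
hidden_states = np.random.rand(len(seq_lens), max_len, 768).astype(np.float32)

packed = pack(hidden_states, mask)   # (11, 768): only real tokens
# ... run position-wise layers (FFN, LayerNorm) on `packed` ...
restored = unpack(packed, mask)      # (3, 5, 768): padded layout for attention
assert np.allclose(restored[mask], packed)
```

With a skewed batch (e.g. lengths 5, 2, 4 above), the packed path processes 11 tokens instead of 15, which is where the speedup of smart padding comes from.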