apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, JavaScript and more
https://mxnet.apache.org
Apache License 2.0

Dynamic Batching during Inference / Runtime #20220

Open andreas-solti opened 3 years ago

andreas-solti commented 3 years ago

First, thanks for creating this great, high-performance framework! I've looked through the open and closed issues and couldn't find an existing request for this.

Description

It would be really cool if the engine supported automatic batching of inference requests. The feature would dynamically wrap similar-sized inputs into a single batch and unwrap the results again, driven by a configured maximum wait time and a preferred batch size (see the sketch below).
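To make the idea concrete, here is a rough application-level sketch of the requested behavior, not engine code. The class name `DynamicBatcher` and the knobs `preferred_batch_size` / `max_wait_ms` are made up for illustration; it assumes a Gluon block and same-shaped inputs.

```python
import threading
import time
from queue import Queue, Empty

import mxnet as mx


class DynamicBatcher:
    """Illustrative sketch: collect same-shaped requests and run them as one
    batch, bounded by a preferred batch size and a maximum wait time."""

    def __init__(self, net, preferred_batch_size=8, max_wait_ms=5):
        self.net = net
        self.preferred_batch_size = preferred_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.queue = Queue()  # items: (input NDArray, done Event, result dict)
        threading.Thread(target=self._loop, daemon=True).start()

    def infer(self, x):
        """Called by request handlers; blocks until the batched result is ready."""
        done, out = threading.Event(), {}
        self.queue.put((x, done, out))
        done.wait()
        return out["y"]

    def _loop(self):
        while True:
            first = self.queue.get()  # block until at least one request arrives
            batch, deadline = [first], time.time() + self.max_wait
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self.preferred_batch_size:
                timeout = deadline - time.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.queue.get(timeout=timeout))
                except Empty:
                    break
            # Wrap: stack the (same-shaped) inputs into one batch.
            inputs = [item[0] for item in batch]
            xs = inputs[0] if len(inputs) == 1 else mx.nd.concat(*inputs, dim=0)
            ys = self.net(xs)
            ys.wait_to_read()  # sync with MXNet's asynchronous engine
            # Unwrap: hand each caller its slice of the batched output.
            for i, (_, done, out) in enumerate(batch):
                out["y"] = ys[i:i + 1]
                done.set()


if __name__ == "__main__":
    net = mx.gluon.nn.Dense(10)
    net.initialize()
    batcher = DynamicBatcher(net)
    print(batcher.infer(mx.nd.random.uniform(shape=(1, 4))).shape)
```

Doing this in the engine itself would avoid the Python-level locking and queueing shown here, which is exactly the point of the request.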

References

Expected Value

A large speedup is expected in practical high-load inference settings where many users must be served. Batching implemented directly in the engine should be much faster than the currently available (best?) solution, the multi-model-server: the latter adds the overhead of a Java server, HTTP calls, and Python-based batching. A rough micro-benchmark of the raw batching gain is sketched below.
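As a rough illustration of why batching pays off, the snippet below times N single-sample forward passes against one batched pass. It uses `resnet18_v1` from the Gluon model zoo as an assumed stand-in workload; it measures only the engine-level gain from batching, not the multi-model-server path.

```python
import time

import mxnet as mx
from mxnet.gluon.model_zoo import vision

N = 16
net = vision.resnet18_v1(pretrained=False)
net.initialize()
data = mx.nd.random.uniform(shape=(N, 3, 224, 224))

net(data[0:1]).wait_to_read()  # warm-up

# N separate forward passes, one sample each.
start = time.time()
for i in range(N):
    net(data[i:i + 1]).wait_to_read()
one_by_one = time.time() - start

# One forward pass over the whole batch.
start = time.time()
net(data).wait_to_read()
batched = time.time() - start

print(f"sequential: {one_by_one:.3f}s  batched: {batched:.3f}s")
```

The exact ratio depends on the model and hardware, but the batched pass amortizes per-call overhead and uses the hardware far better, which is the speedup this feature would deliver to every caller automatically.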

github-actions[bot] commented 3 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.