NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Apache License 2.0

758 stars 113 forks

[RMP] Dynamic Batching support at serving time #906

Open EvenOldridge opened 1 year ago

EvenOldridge commented 1 year ago

Problem:

Customers with high traffic volumes want to trade some per-request latency for higher throughput by grouping requests into dynamic batches.

Goal:

Leverage Triton's dynamic batching capabilities to enable support for dynamic batches in Merlin.
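
For context, Triton enables dynamic batching per model through a `dynamic_batching` block in the model's `config.pbtxt`. The sketch below is only an illustration of what that fragment looks like, not how Merlin generates its configs: the helper `render_dynamic_batching_config` and the model name `merlin_ensemble` are hypothetical, and a real config also needs the backend, input, and output sections.

```python
def render_dynamic_batching_config(name, max_batch_size, preferred_batch_sizes, max_queue_delay_us):
    """Render the dynamic-batching portion of a Triton config.pbtxt as text.

    Hypothetical helper for illustration; only the emitted field names
    (max_batch_size, dynamic_batching, preferred_batch_size,
    max_queue_delay_microseconds) come from Triton's model configuration.
    """
    if max_batch_size < 1:
        raise ValueError("dynamic batching needs max_batch_size >= 1")
    for size in preferred_batch_sizes:
        if not 1 <= size <= max_batch_size:
            raise ValueError(f"preferred batch size {size} is outside 1..{max_batch_size}")

    preferred = ", ".join(str(size) for size in preferred_batch_sizes)
    return (
        f'name: "{name}"\n'
        f"max_batch_size: {max_batch_size}\n"
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {preferred} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )


# "merlin_ensemble" is a made-up model name.
config_text = render_dynamic_batching_config("merlin_ensemble", 64, [16, 32], 500)
print(config_text)
```

A larger `max_queue_delay_microseconds` gives the scheduler more time to fill a batch, which is the latency-for-throughput knob this issue is about.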

New Functionality

Models
- ...
Transformers4Rec
- ...
NVTabular
Dataloader

Systems

- [ ] Dynamic batching with Triton
- [ ] Serving-time padding operator (to use with dynamic batching)
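
The padding operator matters because, by default, Triton's dynamic batcher only combines requests whose inputs share the same shape, so variable-length sequence features need a common length before they can be batched together. A minimal sketch of the padding step in plain Python, assuming list-valued sequence features; the function name `pad_sequences` and its parameters are hypothetical, not an existing Merlin operator:

```python
def pad_sequences(sequences, pad_value=0, max_len=None):
    """Pad (and, if max_len is given, truncate) sequences to one common length.

    Returns the padded rows plus the number of real items kept in each row,
    so a model can mask out the padding later.
    """
    if max_len is not None:
        target = max_len
    else:
        target = max((len(seq) for seq in sequences), default=0)

    padded, lengths = [], []
    for seq in sequences:
        kept = list(seq)[:target]  # truncation keeps the first `target` items
        lengths.append(len(kept))
        padded.append(kept + [pad_value] * (target - len(kept)))
    return padded, lengths


rows, lengths = pad_sequences([[1, 2, 3], [4], []])
print(rows)
print(lengths)
```

Truncating from the front is just one choice made for this sketch; a session-based model may prefer to keep the most recent items instead.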

Examples

- [ ] Example of dynamic batching
- [ ] Blog post on dynamic batching and the tradeoff between latency and throughput
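
As a starting point for such an example, the tradeoff can be illustrated with a toy queue model. This is a simplified sketch, not Triton's actual scheduler: it assumes a batch is dispatched either when it reaches `max_batch_size` or when its oldest request has waited `max_queue_delay`, and it ignores model execution time. All names here are hypothetical.

```python
def simulate_batcher(arrivals, max_batch_size, max_queue_delay):
    """Group request arrival times into batches under a toy dynamic-batching rule.

    Returns a list of (dispatch_time, [arrival_times]) tuples.
    """
    batches = []
    pending = []
    for t in sorted(arrivals):
        # The oldest pending request hit its deadline before this arrival.
        if pending and t > pending[0] + max_queue_delay:
            batches.append((pending[0] + max_queue_delay, pending))
            pending = []
        pending.append(t)
        # The batch is full: dispatch immediately.
        if len(pending) == max_batch_size:
            batches.append((t, pending))
            pending = []
    if pending:
        batches.append((pending[0] + max_queue_delay, pending))
    return batches


def summarize(batches):
    """Report how many batches ran and the mean time requests spent queued."""
    waits = [dispatch - t for dispatch, reqs in batches for t in reqs]
    return {"num_batches": len(batches), "mean_wait": sum(waits) / len(waits)}


arrivals_ms = [0, 1, 2, 3, 4, 5, 6, 7]  # one request per millisecond
no_delay = summarize(simulate_batcher(arrivals_ms, max_batch_size=4, max_queue_delay=0))
with_delay = summarize(simulate_batcher(arrivals_ms, max_batch_size=4, max_queue_delay=10))
print(no_delay)
print(with_delay)
```

In this toy setup, allowing a queue delay cuts the number of model executions from eight to two while adding a mean queueing wait of 1.5 ms per request.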

Constraints:

Within Triton

Starting Point: