aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Support user-defined batch inference logic #123

Open dgcnz opened 1 year ago

dgcnz commented 1 year ago

Describe the feature you'd like

Currently, TorchServe's batch inference is handled by looping through the requests and feeding them individually to the user-defined transform functions (https://github.com/aws/sagemaker-inference-toolkit/pull/108). However, this doesn't take full advantage of the GPU's parallelism and compute power, yielding slower endpoints and low resource utilization.
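To illustrate the point, this is roughly what the per-request loop looks like conceptually (a simplified sketch, not the toolkit's actual code; `input_fn`, `predict_fn`, and `output_fn` are the existing user-defined hooks, see PR #108 for the real implementation):

```python
def transform_batch_per_request(request_bodies, content_types, accept, model,
                                input_fn, predict_fn, output_fn):
    """Simplified view of today's behavior: hooks are called once per request."""
    responses = []
    for body, content_type in zip(request_bodies, content_types):
        data = input_fn(body, content_type)        # decode a single request
        prediction = predict_fn(data, model)       # one forward pass per request
        responses.append(output_fn(prediction, accept))  # encode a single response
    return responses
```

Even when TorchServe delivers a batch of N requests, the model ends up running N separate forward passes.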

On the other hand, TorchServe's documentation on batch inference shows an example where the developer handles this logic themselves and feeds the entire input batch to the model in a single forward pass.
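For reference, a minimal custom handler in that style might look like the sketch below (the image decoding and tensor shapes are illustrative assumptions, not taken from the docs verbatim):

```python
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class BatchedImageHandler(BaseHandler):
    """TorchServe handler that feeds the whole request batch to the model at once."""

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        # `data` is the full list of requests in the batch, not a single request.
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(payload)).convert("RGB")
            images.append(self.transform(image))
        # Stack into a single (batch_size, C, H, W) tensor.
        return torch.stack(images).to(self.device)

    def inference(self, batch, *args, **kwargs):
        # One forward pass over the entire batch -> much better GPU utilization.
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, outputs):
        # Must return one response per request in the batch.
        return outputs.argmax(dim=1).tolist()
```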

For my use case, this is highly desirable to increase the throughput of the model.

How would this feature be used? Please describe.

Provide batch-level transform functions: if a user wants to customize the default batching logic, they can supply batch_input_fn, batch_predict_fn, and batch_output_fn, each of which receives the entire batch of requests as input (see the sketch below).
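To make the proposal concrete, a user's inference script could look roughly like this (the hook names and signatures follow the proposal above and do not exist in the toolkit today; the JSON decoding is just an example):

```python
# Hypothetical inference.py using the proposed batch-level hooks.
import json

import torch


def batch_input_fn(request_bodies, content_types):
    # Receives the entire batch of serialized requests at once.
    tensors = [torch.tensor(json.loads(body)) for body in request_bodies]
    return torch.stack(tensors)


def batch_predict_fn(batch, model):
    # Single forward pass over the whole batch instead of a per-request loop.
    device = next(model.parameters()).device
    with torch.no_grad():
        return model(batch.to(device))


def batch_output_fn(predictions, accept_types):
    # Must return one serialized response per request in the batch.
    return [json.dumps(p) for p in predictions.cpu().tolist()]
```

If a user does not define these hooks, the toolkit would fall back to the current per-request behavior, so the change would be backward compatible.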

Describe alternatives you've considered

I haven't found a way to achieve this with the sagemaker-pytorch-inference-toolkit, so I'm writing a custom Dockerfile that uses TorchServe directly.