aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Support user-defined batch inference logic #123

Open dgcnz opened 1 year ago

dgcnz commented 1 year ago

Describe the feature you'd like

Currently, TorchServe's batch inference is handled by looping through the requests and feeding them individually to the user-defined transform functions (https://github.com/aws/sagemaker-inference-toolkit/pull/108). However, this doesn't take full advantage of the GPU's parallelism and compute power, yielding slower endpoints and low resource utilization.
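To illustrate the point, this is roughly what the per-request loop looks like conceptually (a simplified sketch, not the toolkit's actual code; `input_fn`, `predict_fn`, and `output_fn` are the existing user-defined hooks, see PR #108 for the real implementation):

```python
def transform_batch_per_request(request_bodies, content_types, accept, model,
                                input_fn, predict_fn, output_fn):
    """Simplified view of today's behavior: hooks are called once per request."""
    responses = []
    for body, content_type in zip(request_bodies, content_types):
        data = input_fn(body, content_type)        # decode a single request
        prediction = predict_fn(data, model)       # one forward pass per request
        responses.append(output_fn(prediction, accept))  # encode a single response
    return responses
```

Even when TorchServe delivers a batch of N requests, the model ends up running N separate forward passes.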

On the other hand, TorchServe's documentation on batch inference shows an example where the developer handles this logic themselves and feeds the entire input batch to the model in a single forward pass.
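For reference, a minimal custom handler in that style might look like the sketch below (the image decoding and tensor shapes are illustrative assumptions, not taken from the docs verbatim):

```python
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class BatchedImageHandler(BaseHandler):
    """TorchServe handler that feeds the whole request batch to the model at once."""

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        # `data` is the full list of requests in the batch, not a single request.
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(payload)).convert("RGB")
            images.append(self.transform(image))
        # Stack into a single (batch_size, C, H, W) tensor.
        return torch.stack(images).to(self.device)

    def inference(self, batch, *args, **kwargs):
        # One forward pass over the entire batch -> much better GPU utilization.
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, outputs):
        # Must return one response per request in the batch.
        return outputs.argmax(dim=1).tolist()
```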

For my use case, this is highly desirable to increase the throughput of the model.

How would this feature be used? Please describe.

Provide batch-level transform functions: if a user wants to customize the default batching logic, they can supply batch_input_fn, batch_predict_fn, and batch_output_fn, each of which receives the entire batch of requests as input (see the sketch below).
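To make the proposal concrete, a user's inference script could look roughly like this (the hook names and signatures follow the proposal above and do not exist in the toolkit today; the JSON decoding is just an example):

```python
# Hypothetical inference.py using the proposed batch-level hooks.
import json

import torch


def batch_input_fn(request_bodies, content_types):
    # Receives the entire batch of serialized requests at once.
    tensors = [torch.tensor(json.loads(body)) for body in request_bodies]
    return torch.stack(tensors)


def batch_predict_fn(batch, model):
    # Single forward pass over the whole batch instead of a per-request loop.
    device = next(model.parameters()).device
    with torch.no_grad():
        return model(batch.to(device))


def batch_output_fn(predictions, accept_types):
    # Must return one serialized response per request in the batch.
    return [json.dumps(p) for p in predictions.cpu().tolist()]
```

If a user does not define these hooks, the toolkit would fall back to the current per-request behavior, so the change would be backward compatible.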

Describe alternatives you've considered

I haven't found a way to achieve this with the sagemaker-pytorch-inference-toolkit, so I'm writing a custom Dockerfile that uses TorchServe directly.