Summary of Changes
Closes #55
Motivation
Some models require multiple input tensors, so we should support those use cases. This change also brings the server into alignment with how the inference API is used in KServe.
Implementation
Originally, multiple input tensors in a request were used as a form of pseudo-batching by the user: each tensor was treated as a separate inference request to a model that was assumed to have only one input. This conflicts with the true purpose of multiple input tensors, which is to support models that genuinely require more than one input. Since all the current tests use models with a single input tensor, their requests now each contain a single input tensor. The sketch below illustrates the new semantics.
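The following is a minimal, illustrative sketch of the new request semantics. The InputTensor and InferenceRequest types here are toy stand-ins, not the server's actual classes; the point is only that several input tensors now belong to one request aimed at a multi-input model, rather than being fanned out as separate single-input requests.

```cpp
// Illustrative only: toy types standing in for the server's request objects.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct InputTensor {
  std::string name;
  std::vector<int64_t> shape;
  std::vector<float> data;
};

struct InferenceRequest {
  std::vector<InputTensor> inputs;  // all tensors belong to ONE request
};

int main() {
  InferenceRequest request;
  // Two input tensors for a model that genuinely takes two inputs.
  request.inputs.push_back({"input_a", {1, 4}, {1.0F, 2.0F, 3.0F, 4.0F}});
  request.inputs.push_back({"input_b", {1, 4}, {5.0F, 6.0F, 7.0F, 8.0F}});

  // Under the new semantics this is a single inference on a two-input model,
  // not two inferences on a one-input model.
  std::cout << "one request, " << request.inputs.size() << " input tensors\n";
  return 0;
}
```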
The modelInfer API is blocking, so to still allow batching under the new semantics, I've added an asynchronous counterpart to KServe's modelInfer API that returns a future in C++. This may also work in Python with Pybind11, but for now the Python examples use the multiprocessing library to run inferences in parallel.
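As a rough sketch of the future-based pattern this enables: the blockingModelInfer and modelInferAsync functions below are placeholders (not the server's real API), but they show how an asynchronous counterpart that returns a std::future lets a client keep several requests in flight and collect the results later.

```cpp
// Illustrative only: a stand-in blocking inference call wrapped so that it
// returns a std::future, mirroring the idea of an asynchronous modelInfer
// counterpart.
#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Placeholder for a blocking inference call against a named model.
std::string blockingModelInfer(const std::string& model, int request_id) {
  std::this_thread::sleep_for(std::chrono::milliseconds(100));  // simulated latency
  return model + " -> response " + std::to_string(request_id);
}

// Asynchronous counterpart: launch the blocking call and hand back a future.
std::future<std::string> modelInferAsync(const std::string& model, int request_id) {
  return std::async(std::launch::async, blockingModelInfer, model, request_id);
}

int main() {
  // Issue several requests without waiting on each one, then collect results.
  std::vector<std::future<std::string>> pending;
  for (int i = 0; i < 4; ++i) {
    pending.push_back(modelInferAsync("my_model", i));
  }
  for (auto& response : pending) {
    std::cout << response.get() << '\n';  // blocks only when the result is needed
  }
  return 0;
}
```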