Closed by varunsh-xilinx 2 years ago
Here's my proposal for implementation:
Worker behavior should be abstracted at the level of input/output tensors; that is, the worker should be general enough to process any legal model without making any model-specific assumptions about the content or meaning of its tensors. (For example, the Yolov4 model crams both bounding-box edges and probability scores into the same output vector.)
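To make that concrete, here's a minimal sketch of what a model-agnostic worker looks like. This is illustrative only; the class and method names (`input_metadata`, `infer`) are hypothetical placeholders, not the actual worker interface:

```python
class GenericWorker:
    """A hypothetical worker: it knows tensor shapes and dtypes,
    but nothing about what the tensors mean."""

    def __init__(self, model):
        self.model = model  # any loaded model exposing tensor metadata

    def run(self, input_tensors):
        # Validate only shape and dtype -- never inspect tensor contents.
        for tensor, meta in zip(input_tensors, self.model.input_metadata()):
            if tensor.shape != meta.shape or tensor.dtype != meta.dtype:
                raise ValueError(f"bad input tensor: {meta.name}")
        # A Yolov4 output vector mixing box edges and scores passes
        # through untouched; interpreting it is the client's job.
        return self.model.infer(input_tensors)
```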
Although I think I can make the worker completely model-agnostic, each supported model needs a client, written against the Inference Server API, that serves as the interface layer between the user and the engine. All model-specific behavior, as well as data checking, is offloaded to the client. Clients are responsible for preprocessing and composing input requests; guaranteeing that requests are well-formed and contain safe data; and parsing and post-processing responses. This should aid optimization by allowing the worker to skip most safety and formatting checks, as long as the input tensors have the right shapes.
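As a rough sketch of that division of labor (the request layout below is a generic tensor-based format, and `server.infer` plus the normalization constants are placeholders, not the real Inference Server API):

```python
import numpy as np

MEAN, STD = 0.5, 0.25  # placeholder normalization constants

def classify(pixels: np.ndarray, server) -> int:
    # 1. Preprocess: all model-specific transforms happen client-side.
    x = (pixels.astype(np.float32) / 255.0 - MEAN) / STD

    # 2. Compose a well-formed request; the worker trusts its shape/dtype.
    request = {"inputs": [{"name": "data",
                           "shape": list(x.shape),
                           "datatype": "FP32",
                           "data": x.flatten().tolist()}]}

    # 3. The worker returns raw output tensors; interpreting them is
    #    the client's job too.
    response = server.infer("resnet50", request)  # hypothetical client call
    logits = np.asarray(response["outputs"][0]["data"])
    return int(logits.argmax())  # post-process: top-1 class index
```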
The above is just a restatement of the Inference Server's longstanding design assumptions, but it implies that we need to deliver a client for each supported model, and that each client must go through a complete QA process.
Brian is implementing this for the Migraphx (GPU) worker, including example clients for a multi-input model (Bert) and a multi-output model (Yolo); see the sketch below. The clients will be minor rewrites of examples already in the Migraphx repo, modified slightly to use the Inference Server API.
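For concreteness, the two cases look roughly like this on the wire, reusing the placeholder request format from the sketch above (tensor names and shapes are illustrative, not taken from the actual examples):

```python
import numpy as np

# Multi-input (Bert-style): the request carries several named input tensors.
bert_request = {"inputs": [
    {"name": "input_ids",      "shape": [1, 128], "datatype": "INT64",
     "data": np.zeros((1, 128), dtype=np.int64).flatten().tolist()},
    {"name": "attention_mask", "shape": [1, 128], "datatype": "INT64",
     "data": np.ones((1, 128), dtype=np.int64).flatten().tolist()},
]}

# Multi-output (Yolo-style): the response carries several output tensors,
# and the client pairs them up during post-processing.
def parse_yolo(response, threshold=0.5):
    outputs = {o["name"]: np.asarray(o["data"]) for o in response["outputs"]}
    boxes = outputs["boxes"].reshape(-1, 4)  # one (x1, y1, x2, y2) per detection
    scores = outputs["scores"]
    return [(b, s) for b, s in zip(boxes, scores) if s > threshold]
```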
This was addressed in #74 but is awaiting confirmation from @bpickrel that it works on his end too.
**Is your feature request related to a problem? Please describe.**
Some models require multiple input tensors and/or produce multiple output tensors, and the server should handle that.

**Describe the solution you'd like**
Such models can be used for inference.

**Describe alternatives you've considered**
N/A

**Additional context**
This is theoretically supported but needs testing and verification. @bpickrel is currently trying it.