Closed by varunsh-xilinx 2 years ago
Here's my proposal for implementation:
Worker behavior should be abstracted at the level of input/output tensors; that is, the worker should be general enough to process any legal model without making any model-specific assumptions about the content or meaning of its tensors. (For example, the Yolov4 model crams both bounding-box edges and probability scores into the same output vector.)
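To make that concrete, here's a minimal sketch of what a model-agnostic worker looks like. This is illustrative only; the class and method names (`input_metadata`, `infer`) are hypothetical placeholders, not the actual worker interface:

```python
class GenericWorker:
    """A hypothetical worker: it knows tensor shapes and dtypes,
    but nothing about what the tensors mean."""

    def __init__(self, model):
        self.model = model  # any loaded model exposing tensor metadata

    def run(self, input_tensors):
        # Validate only shape and dtype -- never inspect tensor contents.
        for tensor, meta in zip(input_tensors, self.model.input_metadata()):
            if tensor.shape != meta.shape or tensor.dtype != meta.dtype:
                raise ValueError(f"bad input tensor: {meta.name}")
        # A Yolov4 output vector mixing box edges and scores passes
        # through untouched; interpreting it is the client's job.
        return self.model.infer(input_tensors)
```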
Although I think I can make the worker completely model-agnostic, each supported model needs a client, written against the Inference Server API, that serves as the interface layer between the user and the engine. All model-specific behavior, as well as data checking, is offloaded to the client. Clients are responsible for preprocessing and composing input requests; guaranteeing that requests are well-formed and contain safe data; and parsing and post-processing responses. This should aid optimization by allowing the worker to skip most safety and formatting checks, as long as the input tensors have the right shapes.
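As a rough sketch of that division of labor (the request layout below is a generic tensor-based format, and `server.infer` plus the normalization constants are placeholders, not the real Inference Server API):

```python
import numpy as np

MEAN, STD = 0.5, 0.25  # placeholder normalization constants

def classify(pixels: np.ndarray, server) -> int:
    # 1. Preprocess: all model-specific transforms happen client-side.
    x = (pixels.astype(np.float32) / 255.0 - MEAN) / STD

    # 2. Compose a well-formed request; the worker trusts its shape/dtype.
    request = {"inputs": [{"name": "data",
                           "shape": list(x.shape),
                           "datatype": "FP32",
                           "data": x.flatten().tolist()}]}

    # 3. The worker returns raw output tensors; interpreting them is
    #    the client's job too.
    response = server.infer("resnet50", request)  # hypothetical client call
    logits = np.asarray(response["outputs"][0]["data"])
    return int(logits.argmax())  # post-process: top-1 class index
```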
The above is just a restatement of the Inference Server's longstanding design assumptions, but it implies that we need to deliver a client for each supported model, and that each client must go through a complete QA process.
Brian is implementing this for the Migraphx (GPU) worker, including example clients for a multi-input model (Bert) and a multi-output model (Yolo); see the sketch below. The clients will be minor rewrites of examples already in the Migraphx repo, modified slightly to use the Inference Server API.
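For concreteness, the two cases look roughly like this on the wire, reusing the placeholder request format from the sketch above (tensor names and shapes are illustrative, not taken from the actual examples):

```python
import numpy as np

# Multi-input (Bert-style): the request carries several named input tensors.
bert_request = {"inputs": [
    {"name": "input_ids",      "shape": [1, 128], "datatype": "INT64",
     "data": np.zeros((1, 128), dtype=np.int64).flatten().tolist()},
    {"name": "attention_mask", "shape": [1, 128], "datatype": "INT64",
     "data": np.ones((1, 128), dtype=np.int64).flatten().tolist()},
]}

# Multi-output (Yolo-style): the response carries several output tensors,
# and the client pairs them up during post-processing.
def parse_yolo(response, threshold=0.5):
    outputs = {o["name"]: np.asarray(o["data"]) for o in response["outputs"]}
    boxes = outputs["boxes"].reshape(-1, 4)  # one (x1, y1, x2, y2) per detection
    scores = outputs["scores"]
    return [(b, s) for b, s in zip(boxes, scores) if s > threshold]
```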
This was addressed in #74 but is awaiting confirmation from @bpickrel that it works on his end too.
**Is your feature request related to a problem? Please describe.**
Some models require multiple input tensors and/or produce multiple output tensors, and the server should handle that.

**Describe the solution you'd like**
Such models can be used for inference.

**Describe alternatives you've considered**
N/A

**Additional context**
This is theoretically supported but needs testing and verification. @bpickrel is currently trying it.