Xilinx / inference-server

https://xilinx.github.io/inference-server/
Apache License 2.0

Add C++ worker #172

Closed: varunsh-xilinx closed 1 year ago

varunsh-xilinx commented 1 year ago

Summary of Changes

Motivation

Similar to the other workers that accept a "model" and execute it, I've added a C++ worker that can execute a C++ shared library. Instead of writing a custom worker for every new C++-based operation, a simpler "model" can be described that accepts a batch, performs some computation, and produces a new batch. Thus, the C++ worker is the first baby step towards model ensembles: eventually, all workers will need to consume and produce batches this way.

Implementation

Historically, ParameterMap was passed around as a mix of shared pointers, raw pointers, references, and values at different points. Now, it's passed more uniformly: by const reference, or by value where the underlying function modifies it. All calls accept the object itself rather than an address or pointer.
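
As a rough illustration of the new convention (the function names here are hypothetical, not the repo's actual signatures):

```cpp
class ParameterMap {};  // stand-in for the server's ParameterMap

// Read-only consumers take a const reference...
void logParameters(const ParameterMap& parameters);

// ...while functions that modify their own copy take it by value
void applyDefaults(ParameterMap parameters);
```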

The C++ worker uses dynamic loading to load a shared library that defines a C++ model. The C++ model API is defined as follows (for now; a sketch follows the list):

  1. getInputs(): get a vector of Tensors describing the inputs
  2. getOutputs(): get a vector of Tensors describing the outputs
  3. run(): accepts two batch pointers (one input, one output)
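
A minimal sketch of what such a model class might look like, using stand-in Tensor and Batch stubs (the real classes live in the server, and the factory symbol at the end is made up for illustration):

```cpp
#include <vector>

class Tensor {};  // stand-in: name, shape, datatype
class Batch {};   // stand-in: buffers for a batch of requests

class CppModel {
 public:
  virtual ~CppModel() = default;

  // 1. describe the model's input tensors
  virtual std::vector<Tensor> getInputs() = 0;
  // 2. describe the model's output tensors
  virtual std::vector<Tensor> getOutputs() = 0;
  // 3. read from the input batch, write into the preallocated output batch
  virtual void run(Batch* input, Batch* output) = 0;
};

// Since the worker loads the library dynamically, it presumably resolves a
// factory with C linkage (e.g. via dlsym); this symbol name is hypothetical.
extern "C" CppModel* createModel();
```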

The model is not given access to the memory pool, so the C++ worker is responsible for creating a new batch, allocating buffers based on the model's inputs and outputs, and passing this batch to the model (a toy sketch of this contract follows below). Of course, this only works if the input and output shapes are known. I've added two models so far, echo and echo_multi, based on the workers of the same name. In theory, those workers could be entirely replaced by these two models, but I haven't done that yet because the echo worker is used in many of the tests. Still, having two models with different input/output shapes adds confidence that the worker can run arbitrary C++ models.
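
To make the division of labor concrete, here is a toy, self-contained version of that contract; the types are simplified stand-ins, and the real echo model's computation may differ:

```cpp
#include <cstddef>
#include <vector>

struct ToyBatch {
  std::vector<float> data;  // stand-in for the server's Batch buffers
};

struct ToyEchoModel {
  // The model only reads the input and fills the output; it never allocates
  void run(const ToyBatch* input, ToyBatch* output) const {
    for (std::size_t i = 0; i < input->data.size(); ++i) {
      output->data[i] = input->data[i];  // echo the data through
    }
  }
};

int main() {
  ToyBatch input{{1.0F, 2.0F, 3.0F}};

  // Worker-side responsibility: allocate the output from the known shapes
  ToyBatch output;
  output.data.resize(input.data.size());

  ToyEchoModel{}.run(&input, &output);
  return 0;
}
```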

The other change has been to refactor the inference request objects and tensors into new files and create an inheritance hierarchy based on Tensor -> InferenceTensor -> (InferenceRequestInput, InferenceResponseOutput). This lets us use the objects in related contexts and reduces logic duplication.
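
A skeletal view of that hierarchy (the members suggested in comments are illustrative guesses, not the actual declarations):

```cpp
class Tensor {
 public:
  virtual ~Tensor() = default;
  // common metadata: name, shape, datatype
};

class InferenceTensor : public Tensor {
  // what request inputs and response outputs share, e.g. tensor parameters
};

class InferenceRequestInput : public InferenceTensor {
  // input-specific state, e.g. a view of the incoming data buffer
};

class InferenceResponseOutput : public InferenceTensor {
  // output-specific state, e.g. owned result data
};
```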

Notes

Currently, "models" with non-deterministic input and output tensor shapes are not supported. One way to support variable input/output shapes would be to let the model return empty input/output shapes and use a different override of the run method that accepts a single batch pointer and returns a new unique_ptr to a batch. The model would have to allocate new memory separate from the memory pool.

gbuildx commented 1 year ago

Build successful!