Summary of Changes
Closes #173
Motivation
Chains are a simpler version of model ensembles: a linear graph of workers with one input and one output. Enabling chaining allows users to define pipelines in the server, for example for server-side pre- and post-processing steps.
Implementation
This PR implements static chaining, where the chain is defined at load-time. The chain is defined through a special load-time parameter, `next`, whose value is the name of the next worker in the chain. If this parameter is not present at load-time, the server automatically inserts the Responder as the next node. This information, along with the next worker's allocators, is set on construction. Because the name of the next worker needs to be known at load-time, workers in a chain must be loaded in reverse order so the endpoint of each worker's successor is already known. Calling the new `loadEnsemble` method with a single worker is equivalent to calling `workerLoad`.
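To make the reverse-order loading concrete, here is a minimal sketch of how `loadEnsemble` could be expressed in terms of `workerLoad` and the `next` parameter. The `Parameters` type and the stubbed `workerLoad` body are illustrative assumptions, not the server's actual API; only the `next` parameter and the two method names come from this PR.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Parameters = std::unordered_map<std::string, std::string>;

// Stub standing in for the real call: loads one worker with the given
// parameters and returns the endpoint it was assigned.
std::string workerLoad(const std::string& worker, Parameters params) {
  std::cout << "loading " << worker;
  if (auto it = params.find("next"); it != params.end()) {
    std::cout << " (next = " << it->second << ")";
  }
  std::cout << "\n";
  return worker;  // pretend the endpoint is just the worker name
}

// Conceptual loadEnsemble: workers are listed first-to-last but loaded in
// reverse, so each worker's "next" endpoint is known at load-time. The last
// worker gets no "next" parameter, so the server chains it to the Responder.
std::string loadEnsemble(const std::vector<std::string>& workers) {
  std::string next_endpoint;
  for (auto it = workers.rbegin(); it != workers.rend(); ++it) {
    Parameters params;
    if (!next_endpoint.empty()) {
      params.emplace("next", next_endpoint);
    }
    next_endpoint = workerLoad(*it, std::move(params));
  }
  return next_endpoint;  // endpoint of the first worker: the chain's input
}

int main() {
  // A two-stage pipeline: server-side preprocessing feeding a model worker.
  loadEnsemble({"preprocess", "resnet50"});
  // loadEnsemble({"resnet50"}) is equivalent to workerLoad("resnet50", {}).
  return 0;
}
```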
Each worker (except for streaming) has been updated to work with chaining. The primary change is that each worker now receives one batch and must produce a new batch, rather than receiving the batch queue pointer directly. Responding to the client is moved entirely to the Responder, which functions as the end node for all workers. Common logic (propagating batch and request metadata, extracting batches from the queue, etc.) is moved into helper functions or the base Worker class.
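A rough sketch of this batch-in/batch-out contract, using simplified stand-in types rather than the server's actual classes:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Simplified placeholder for a batch of requests and its metadata.
struct Batch {
  std::vector<float> data;       // stand-in for tensors
  std::vector<int> request_ids;  // stand-in for per-request metadata
};

class Worker {
 public:
  virtual ~Worker() = default;

  // Each worker transforms one incoming batch into one outgoing batch for
  // the next node in the chain, instead of touching the batch queue itself.
  virtual std::unique_ptr<Batch> doRun(std::unique_ptr<Batch> input) = 0;

 protected:
  // Common logic such as propagating request metadata lives in the base
  // class so individual workers don't duplicate it.
  static void propagateMetadata(const Batch& from, Batch* to) {
    to->request_ids = from.request_ids;
  }
};

// The Responder is the implicit end node: instead of producing another
// batch it replies to the client (elided here).
class Responder final : public Worker {
 public:
  std::unique_ptr<Batch> doRun(std::unique_ptr<Batch> input) override {
    (void)input;     // would build and send a response per request here
    return nullptr;  // nothing further down the chain
  }
};

// Driving a chain is then just handing each worker's output to the next.
std::unique_ptr<Batch> runChain(const std::vector<Worker*>& chain,
                                std::unique_ptr<Batch> batch) {
  for (Worker* worker : chain) {
    batch = worker->doRun(std::move(batch));
    if (!batch) break;  // the Responder terminates the chain
  }
  return batch;
}
```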
Notes
"True ensembles" where the output of a common pre-processing stage is fed to multiple models in parallel is not supported yet
Loading chains from a model repository is also not currently supported
Each worker in a chain still loads its own batcher. This enables each worker in the chain to function as an independent input into the chain, but this behavior may change in the future.
The response expects a model name, but it can be unclear what the "model" is in a chain. For now, the workaround is to add the model name to the batch metadata and have each worker set it only if it is still empty. This means that the first worker in the chain that sets this value sets it for the overall request.
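A minimal sketch of that "set the model if empty" workaround; the `BatchMetadata` type and helper name here are hypothetical, not the actual implementation:

```cpp
#include <string>

struct BatchMetadata {
  std::string model;  // empty until some worker in the chain claims it
};

// Each worker calls this with its own model name; only the first worker
// with a non-empty name ends up labeling the whole request.
void setModelIfUnset(BatchMetadata* metadata, const std::string& model) {
  if (metadata->model.empty()) {
    metadata->model = model;
  }
}
```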