Project-MONAI / monai-deploy-app-sdk

MONAI Deploy App SDK offers a framework and associated tools to design, develop and verify AI-driven applications in the healthcare imaging domain.
Apache License 2.0

[FEA] Add remote inference support with the use of Triton Inference Server #212

Open MMelQin opened 2 years ago

MMelQin commented 2 years ago

**Is your feature request related to a problem? Please describe.**
The App SDK currently supports inference within the application process itself. This is simple and efficient for some use cases, but when multiple applications/models are hosted in a "production" environment, a remote inference service, e.g. Triton, may be needed so that the resource-heavy inference workload can be centrally managed.

**Describe the solution you'd like**
Add remote inference support to the built-in inference operators in the App SDK, with runtime options, e.g. using the strategy pattern, to support the choice of in-proc or remote inference (see the sketch below).

**Describe alternatives you've considered**
One of the main reasons to use a remote inference server, e.g. Triton, is to have dedicated model and runtime resource management (scheduling and queuing inference requests), so that the application need not directly request local GPU and/or system memory. Without the remote service, the whole application with in-proc inference needs to be scheduled on servers running multiple applications, or instances thereof, to ensure resources are available when the app requests them. This is simpler with just a system memory request, but GPU memory requests have to be properly managed (e.g. a Kubernetes fractional request on a visible GPU).

**Additional context**
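A minimal sketch of the strategy-pattern idea above, assuming a NumPy-in/NumPy-out contract. The class and method names (`InferenceStrategy`, `InProcTorchStrategy`, `infer`, etc.) are hypothetical illustrations, not App SDK API; the remote path uses the real `tritonclient.grpc` client (`pip install tritonclient[grpc]`).

```python
from abc import ABC, abstractmethod

import numpy as np


class InferenceStrategy(ABC):
    """Pluggable inference backend for an inference operator."""

    @abstractmethod
    def infer(self, inputs: np.ndarray) -> np.ndarray:
        ...


class InProcTorchStrategy(InferenceStrategy):
    """Runs a TorchScript model inside the application process."""

    def __init__(self, model_path: str):
        import torch  # imported lazily so the remote-only path needs no torch

        self._torch = torch
        self._model = torch.jit.load(model_path).eval()

    def infer(self, inputs: np.ndarray) -> np.ndarray:
        with self._torch.no_grad():
            return self._model(self._torch.from_numpy(inputs)).numpy()


class TritonRemoteStrategy(InferenceStrategy):
    """Delegates inference to a remote Triton Inference Server over gRPC."""

    def __init__(self, url: str, model_name: str, input_name: str, output_name: str):
        import tritonclient.grpc as grpcclient  # pip install tritonclient[grpc]

        self._grpc = grpcclient
        self._client = grpcclient.InferenceServerClient(url=url)
        self._model_name = model_name
        self._input_name = input_name
        self._output_name = output_name

    def infer(self, inputs: np.ndarray) -> np.ndarray:
        # "FP32" assumes a float32 input tensor; adjust for other dtypes.
        infer_input = self._grpc.InferInput(self._input_name, list(inputs.shape), "FP32")
        infer_input.set_data_from_numpy(inputs)
        result = self._client.infer(model_name=self._model_name, inputs=[infer_input])
        return result.as_numpy(self._output_name)
```

An operator could then accept a strategy instance, or a runtime flag selecting one, so the same application code runs unchanged in-proc or against a remote server.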

vikashg commented 2 years ago

If we add Triton support, we should also add a method to extract the names of the input and output nodes needed to create the config.pbtxt file for Triton. @slbryson worked on this last month and should have more notes on it.

Can we also generate this config.pbtxt file automatically, given a PyTorch model?
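A hedged sketch of such auto-generation: since TorchScript does not embed tensor metadata (see the next comment), this version discovers shapes by running the model on a caller-supplied example input. The helper name, the fixed `TYPE_FP32` data type, and the single-input/single-output assumption are all illustrative; `INPUT__0`/`OUTPUT__0` follow the naming convention of Triton's PyTorch (libtorch) backend.

```python
import torch


def generate_config_pbtxt(model: torch.nn.Module, example_input: torch.Tensor,
                          model_name: str, max_batch_size: int = 0) -> str:
    """Render a minimal config.pbtxt by probing the model with an example input.

    Assumes a single float32 input tensor and a single float32 output tensor.
    """
    model.eval()
    with torch.no_grad():
        example_output = model(example_input)

    def dims(t: torch.Tensor) -> str:
        return ", ".join(str(d) for d in t.shape)

    # INPUT__0/OUTPUT__0 are the conventional tensor names for the libtorch backend.
    return f"""name: "{model_name}"
platform: "pytorch_libtorch"
max_batch_size: {max_batch_size}
input [
  {{ name: "INPUT__0", data_type: TYPE_FP32, dims: [ {dims(example_input)} ] }}
]
output [
  {{ name: "OUTPUT__0", data_type: TYPE_FP32, dims: [ {dims(example_output)} ] }}
]
"""
```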

MMelQin commented 2 years ago

@vikashg Just to add a little more information after the monthly sync-up meeting with the Triton team: the Triton Inference Server can parse the metadata (tensor dims, data types, etc.) of all supported model types except PyTorch, due to PyTorch's inherent lack of such support. Because of this, the Triton team actually filed an issue for PyTorch, over a year ago, to embed metadata in PyTorch models.

Within MONAI, a similar issue had also been discussed, and an issue/PR was created to add model metadata to the model zip as a non-standard way to convey the information; it is really up to the model exporter to decide whether to set the metadata. Of course, models of unknown provenance will not adhere to this anyway.

I will file a separate ticket for Triton, specifying the need for it to load the TorchScript model and parse out the tensor dims and types (the tensor names are really Triton-specific and can be chosen by the app dev); this will piggyback on Triton's request to PyTorch.

ericspod commented 2 years ago

Hi @MMelQin and @vikashg, the PR I opened on MONAI would be a good fix for the lack of metadata. I can see this mechanism being used to store the information Triton needs, as well as a huge variety and volume of other things relating to the model and its use context. On top of a metadata JSON file, we could also include example notebooks or scripts in the TorchScript zip file. If you have any comments to add to the PR, please do, and I can revisit it to get it integrated into core if you think it's a good mechanism.
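For illustration, one mechanism that can carry such metadata is the `_extra_files` argument of `torch.jit.save`/`torch.jit.load`, which stores arbitrary named files inside the TorchScript zip. Whether the PR uses exactly this mechanism is an assumption here, and the metadata schema shown is hypothetical.

```python
import json

import torch


class TinyNet(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x)


scripted = torch.jit.script(TinyNet())

# Hypothetical metadata schema; the real content would be defined by the PR/consumers.
metadata = {
    "inputs": [{"dims": [1, 1, 256, 256], "dtype": "float32"}],
    "outputs": [{"dims": [1, 1, 256, 256], "dtype": "float32"}],
}
torch.jit.save(scripted, "model.ts", _extra_files={"metadata.json": json.dumps(metadata)})

# Reading it back: pre-populate the dict with the file names to retrieve.
extra_files = {"metadata.json": ""}
torch.jit.load("model.ts", _extra_files=extra_files)
print(json.loads(extra_files["metadata.json"]))
```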

ericspod commented 2 years ago

I'll mention here that the core team is hashing out a format for stored models that would include more information than just metadata. We've started looking at MMAR and the experience with it, and comparing with how MLflow, Hugging Face, and others have tackled similar problems.

ristoh commented 2 years ago

@ericspod can you add a link to the PR or conversation from the core WG work you're referring to?

dbericat commented 2 years ago

@CPBridge have a look at this.

ericspod commented 2 years ago

We have an issue open for discussion.