NVIDIA / VideoProcessingFramework

A set of Python bindings to C++ libraries that provides full HW acceleration for video decoding and encoding, plus GPU-accelerated color space and pixel format conversions
Apache License 2.0

Feature request: implement VPF <-> Triton interoperability #207

Open rarzumanyan opened 3 years ago

rarzumanyan commented 3 years ago

Description is taken from #205 by @philipp-schmidt

Triton is basically a server that runs your ML networks and offers HTTP or gRPC endpoints for them. Clients can send a request to invoke the networks and receive results without having to use frameworks or bother with implementation details, GPU acceleration, etc. It's by far the best tool to scale ML workloads, in my opinion.

To start a request to the server there are basically three options to pass the data (e.g. an image to analyze via object detection):

  1. Send the raw data over the network as part of the HTTP / gRPC request.
  2. Use shared system memory between client and server.
  3. Use shared CUDA (GPU) memory between client and server.

Obviously the performance increases with the order in the list, as the number of memory copy operations goes down to basically zero once GPU memory is shared.

It would obviously be very useful if VPF could simply provide Triton Inference Server with the necessary pointers to the data in GPU memory. This would mean that VPF needs a very good set of processing layers (like the already existing scaling and color conversion operations), because most networks expect a very well-defined input and the conversion needs to happen on the GPU. And there might be things like batching (computing multiple images at once) which could make things more complicated.
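For context, VPF's existing GPU-side layers already cover the decode → color convert → resize chain. A minimal sketch of that pipeline, assuming VPF's `PyNvCodec` module; the input path, 640x640 target size, and BT.601/MPEG color settings are placeholder assumptions:

```python
import PyNvCodec as nvc

gpu_id = 0
# Decode on the GPU; "input.mp4" is a placeholder path.
nv_dec = nvc.PyNvDecoder("input.mp4", gpu_id)

# NV12 -> RGB conversion and resize, both on the GPU.
to_rgb = nvc.PySurfaceConverter(
    nv_dec.Width(), nv_dec.Height(),
    nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
resizer = nvc.PySurfaceResizer(640, 640, nvc.PixelFormat.RGB, gpu_id)
cc_ctx = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601,
                                         nvc.ColorRange.MPEG)

surface = nv_dec.DecodeSingleSurface()
if not surface.Empty():
    rgb = to_rgb.Execute(surface, cc_ctx)
    rgb_small = resizer.Execute(rgb)
    # rgb_small still lives in GPU memory; its device pointer is what
    # Triton would need for true zero-copy inference.
    print(hex(rgb_small.PlanePtr().GpuMem()))
```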

rarzumanyan commented 2 years ago

Collecting more user data to put requirements together.

@philipp-schmidt @clinthidinger @ruilongzhang @dbasbabasi

Please provide some information regarding the way you'd like to use VPF in conjunction with Triton. As far as I understand, there are several possible options:

  1. Link VPF against Triton server C libraries: #1. Basically, Triton would be used as a back end, orchestrating inference on the local machine.
  2. Implement a gRPC / HTTP client which supports shared system memory: #2.
  3. Implement a gRPC / HTTP client which supports shared CUDA memory: #3.
  4. Develop a Python sample which uses the Triton client Python API in conjunction with the VPF Python API to get some video frames and send them to Triton server for inference (see the sketch below).

Please correct me if I'm missing something.
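
For option 4, a rough sketch of what such a sample could look like, assuming a hypothetical detection model named "detector" with an FP32 input tensor "images" served at localhost:8001; the video path and tensor layout are also placeholder assumptions:

```python
import numpy as np
import PyNvCodec as nvc
import tritonclient.grpc as grpcclient

gpu_id = 0
nv_dec = nvc.PyNvDecoder("input.mp4", gpu_id)
to_rgb = nvc.PySurfaceConverter(
    nv_dec.Width(), nv_dec.Height(),
    nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
downloader = nvc.PySurfaceDownloader(
    nv_dec.Width(), nv_dec.Height(), nvc.PixelFormat.RGB, gpu_id)
cc_ctx = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601,
                                         nvc.ColorRange.MPEG)

client = grpcclient.InferenceServerClient(url="localhost:8001")

frame = np.ndarray(shape=(0,), dtype=np.uint8)
surface = nv_dec.DecodeSingleSurface()
rgb = to_rgb.Execute(surface, cc_ctx)
if downloader.DownloadSingleSurface(rgb, frame):
    # HWC uint8 -> NCHW float32, which many detection models expect.
    tensor = frame.reshape(nv_dec.Height(), nv_dec.Width(), 3)
    tensor = tensor.transpose(2, 0, 1)[None].astype(np.float32) / 255.0

    inp = grpcclient.InferInput("images", list(tensor.shape), "FP32")
    inp.set_data_from_numpy(tensor)
    result = client.infer(model_name="detector", inputs=[inp])
```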

philipp-schmidt commented 2 years ago

Sorry for the late reply. The only option giving a performance benefit would be option 3. All the other options require a copy back to host memory. If the image is in host memory then passing it to Triton is really easy and literally what it was built for.

Again emphasizing that I'm currently not seeing any bottlenecks from copying the decoded frames back to host memory, even when the GPU is under heavy load. It looks like GPU memory bandwidth is plenty for most use cases. This would be more of a "nice to have" and "wow, what an efficient architecture" thing to do.
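
For reference, Triton's Python client already exposes a CUDA shared memory API, so the registration half of option 3 exists today. A sketch of that flow, under the same hypothetical "detector" / "images" model assumptions as above; note that the stock `set_shared_memory_region` call fills the region from a *host* array, so the missing piece for a zero-copy VPF integration would be a device-to-device copy from the decoded Surface into the registered region:

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical layout: batch of 1, 3x640x640 float32.
shape = [1, 3, 640, 640]
byte_size = int(np.prod(shape)) * 4

# Allocate a CUDA shared memory region on GPU 0 and register it.
shm_handle = cudashm.create_shared_memory_region("vpf_input", byte_size, 0)
client.register_cuda_shared_memory(
    "vpf_input", cudashm.get_raw_handle(shm_handle), 0, byte_size)

# The stock client copies from a host array; a real VPF integration
# would instead copy device-to-device from the decoded Surface's
# GpuMem() pointer into this region.
dummy = np.zeros(shape, dtype=np.float32)
cudashm.set_shared_memory_region(shm_handle, [dummy])

inp = grpcclient.InferInput("images", shape, "FP32")
inp.set_shared_memory("vpf_input", byte_size)
result = client.infer(model_name="detector", inputs=[inp])

client.unregister_cuda_shared_memory("vpf_input")
cudashm.destroy_shared_memory_region(shm_handle)
```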

philipp-schmidt commented 2 years ago

Also, I believe it might already be implemented like that in the DeepStream Triton plugin. Not sure though.

deepsworld commented 2 years ago

> Please provide some information regarding the way you'd like to use VPF in conjunction with Triton. As far as I understand, there are several possible options: […]

I would be interested in option 3, where VPF handles decoding, scaling, etc. on the GPU and also handles asynchronous batch creation from multiple video streams.
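
A sketch of what the client side of that could look like, under the same hypothetical "detector" / "images" assumptions as above: frames from several streams are stacked into one batch and submitted with the Triton gRPC client's async API, so decoding can continue while inference runs. Random host arrays stand in for the downloaded frames here:

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

def on_result(result, error):
    # Callback runs on the client's worker thread.
    if error is not None:
        print(error)
    else:
        print(result.get_response().model_name)

# `frames` would be per-stream NCHW float32 frames produced by the
# decode/convert/download pipeline sketched earlier (one per stream).
frames = [np.random.rand(3, 640, 640).astype(np.float32) for _ in range(4)]
batch = np.stack(frames)  # shape (4, 3, 640, 640)

inp = grpcclient.InferInput("images", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

# async_infer returns immediately; the result arrives via the callback,
# so the next batch can be decoded in parallel.
client.async_infer(model_name="detector", inputs=[inp], callback=on_result)
```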

SarthakGarg19 commented 1 year ago

Are there any updates on this?

rizwanishaq commented 1 year ago

Since Triton Inference Server has a Python backend, would it be possible to add VPF to the server with just a pip install or something similar?
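
A very rough sketch of what a model.py using VPF inside the Python backend could look like, assuming PyNvCodec is available in the backend's Python environment (e.g. pip-installed into the container image or packaged as a custom execution environment); the tensor names, 1920x1080 resolution, and H.264 codec are placeholder assumptions:

```python
import numpy as np
import triton_python_backend_utils as pb_utils
import PyNvCodec as nvc


class TritonPythonModel:
    def initialize(self, args):
        gpu_id = 0
        # Standalone decoder fed with raw H.264 packets; resolution,
        # codec, and pixel format are placeholder assumptions.
        self.nv_dec = nvc.PyNvDecoder(1920, 1080, nvc.PixelFormat.NV12,
                                      nvc.CudaVideoCodec.H264, gpu_id)
        self.downloader = nvc.PySurfaceDownloader(
            1920, 1080, nvc.PixelFormat.NV12, gpu_id)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Input: an encoded video packet as a 1-D UINT8 tensor.
            packet = pb_utils.get_input_tensor_by_name(
                request, "VIDEO_BYTES").as_numpy()
            surface = self.nv_dec.DecodeSurfaceFromPacket(packet)
            frame = np.ndarray(shape=(0,), dtype=np.uint8)
            if not surface.Empty():
                self.downloader.DownloadSingleSurface(surface, frame)
            out = pb_utils.Tensor("FRAME", frame)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```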