NVIDIA / tensorrt-laboratory

Explore the Capabilities of the TensorRT Platform
https://developer.nvidia.com/tensorrt
BSD 3-Clause "New" or "Revised" License

Back pressure for thread pool task queue #19

Closed: mrjackbo closed this issue 5 years ago

mrjackbo commented 5 years ago

Hi,

starting from the basic inference example, how would you advise implementing a simple back pressure mechanism? If I understand correctly, the task queue of the thread pool implementation that TensorRT Laboratory uses does not have an upper size limit. Thus, if my data ingest is much faster than inference, the program will eventually run out of memory, since the input tensors have to be captured by the lambdas sitting in the task queue. It would be nice to have an optional behavior where enqueue becomes blocking once the task queue reaches a certain size.
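
For illustration, here is a minimal sketch of the behavior I have in mind (hypothetical names, not trtlab's actual API): a queue whose `Push` blocks once a size limit is reached, so producers are throttled to the speed of the consumers.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical bounded task queue: Push() blocks once the queue holds
// max_size items, so fast producers back off to consumer speed.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t max_size) : max_size_(max_size) {}

  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return queue_.size() < max_size_; });
    queue_.push(std::move(item));
    not_empty_.notify_one();
  }

  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T item = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

 private:
  std::size_t max_size_;
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_full_, not_empty_;
};
```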

ryanolson commented 5 years ago

Great question.

The nvRPC examples, similar to TRTIS, have several limits on the depth of the work queues.

Does this help?

On the second part of your question, recognizing and reacting to queue depth:

Both TRTIS and some of the TRTLAB examples expose Prometheus metrics. The load ratio = request_time / compute_time is a nice way to gauge queue depth. Raw queue depth alone doesn't tell you what kind of model is in the queue: a deep queue of models that compute quickly means something different than a deep queue of a model that takes a long time to compute. The load ratio normalizes for the time the model takes to compute. You can use the load ratio and GPU energy consumption metrics to trigger horizontal autoscaling. Follow Issue #20 to track progress on this.
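
As a rough sketch of the metric (the helper names and the threshold are assumptions for illustration; the real examples record these timings via Prometheus):

```cpp
#include <chrono>

// Load ratio = request_time / compute_time.
// request_time covers the full request lifetime, including queue wait;
// compute_time covers inference alone. A ratio near 1.0 means requests
// barely queue; 3.0 means a request waits ~2x its compute time in queue.
double LoadRatio(std::chrono::duration<double> request_time,
                 std::chrono::duration<double> compute_time) {
  return request_time.count() / compute_time.count();
}

// Hypothetical autoscaling trigger; the threshold is a tunable assumption.
bool ShouldScaleOut(double load_ratio) {
  constexpr double kScaleOutThreshold = 2.0;
  return load_ratio > kScaleOutThreshold;
}
```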

One thing we don't do in either the LAB or TRTIS is check the gRPC deadline. We should do this. One reactive way to deal with queue depth is to simply start canceling expired requests on the server side. If you control the client, this is a good feedback signal that you need to grow the number of workers. Follow Issue #21 to track progress.
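
The server-side check could look something like this (a sketch against the standard gRPC C++ `ServerContext` API; where the worker loop calls it is an assumption):

```cpp
#include <chrono>
#include <grpcpp/grpcpp.h>

// Before spending GPU time on a dequeued request, drop it if the client
// has cancelled or its deadline has already passed.
bool ShouldDropRequest(grpc::ServerContext* context) {
  if (context->IsCancelled()) {
    return true;  // client gave up (includes deadline-exceeded)
  }
  return context->deadline() < std::chrono::system_clock::now();
}
```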

mrjackbo commented 5 years ago

Thanks! Yes, that helps a lot. I will use a fixed-size resource pool of buffers into which my IO threads write the input blobs.
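
Something along these lines (my own sketch, not trtlab's pool types): acquiring a buffer blocks when the pool is empty, so the IO threads stall instead of allocating without bound.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <vector>

// Fixed-size buffer pool: Acquire() blocks while all buffers are checked
// out, which gives the IO threads backpressure for free.
class BufferPool {
 public:
  BufferPool(std::size_t count, std::size_t buffer_bytes) {
    for (std::size_t i = 0; i < count; ++i)
      free_.push_back(std::make_unique<std::vector<char>>(buffer_bytes));
  }

  std::unique_ptr<std::vector<char>> Acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    available_.wait(lock, [this] { return !free_.empty(); });
    auto buf = std::move(free_.back());
    free_.pop_back();
    return buf;
  }

  void Release(std::unique_ptr<std::vector<char>> buf) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push_back(std::move(buf));
    available_.notify_one();
  }

 private:
  std::vector<std::unique_ptr<std::vector<char>>> free_;
  std::mutex mutex_;
  std::condition_variable available_;
};
```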

ryanolson commented 5 years ago

Let me know if you run into any problems or if we can improve the experience in any way.