Closed mrjackbo closed 5 years ago
Great question.

The nvRPC examples, similar to TRTIS, have several limits on the depth of the work queues:

```cpp
executor->RegisterContexts(rpcCompute, rpcResources, XXX);
```

The `InferRunner` has limits on the queue depth enforced by the `InferenceManager`. We provide a canonical example of what's happening in the `InferRunner`, which demonstrates where a call to the runtime might block:

```cpp
auto buffers = GetResources()->GetBuffers();
```

If you bypass the `Infer` method, then yes, you could get into a situation where your queue depth becomes unbounded and you run out of resources. If you are using a `Bindings` object, then you've had to acquire that `Bindings` object, which, similar to the canonical example, means you've been limited by the `InferenceManager` and the call may have blocked.

Does this help?
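To illustrate why acquiring a buffer can block, here is a minimal sketch of a fixed-size blocking resource pool. This is not the TRTLAB `InferenceManager` implementation; the `BufferPool` class and its `Pop`/`Push` names are hypothetical, but the mechanism (a bounded free list guarded by a condition variable) is the standard way such a call provides back pressure:

```cpp
// Hypothetical fixed-size buffer pool: Pop() blocks when all buffers are
// in flight, which is exactly what bounds the effective queue depth.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

class BufferPool {
 public:
  BufferPool(std::size_t count, std::size_t bytes) {
    for (std::size_t i = 0; i < count; ++i)
      free_.push_back(std::make_shared<std::vector<char>>(bytes));
  }

  // Blocks until a buffer is returned to the pool -- the back-pressure point.
  std::shared_ptr<std::vector<char>> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !free_.empty(); });
    auto buf = free_.front();
    free_.pop_front();
    return buf;
  }

  // Returns a buffer to the pool and wakes one blocked producer.
  void Push(std::shared_ptr<std::vector<char>> buf) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(std::move(buf));
    }
    cv_.notify_one();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::shared_ptr<std::vector<char>>> free_;
};
```

Because the pool size is fixed at construction, a producer that outruns inference ends up parked in `Pop()` rather than growing an unbounded queue.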
On the second part of your question, recognizing and reacting to queue depth:

Both TRTIS and some of the TRTLAB examples expose Prometheus metrics. The load ratio (`request_time / compute_time`) is a nice way to gauge queue depth. If you just measure queue depth, that doesn't help you distinguish the type of model in the queue: a large queue depth of models that don't take long to compute means something different than a queue depth of a model that takes a long time to compute. The load ratio normalizes that with respect to the time the model takes to compute. You can use the load ratio and GPU energy consumption metrics to trigger horizontal auto-scaling. Follow Issue #20 to track progress on this.
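As a concrete sketch of that normalization (the threshold value below is illustrative, not something from TRTLAB or TRTIS):

```cpp
// Load ratio = request_time / compute_time. A value near 1.0 means a
// request spends nearly all of its wall time computing; a large value
// means it spends most of its time waiting in the queue.
double LoadRatio(double request_time_s, double compute_time_s) {
  return request_time_s / compute_time_s;
}

// Example scaling trigger: the threshold of 2.0 is an arbitrary choice,
// meaning "requests spend as long queued as they do computing".
bool ShouldScaleOut(double request_time_s, double compute_time_s,
                    double threshold = 2.0) {
  return LoadRatio(request_time_s, compute_time_s) > threshold;
}
```

A queue of ten fast models and a queue of two slow ones can produce the same load ratio, which is the point: the signal reflects waiting relative to work, not raw queue length.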
One thing we don't do in either the LAB or TRTIS is check the gRPC deadline. We should do this. One reactive way to deal with queue depth is to simply start canceling requests on the server side. If you control the client, then this is a good feedback mechanism telling you that you need to grow the number of workers. Follow Issue #21 to track progress.
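The deadline idea can be sketched without pulling in gRPC itself: each request carries a client-set deadline, and the server drops requests whose deadline has already passed instead of spending GPU time on answers nobody will read. The `Request` struct here is a placeholder, not an nvRPC type; in real gRPC C++ you would consult `grpc::ServerContext::IsCancelled()` or the context's deadline instead:

```cpp
// Sketch of server-side deadline checking with a plain steady clock.
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical request envelope carrying the client's deadline.
struct Request {
  Clock::time_point deadline;
};

// Drop (skip) any request whose deadline already elapsed while it was
// sitting in the queue -- a simple reactive response to queue depth.
bool ShouldProcess(const Request& req, Clock::time_point now = Clock::now()) {
  return now < req.deadline;
}
```

Requests culled this way surface to the client as deadline-exceeded errors, which doubles as the feedback signal mentioned above.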
Thanks! Yes, that helps a lot. I will use a fixed-size resource pool of buffers, into which my IO threads write the input blobs.
Let me know if you run into any problems or if we can improve the experience in any way.
Hi,
starting from the basic inference example, how would you advise implementing a simple back-pressure mechanism? If I understand correctly, the task queue of the thread pool implementation that TRTLAB uses does not have an upper size limit. Thus, if, for example, my data ingest is much faster than inference, the program will eventually run out of memory, as the input tensors have to be captured by the lambdas in the task queue. It would be nice to have an optional behavior where enqueue becomes blocking once the task queue reaches a certain size.
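The optional behavior described here (blocking enqueue past a size limit) can be sketched as a bounded task queue. This is not the TRTLAB thread pool; `BoundedTaskQueue` and its method names are hypothetical, shown only to make the mechanism concrete:

```cpp
// A task queue whose Enqueue() blocks once max_size tasks are pending,
// so a fast producer stalls instead of exhausting memory.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>

class BoundedTaskQueue {
 public:
  explicit BoundedTaskQueue(std::size_t max_size) : max_size_(max_size) {}

  // Blocks the producer while the queue is full -- the back-pressure point.
  void Enqueue(std::function<void()> task) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return tasks_.size() < max_size_; });
    tasks_.push(std::move(task));
    not_empty_.notify_one();
  }

  // Blocks a worker while the queue is empty; returns the next task.
  std::function<void()> Dequeue() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !tasks_.empty(); });
    auto task = std::move(tasks_.front());
    tasks_.pop();
    not_full_.notify_one();
    return task;
  }

 private:
  const std::size_t max_size_;
  std::mutex mutex_;
  std::condition_variable not_full_, not_empty_;
  std::queue<std::function<void()>> tasks_;
};
```

With the lambdas (and the tensors they capture) capped at `max_size`, memory use is bounded by queue capacity rather than by ingest rate.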