Closed mrjackbo closed 5 years ago
Great question.

The nvRPC examples, similar to TRTIS, have several limits on the depth of the work queues:

```cpp
executor->RegisterContexts(rpcCompute, rpcResources, XXX);
```

The `InferRunner` has limits on the queue depth enforced by the `InferenceManager`. We provide a canonical example of what's happening in the `InferRunner`, which demonstrates where a call to the runtime might block:

```cpp
auto buffers = GetResources()->GetBuffers();
```

If you bypass the `Infer` method, then yes, you could get into a situation where your queue depth becomes unbounded and you run out of resources. If you are using a `Bindings` object, then you've had to acquire that `Bindings` object, which, similar to the canonical example, means you've been limited by the `InferenceManager` and the call may have blocked.

Does this help?
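To illustrate why acquiring a buffer can block, here is a minimal sketch of a fixed-size blocking resource pool. This is not the TRTLAB `InferenceManager` implementation; the `BufferPool` class and its `Pop`/`Push` names are hypothetical, but the mechanism (a bounded free list guarded by a condition variable) is the standard way such a call provides back pressure:

```cpp
// Hypothetical fixed-size buffer pool: Pop() blocks when all buffers are
// in flight, which is exactly what bounds the effective queue depth.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

class BufferPool {
 public:
  BufferPool(std::size_t count, std::size_t bytes) {
    for (std::size_t i = 0; i < count; ++i)
      free_.push_back(std::make_shared<std::vector<char>>(bytes));
  }

  // Blocks until a buffer is returned to the pool -- the back-pressure point.
  std::shared_ptr<std::vector<char>> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !free_.empty(); });
    auto buf = free_.front();
    free_.pop_front();
    return buf;
  }

  // Returns a buffer to the pool and wakes one blocked producer.
  void Push(std::shared_ptr<std::vector<char>> buf) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(std::move(buf));
    }
    cv_.notify_one();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::shared_ptr<std::vector<char>>> free_;
};
```

Because the pool size is fixed at construction, a producer that outruns inference ends up parked in `Pop()` rather than growing an unbounded queue.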
On the second part of your question, recognizing and reacting to queue depth:

Both TRTIS and some of the TRTLAB examples expose Prometheus metrics. The load ratio (`request_time / compute_time`) is a nice way to gauge queue depth. If you just measure queue depth, that doesn't help you distinguish the type of model in the queue: a large queue depth of models that don't take long to compute means something different than a queue depth of a model that takes a long time to compute. The load ratio normalizes that with respect to the time the model takes to compute. You can use the load ratio and GPU energy consumption metrics to trigger horizontal auto-scaling. Follow Issue #20 to track progress on this.
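As a concrete sketch of that normalization (the threshold value below is illustrative, not something from TRTLAB or TRTIS):

```cpp
// Load ratio = request_time / compute_time. A value near 1.0 means a
// request spends nearly all of its wall time computing; a large value
// means it spends most of its time waiting in the queue.
double LoadRatio(double request_time_s, double compute_time_s) {
  return request_time_s / compute_time_s;
}

// Example scaling trigger: the threshold of 2.0 is an arbitrary choice,
// meaning "requests spend as long queued as they do computing".
bool ShouldScaleOut(double request_time_s, double compute_time_s,
                    double threshold = 2.0) {
  return LoadRatio(request_time_s, compute_time_s) > threshold;
}
```

A queue of ten fast models and a queue of two slow ones can produce the same load ratio, which is the point: the signal reflects waiting relative to work, not raw queue length.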
One thing we don't do in either the LAB or TRTIS is check the gRPC deadline. We should do this. One reactive way to deal with queue depth is to simply start canceling requests on the server side. If you control the client, then this is a good feedback mechanism telling you that you need to grow the number of workers. Follow Issue #21 to track progress.
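The deadline idea can be sketched without pulling in gRPC itself: each request carries a client-set deadline, and the server drops requests whose deadline has already passed instead of spending GPU time on answers nobody will read. The `Request` struct here is a placeholder, not an nvRPC type; in real gRPC C++ you would consult `grpc::ServerContext::IsCancelled()` or the context's deadline instead:

```cpp
// Sketch of server-side deadline checking with a plain steady clock.
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical request envelope carrying the client's deadline.
struct Request {
  Clock::time_point deadline;
};

// Drop (skip) any request whose deadline already elapsed while it was
// sitting in the queue -- a simple reactive response to queue depth.
bool ShouldProcess(const Request& req, Clock::time_point now = Clock::now()) {
  return now < req.deadline;
}
```

Requests culled this way surface to the client as deadline-exceeded errors, which doubles as the feedback signal mentioned above.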
Thanks! Yes, that helps a lot. I will use a fixed-size resource pool of buffers, into which my IO threads write the input blobs.
Let me know if you run into any problems or if we can improve the experience in any way.
Hi,
starting from the basic inference example, how would you advise implementing a simple back-pressure mechanism? If I understand correctly, the task queue of the thread pool implementation that TRTLAB uses does not have an upper size limit. Thus, if, for example, my data ingest is much faster than inference, the program will eventually run out of memory, as the input tensors have to be captured by the lambdas in the task queue. It would be nice to have an optional behavior where enqueue becomes blocking once the task queue reaches a certain size.
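The optional behavior described here (blocking enqueue past a size limit) can be sketched as a bounded task queue. This is not the TRTLAB thread pool; `BoundedTaskQueue` and its method names are hypothetical, shown only to make the mechanism concrete:

```cpp
// A task queue whose Enqueue() blocks once max_size tasks are pending,
// so a fast producer stalls instead of exhausting memory.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>

class BoundedTaskQueue {
 public:
  explicit BoundedTaskQueue(std::size_t max_size) : max_size_(max_size) {}

  // Blocks the producer while the queue is full -- the back-pressure point.
  void Enqueue(std::function<void()> task) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return tasks_.size() < max_size_; });
    tasks_.push(std::move(task));
    not_empty_.notify_one();
  }

  // Blocks a worker while the queue is empty; returns the next task.
  std::function<void()> Dequeue() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !tasks_.empty(); });
    auto task = std::move(tasks_.front());
    tasks_.pop();
    not_full_.notify_one();
    return task;
  }

 private:
  const std::size_t max_size_;
  std::mutex mutex_;
  std::condition_variable not_full_, not_empty_;
  std::queue<std::function<void()>> tasks_;
};
```

With the lambdas (and the tensors they capture) capped at `max_size`, memory use is bounded by queue capacity rather than by ingest rate.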