John-ReleaseVersion opened this issue 5 days ago (status: Open)
Maybe you need to pay attention to the CPU pthread state in the nsys profile UI.
It is not additional time caused by multi-threaded context switching.
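For reference, a typical Nsight Systems capture that includes the OS runtime (pthread) trace looks like this; `./your_app` is a placeholder for the actual inference binary:

```shell
# Capture a timeline with CUDA API calls and OS runtime (pthread) events;
# open the resulting .nsys-rep file in the Nsight Systems UI to inspect
# per-thread CPU state (running vs. blocked).
nsys profile --trace=cuda,osrt -o report ./your_app
```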
I just discovered that there is also an increase in inference execution time with multiple processes.
If you want to run inference from multiple processes, you need to use MPS to avoid time-slicing between CUDA contexts.
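For anyone else landing here: MPS is started through its control daemon (commands per the CUDA MPS documentation; they require an NVIDIA driver, and on pre-Volta GPUs the device should be in EXCLUSIVE_PROCESS compute mode):

```shell
# Start the MPS control daemon; client processes launched afterwards
# share a single GPU context instead of time-slicing between contexts.
nvidia-cuda-mps-control -d

# ... run the inference processes here ...

# Shut the daemon down when done.
echo quit | nvidia-cuda-mps-control
```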
First of all, thank you for your help. I also tried MPS as you suggested, and the results did not change. In recent attempts I found that when running inference on a single model, enqueueV2 for the first image takes the longest, and subsequent calls get faster. I suspect it may be a problem of frequent switching between multiple model inferences, and I will try to verify that later.
the first image enqueueV2 takes the longest time
The first call needs to initialize CUDA resources; that is what warmup is for.
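The warmup pattern can be sketched generically. This is a hypothetical helper (not from the issue); `infer` stands in for whatever wraps the enqueueV2 call:

```cpp
#include <chrono>
#include <functional>

// Run `warmup` untimed calls first so one-time CUDA initialization
// (lazy context creation, module loading) does not skew the result,
// then return the average latency of `iters` timed calls, in microseconds.
double timedAverageUs(const std::function<void()>& infer, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) infer();  // discarded warmup calls
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) infer();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}
```

In the real program, `infer` would wrap enqueueV2 plus a stream synchronize, so the measured number reflects completed GPU work rather than just enqueue latency.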
Description
I used the TensorRT C++ API and found that inference performance actually decreases in multi-threaded situations.
For example, for a single inference of one image, the execution time of enqueue is 1 ms, and the total time for 20 sequential inferences is 20 ms.
However, if 20 threads each perform one inference, the execution time of a single enqueue grows to 10 ms.
The problem is the same as in this Stack Overflow question, which has not been answered:
https://stackoverflow.com/questions/77429593/why-does-tensorrt-enqueuev2-take-longer-time-when-using-more-isolated-threads-in
Environment
OS: Ubuntu 22.04
CUDA: 12.2
TensorRT: 8.6.1.6
OpenCV: 4.8.0
Code
Single Run
Multi Run
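The original code blocks were not captured here. As a stand-in, this is a minimal, self-contained sketch of the two measurement patterns; a 1 ms sleep replaces the real enqueueV2 call so it compiles without TensorRT:

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Placeholder for context->enqueueV2(bindings, stream, nullptr).
// In the real program each thread would own its own IExecutionContext
// and cudaStream_t rather than sharing one.
void fakeEnqueue() {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// "Single Run": time N sequential calls on one thread.
double singleRunMs(int n) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) fakeEnqueue();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// "Multi Run": time N threads each performing one call.
double multiRunMs(int n) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) workers.emplace_back(fakeEnqueue);
    for (auto& w : workers) w.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

With a real engine, the per-call latency inside multiRunMs is where the reported 1 ms → 10 ms growth would show up.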
What I have tried
At first, I suspected an asynchronous stream issue, but after switching to synchronous execution the problem remained, so that was ruled out. Then I suspected contention on a shared resource, but that was not it either. Could it be a problem with frequent CUDA context switching?
What I am expecting
I hope to improve the efficiency of executing enqueue from multiple threads.