NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

[Question] Why does TensorRT enqueueV2 take longer time when using more isolated threads in C++? #4171

Open John-ReleaseVersion opened 5 days ago

John-ReleaseVersion commented 5 days ago

Description

I am using the TensorRT C++ API and found that inference performance actually decreases in multi-threaded situations.

For example, when a single thread infers one image at a time, each enqueue call takes 1 ms, so 20 sequential inferences take 20 ms in total.

However, if 20 separate threads each run one inference, a single enqueue call takes about 10 ms.

The problem is the same as the question below on Stack Overflow, which has not been answered.

https://stackoverflow.com/questions/77429593/why-does-tensorrt-enqueuev2-take-longer-time-when-using-more-isolated-threads-in

Environment

OS: Ubuntu 22.04
CUDA: 12.2
TensorRT: 8.6.1.6
OpenCV: 4.8.0

Code


#include <iostream>

#include "EngineInfer.h"
#include <chrono>
#include <thread>
#include <functional> // std::bind, std::ref
#include <string>

std::string engine_path = "../models/yolov5s.engine";
std::string image_path = "../images/src.jpg";

int main(int argc, char **argv)
{

    auto start_p = std::chrono::system_clock::now();
    auto end_p = std::chrono::system_clock::now();
    using namespace std;
    DEBUG_LOG("Hello World!");

    constexpr int threadNum = 20;        // compile-time constant so the arrays below have a fixed size
    bool is_async[threadNum] = {false};  // per-thread completion flags
    EngineInfer infers[threadNum];       // one engine/context wrapper per thread
    for (int i = 0; i < threadNum; i++)
    {
        infers[i].init(engine_path.c_str());
        infers[i].setImage(image_path.c_str());
    }
    auto task = [&](int idx, EngineInfer &infer)  // take the infer object by reference to avoid copying GPU resources
    {
        infer.infer();
        infer.getResult();
        infer.saveImage(string("res" + std::to_string(idx) + ".jpg").c_str());
        is_async[idx] = true;
    };

    start_p = std::chrono::system_clock::now();

    for (int i = 0; i < threadNum; i++)
    {
        auto bound_task = std::bind(task, i, std::ref(infers[i]));  // std::ref so each thread works on the original object, not a copy
        thread th(bound_task);
        th.detach();
    }
    for (int i = 0; i < threadNum; i++)
    {
        // simple spin-wait until worker i signals completion
        cout << "waiting " << i;
        while (!is_async[i])
        {
            cout << "*";
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        cout << endl;
    }
    end_p = std::chrono::system_clock::now();  // stop the clock once all workers have finished

    for (int i = 0; i < threadNum; i++)
    {
        infers[i].release();
    }

    INFO_LOG("sum time = %d ms", std::chrono::duration_cast<std::chrono::milliseconds>(end_p - start_p).count());

    INFO_LOG("Finished!");
    return 0;
}

int EngineInfer::infer()
{
    using namespace nvinfer1;

    cudaError_t cudaErrorCode;

    cudaStreamSynchronize(stream);
    cudaErrorCode = cudaMemcpyAsync(gpu_buffers[0], img_buffer_device, imageSize, cudaMemcpyDeviceToDevice, stream);
    if (cudaErrorCode != cudaSuccess)
    {
        std::cerr << "CUDA error " << cudaErrorCode << " at " << __FILE__ << ":" << __LINE__;
        return -1;
    }
    cudaStreamSynchronize(stream);

    // bool isSuccess = context->enqueue(1, (void *const *)gpu_buffers, stream, nullptr);
    auto start = std::chrono::system_clock::now();
    bool isSuccess = context->enqueueV2((void *const *)gpu_buffers, stream, nullptr);
    auto end = std::chrono::system_clock::now();
    INFO_LOG("one infer spend =%d ms", std::chrono::duration_cast<std::chrono::milliseconds>(start - end).count());

    // bool isSuccess = context->enqueueV3(gpu_buffers);
    if (!isSuccess)
    {
        ERROR_LOG("Infer error ");
        return -1;
    }
    cudaStreamSynchronize(stream);
    cudaErrorCode = cudaMemcpyAsync(cpu_output_buffer, gpu_buffers[1], 1 * kOutputSize * sizeof(float),
                                    cudaMemcpyDeviceToHost, stream);
    if (cudaErrorCode != cudaSuccess)
    {
        ERROR_LOG("CUDA error");
        return -1;
    }
    cudaStreamSynchronize(stream);

    return 0;
}
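A side note on this measurement: enqueueV2 only submits work to the stream, so the wall-clock time around the call mainly reflects submission cost rather than GPU execution time. Below is a minimal sketch of how the timed region above could instead measure GPU time with CUDA events (it assumes the same context, gpu_buffers and stream members; it is not taken from the original code):

cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart, stream);  // mark the start on this stream
bool isSuccess = context->enqueueV2((void *const *)gpu_buffers, stream, nullptr);
cudaEventRecord(evStop, stream);   // mark the end on this stream
cudaEventSynchronize(evStop);      // wait until the GPU has passed evStop

if (isSuccess)
{
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, evStart, evStop);  // elapsed GPU time in milliseconds
    INFO_LOG("one infer GPU time = %.2f ms", gpuMs);
}

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);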

Timing summary

Single Run

one infer spend =11 ms

Multi Run

[INFO ] one infer spend =88 ms

What I have tried

At first I suspected an issue with the asynchronous stream, but the problem remained after switching to synchronous execution, so that was not it. Then I suspected contention for a shared resource, but that was not it either. My current guess is that frequent CUDA context switching might be the cause.

What I am expecting

I would like to improve the efficiency of enqueue when it is called from multiple threads.

lix19937 commented 5 days ago

Maybe you need to pay attention to the CPU pthread state in the nsys profile UI.
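For example, something like nsys profile --trace=cuda,osrt -o report ./your_app (the binary name is a placeholder) records both the CUDA API calls and the OS runtime events, and the per-thread rows in the timeline show how much time each pthread spends blocked rather than running.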

John-ReleaseVersion commented 5 days ago

Maybe you need to pay attention to the CPU pthread state in the nsys profile UI.

It is not additional time caused by thread switching.

I just discovered that inference execution time also increases when using multiple processes.

lix19937 commented 5 days ago

If you want to use multiple processes for inference, you need to use MPS to avoid time-slicing of the CUDA contexts.
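(For reference, the MPS control daemon is typically started with nvidia-cuda-mps-control -d and stopped with echo quit | nvidia-cuda-mps-control; selecting the GPU with CUDA_VISIBLE_DEVICES beforehand is assumed here.)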

John-ReleaseVersion commented 5 days ago

If you want to use multiple processes for inference, you need to use MPS to avoid time-slicing of the CUDA contexts.

First of all, thank you for your help. I have also tried MPS as you mentioned, and the results did not change. In recent tests I found that when running inference with a single model, enqueueV2 for the first image takes the longest, and subsequent calls get faster. I suspect the problem may be frequent switching between multiple model inferences, and I plan to verify that later.

lix19937 commented 5 days ago

enqueueV2 for the first image takes the longest

The first call needs to initialize CUDA resources; that is what warmup is for.
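In other words, run a few untimed inferences per context before measuring. A minimal sketch, reusing the EngineInfer objects from the code above (the iteration count is arbitrary):

constexpr int kWarmupIters = 5;
for (int i = 0; i < threadNum; i++)
{
    for (int w = 0; w < kWarmupIters; w++)
    {
        infers[i].infer();  // untimed runs pay the one-time CUDA/TensorRT initialization cost
    }
}
// start the timed single- or multi-threaded measurement only after this loop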