TensorRT 9.3 Custom plugins appear to be strangely time-consuming

demuxin commented 2 months ago

Description

I implemented a TensorRT plugin and found that it takes an unexpectedly long time to run.

I compile the plugin as a separate library and load it through the C++ API.

void* plugin_handle{ builder->getPluginRegistry().loadLibrary(pluginlib_path_.c_str()) };
// or
void* plugin_handle{ runtime->getPluginRegistry().loadLibrary(pluginlib_path_.c_str()) };

I call cudaStreamSynchronize at the beginning of the enqueue function, and measured that call to take about 165 ms.

int32_t NmsdetaIPluginV2DynamicExt::enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc, void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) IS_NOEXCEPT
{
    time_point<system_clock> m_begin = system_clock::now();

    cudaStreamSynchronize(stream);

    printf("--->> plugin: %ld, %d\n", duration_cast<microseconds>(system_clock::now() - m_begin).count(), __LINE__);
    m_begin = system_clock::now();

   ...
}

How can I solve this issue? Please offer some advice.

Environment

TensorRT Version: 9.3

NVIDIA GPU: GeForce RTX 3090

NVIDIA Driver Version: 535.183.01

CUDA Version: 12.2

cuDNN Version: 8.9.6

Operating System: Ubuntu 22.04

lix19937 commented 2 months ago

How about using the following function for timing?

std::chrono::high_resolution_clock::now();

demuxin commented 2 months ago

Hi @lix19937, the results are the same, so the timing method shouldn't be the problem.

Do you have any other suggestions?

demuxin commented 2 months ago

I can also provide the plugin code.

NmsdetaIPluginV2DynamicExt.h.txt NmsdetaIPluginV2DynamicExt.cpp.txt

lix19937 commented 2 months ago

From your plugin.cpp:

1. You do not need to synchronize the stream. Just use cudaMemcpyAsync with your local pinned host pointer (the one associated with the disk file opened in the init phase).
2. Your method of timing cudaStreamSynchronize is wrong: the time you measured is not the cost of the cudaStreamSynchronize call itself. cudaStreamSynchronize() blocks the host until all previously issued asynchronous operations in the specified stream have completed, so the interval you measured is the time spent waiting for that earlier work to finish.
3. What about the time from the network input node to the NmsDeta node? You can try measuring it.
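Point 2 can be illustrated with a CPU-only analogy. The sketch below is not CUDA code: it assumes std::async as a stand-in for work already queued on the stream, and the names launchUpstreamWork and timeBlockingWait are hypothetical. It shows that a timer wrapped around a blocking wait reports how long the earlier work took, not the cost of the wait call.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for work that upstream layers have already queued.
// On a real stream this would be GPU kernels; here it is a sleeping thread.
std::future<void> launchUpstreamWork(std::chrono::milliseconds cost) {
    return std::async(std::launch::async,
                      [cost] { std::this_thread::sleep_for(cost); });
}

// Mimics the timing done at the top of enqueue(): start a clock, block until
// the pending work finishes, and return the elapsed milliseconds.
long long timeBlockingWait(std::future<void>& pending) {
    assert(pending.valid());
    auto const begin = Clock::now();
    pending.wait();  // plays the role of cudaStreamSynchronize(stream)
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now() - begin).count();
}
```

Calling timeBlockingWait on a future returned by launchUpstreamWork(std::chrono::milliseconds(50)) reports roughly the 50 ms of pending work, even though the wait itself does nothing expensive.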

demuxin commented 2 months ago

Thanks @lix19937, I know cudaStreamSynchronize is not necessary; I just want to measure how long this plugin takes.

According to your explanation, the 165 ms is actually the elapsed time of the nodes before NmsDeta, right?

And how can I measure the time from the network input node to the NmsDeta node, or the elapsed time of every node in the model?

Thank you again for your prompt reply.

lix19937 commented 2 months ago

Replace your enqueue implementation with the following code.

int32_t NmsdetaIPluginV2DynamicExt::enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc, void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) IS_NOEXCEPT
{
    // Empty body: the plugin does no work, so any remaining time belongs to the other layers.
    return 0;
}

Then run the following trtexec command and upload build.log.

    ./trtexec --onnx=$ONNX_filename \
    --saveEngine=$ONNX_filename.plan \
    --verbose \
    --dumpProfile \
    --noDataTransfers \
    --useCudaGraph \
    --useSpinWait  \
    --separateProfileRun \
    2>&1 | tee -a build.log
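For per-layer timing inside your own application rather than through trtexec, TensorRT exposes the nvinfer1::IProfiler interface, whose reportLayerTime(char const* layerName, float ms) callback is invoked for each layer once a profiler is attached with IExecutionContext::setProfiler. The sketch below is only the accumulation logic, kept standalone so it compiles without TensorRT: the class name LayerTimeTable and its slowestFirst method are hypothetical, and in a real build the class would derive from nvinfer1::IProfiler and override reportLayerTime.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper that accumulates per-layer times.
// Assumption: in a real build this would derive from nvinfer1::IProfiler,
// and reportLayerTime would be the override TensorRT calls per layer.
class LayerTimeTable {
public:
    using Row = std::pair<std::string, float>;

    // Same parameter list as the IProfiler callback: layer name and time in ms.
    void reportLayerTime(char const* layerName, float ms) noexcept {
        assert(layerName != nullptr);
        try {
            totalsMs_[layerName] += ms;  // sum over repeated inferences
        } catch (...) {
            // Drop the sample on allocation failure; the callback must not throw.
        }
    }

    // Returns (layer, total ms) rows sorted with the slowest layer first.
    std::vector<Row> slowestFirst() const {
        std::vector<Row> rows(totalsMs_.begin(), totalsMs_.end());
        std::sort(rows.begin(), rows.end(),
                  [](Row const& a, Row const& b) { return a.second > b.second; });
        return rows;
    }

private:
    std::map<std::string, float> totalsMs_;
};
```

Sorting by total time makes it easy to see whether the NmsDeta plugin or an upstream layer accounts for most of the latency.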

demuxin commented 2 months ago

Thanks.

demuxin commented 2 months ago

Hi @lix19937, can polygraphy run load a custom plugin?

lix19937 commented 2 months ago

Yes, it is supported via the --plugins option.

NVIDIA / TensorRT

TensorRT 9.3 Custom plugins appear to be strangely time-consuming #4018