NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

TensorRT 9.3 Custom plugins appear to be strangely time-consuming #4018

Closed demuxin closed 2 months ago

demuxin commented 2 months ago

Description

I implemented a TensorRT plugin and found the plugin to be particularly time-consuming.

I compile the plugin as a separate library and then load it through the C++ API.

void* plugin_handle{ builder->getPluginRegistry().loadLibrary(pluginlib_path_.c_str()) };
// or
void* plugin_handle{ runtime->getPluginRegistry().loadLibrary(pluginlib_path_.c_str()) };

I called cudaStreamSynchronize at the beginning of the enqueue function and measured it to take about 165 ms.

int32_t NmsdetaIPluginV2DynamicExt::enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc, void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) IS_NOEXCEPT
{
    time_point<system_clock> m_begin = system_clock::now();

    cudaStreamSynchronize(stream);

    printf("--->> plugin: %ld, %d\n", duration_cast<microseconds>(system_clock::now() - m_begin).count(), __LINE__);
    m_begin = system_clock::now();

   ...
}

How can I solve this issue? Please offer some advice.

Environment

TensorRT Version: 9.3

NVIDIA GPU: GeForce RTX 3090

NVIDIA Driver Version: 535.183.01

CUDA Version: 12.2

CUDNN Version: 8.9.6

Operating System: Ubuntu 22.04

lix19937 commented 2 months ago

How about using the following function for timing?

std::chrono::high_resolution_clock::now();

demuxin commented 2 months ago

Hi @lix19937, the results are the same, so the timing method shouldn't be the problem.

Do you have any other suggestions?

demuxin commented 2 months ago

I can also provide the plugin code.

NmsdetaIPluginV2DynamicExt.h.txt NmsdetaIPluginV2DynamicExt.cpp.txt


lix19937 commented 2 months ago

From your plugin.cpp,

demuxin commented 2 months ago

Thanks @lix19937, I know cudaStreamSynchronize is not necessary; I just want to measure how long this plugin takes.

According to your statement, the 165ms is actually the elapsed time of the node before NmsDeta, right?

Also, how can I measure the time from the network input node to the NmsDeta node, or the elapsed time of every node in the model?

Thank you again for your prompt reply.
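
On the first question: enqueue normally only queues work on the stream, so a cudaStreamSynchronize at the top of the plugin's enqueue waits for everything queued ahead of it, and the measured time belongs to those earlier layers rather than to the plugin. A host-only analogy of that effect is sketched below. It makes no CUDA calls: std::async stands in for the stream, the 50 ms sleep is an assumed stand-in for the upstream layers, and upstreamWork and measureSyncWaitMs are hypothetical names.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for layers already queued ahead of the plugin (assumed 50 ms).
inline void upstreamWork() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

// Mimics timing a synchronize call placed at the top of enqueue:
// the wait covers the upstream work, not anything the plugin does.
inline long long measureSyncWaitMs() {
    // Launch the "earlier layers" asynchronously, like kernels on a stream.
    std::future<void> pending = std::async(std::launch::async, upstreamWork);

    auto const begin = Clock::now();
    pending.wait();  // plays the role of cudaStreamSynchronize(stream)
    auto const elapsed = Clock::now() - begin;

    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

measureSyncWaitMs() reports roughly the full upstream duration even though the "plugin" did nothing. To time the plugin alone, the usual options are CUDA events recorded around its own kernels (cudaEventRecord and cudaEventElapsedTime) or a per-layer profile.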

lix19937 commented 2 months ago

Replace your enqueue implementation with the following code.


int32_t NmsdetaIPluginV2DynamicExt::enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc, void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) IS_NOEXCEPT
{
    // Temporary no-op for profiling: launch nothing and report success.
    return 0;
}

Then run the following trtexec command and upload build.log.

    ./trtexec --onnx=$ONNX_filename \
    --saveEngine=$ONNX_filename.plan \
    --verbose \
    --dumpProfile \
    --noDataTransfers \
    --useCudaGraph \
    --useSpinWait  \
    --separateProfileRun \
    2>&1 | tee -a build.log
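
If the per-layer numbers have to be collected inside your own application instead of trtexec, TensorRT reports them through nvinfer1::IProfiler: once a profiler is attached with IExecutionContext::setProfiler, the runtime calls reportLayerTime(layerName, ms) for each layer. Below is a minimal host-only sketch of the bookkeeping such a callback could do. LayerTimeAccumulator and its members are hypothetical names, and the class deliberately does not include NvInfer.h; in a real build it would derive from nvinfer1::IProfiler and override reportLayerTime.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical helper, not part of TensorRT. reportLayerTime() mirrors the
// signature of nvinfer1::IProfiler::reportLayerTime so the body can be reused
// in a class that derives from nvinfer1::IProfiler.
class LayerTimeAccumulator {
public:
    // Called once per layer for every profiled inference.
    void reportLayerTime(char const* layerName, float ms) noexcept {
        Entry& e = entries_[layerName];
        e.totalMs += ms;
        e.calls += 1;
    }

    // Average time of one layer in milliseconds; 0 if it was never reported.
    float averageMs(std::string const& layerName) const {
        auto it = entries_.find(layerName);
        if (it == entries_.end() || it->second.calls == 0) {
            return 0.0f;
        }
        return it->second.totalMs / static_cast<float>(it->second.calls);
    }

    // Prints one line per layer with its average time.
    void print() const {
        for (auto const& kv : entries_) {
            std::printf("%-40s %8.3f ms (avg over %d runs)\n",
                        kv.first.c_str(),
                        kv.second.totalMs / static_cast<float>(kv.second.calls),
                        kv.second.calls);
        }
    }

private:
    struct Entry {
        float totalMs = 0.0f;
        int calls = 0;
    };
    std::map<std::string, Entry> entries_;
};
```

After a few profiled inferences, print() lists every layer with its average time, so the NmsDeta entry can be compared directly against the layers that run before it.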

demuxin commented 2 months ago

Thanks.

demuxin commented 2 months ago

Hi @lix19937, can polygraphy run load a custom plugin?

lix19937 commented 2 months ago

Yes, it is supported via the --plugins option.