NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Why tensorrt’s performance is poor after adding custom op #4029

Open Shaquille-Wu opened 3 months ago

Shaquille-Wu commented 3 months ago

Hi, TRT experts:

I have a custom op that is not supported by TensorRT, so I added it to TensorRT as a plugin, and I found the total cost time increased by about 10 ms. My test is as follows:

  1. I remove this custom op from my ONNX file and export it as a .plan file through trtexec; the cost of the whole network is about 50 ms.
  2. I add this custom op (it just cudaMemcpy's a little data) into my ONNX file and export it as a .plan file through trtexec; the cost of the whole network is about 60 ms. Even if I make my code return directly in the enqueue function, the cost of the whole network is still about 60 ms. The code looks like this:

int MyPluginDynamic::enqueue(const nvinfer1::PluginTensorDesc* inputDesc, const nvinfer1::PluginTensorDesc* outputDesc,
    const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT
{
    return 0; // return directly
}

I don't know why TRT's performance gets worse after I add such a small custom op. My guesses:

  1. there is some secret about TRT that I don't know;
  2. my op introduces extra overhead that I'm not aware of. So, would anyone like to teach me this secret?
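(For reference, a pass-through version of the same enqueue that actually performs the small copy on the stream TensorRT passes in would look roughly like the sketch below; the single float input/output and the omitted class declaration are assumptions, and noexcept stands in for the TRT_NOEXCEPT macro.)

#include <cuda_runtime_api.h>
#include <NvInferRuntime.h>

// Sketch only: copy input 0 to output 0 on the provided stream.
int MyPluginDynamic::enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
    const nvinfer1::PluginTensorDesc* outputDesc, const void* const* inputs,
    void* const* outputs, void* workspace, cudaStream_t stream) noexcept
{
    size_t count = 1;
    for (int i = 0; i < inputDesc[0].dims.nbDims; ++i)
    {
        count *= inputDesc[0].dims.d[i];
    }
    // Asynchronous device-to-device copy on TensorRT's stream; no extra synchronization.
    cudaMemcpyAsync(outputs[0], inputs[0], count * sizeof(float),
        cudaMemcpyDeviceToDevice, stream);
    return 0;
}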
lix19937 commented 3 months ago

Can you upload the build.log produced by trtexec --onnx=spec --verbose --plugins=spec 2>&1 | tee build.log ?

Shaquille-Wu commented 3 months ago

Can you upload the build.log produced by trtexec --onnx=spec --verbose --plugins=spec 2>&1 | tee build.log ?

Thanks for your help, I've uploaded the logs:

  1. without custom_op: build.log
  2. with custom_op: custom_op_build.log

check them please


lix19937 commented 3 months ago

Checking.

Please use the following command to add more info:

trtexec --onnx=spec.onnx   --verbose --saveEngine=spec.plan  \
--dumpProfile --dumpLayerInfo --separateProfileRun \
--noDataTransfers --useCudaGraph --useSpinWait   | tee log
Shaquille-Wu commented 2 months ago

Checking.

Please use the following command to add more info:

trtexec --onnx=spec.onnx   --verbose --saveEngine=spec.plan  \
--dumpProfile --dumpLayerInfo --separateProfileRun \
--noDataTransfers --useCudaGraph --useSpinWait   | tee log

Thanks for checking. I've regenerated the logs, please check them:

build_profile.log build_profile_custom_op.log

lix19937 commented 2 months ago

Because you only implemented part of the ops as a plugin, it breaks the fusion; TRT natively builds those ops into some ForeignNodes, which is better than your plugin + native ops.

lix19937 commented 2 months ago

[image: per-layer comparison of the two engine builds]

Obviously, there are a large number of unfused layers on the left side.
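
To reproduce this comparison without the screenshot, you can dump each engine's per-layer information with the engine inspector; a minimal sketch, assuming the serialized engine is named spec.plan as in the trtexec command above:

#include <NvInferRuntime.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main()
{
    Logger logger;

    // Engine produced by the trtexec command above (file name is an assumption).
    // If the engine contains your plugin, load/register the plugin library first.
    std::ifstream file("spec.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(blob.data(), blob.size());

    // Dump per-layer information; fused subgraphs show up as ForeignNode (Myelin) layers,
    // while broken fusion shows up as many small individual layers.
    nvinfer1::IEngineInspector* inspector = engine->createEngineInspector();
    std::cout << inspector->getEngineInformation(nvinfer1::LayerInformationFormat::kJSON)
              << std::endl;

    delete inspector;
    delete engine;
    delete runtime;
    return 0;
}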

Shaquille-Wu commented 2 months ago

Because you only implemented part of the ops as a plugin, it breaks the fusion; TRT natively builds those ops into some ForeignNodes, which is better than your plugin + native ops.

How can I enable that "fusion" if I add a custom op? Do you mean I should add my custom op into the TRT source code and recompile TRT? Would you like to give me more details? For example, what is a ForeignNode? How can I add a custom op into a ForeignNode?

lix19937 commented 2 months ago

You can first use onnx-simplifier or Polygraphy to optimize your ONNX, then try to expand the scope of the custom plugin and just compile a custom-plugin .so. Like the following sample,

https://github.com/NVIDIA/TensorRT/tree/release/10.2/plugin can be built as a lib.

Shaquille-Wu commented 2 months ago

I executed onnx-simplifier before onnx2trt, and I added my custom op into the ONNX after onnx-simplifier and before onnx2trt, so I think my ONNX graph is already a simplified graph. I didn't find any outstanding difference between my custom op plugin and TRT's official plugins. I still cannot understand why TRT's official plugins can enable the "fusion". Why? Do you mean I must add my custom op plugin into TRT's source code and recompile it?

lix19937 commented 2 months ago

I still cannot understand why TRT's official plugins can enable the "fusion". Why?

The TRT-native part of the graph is built by Myelin.

Do you mean I must add my custom op plugin into TRT's source code and recompile it?

You can build (compile) a custom_plugin.so by following the TRT OSS plugin sample.
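
As a quick sanity check that the custom plugin .so actually registers its creator, something like the sketch below can be used (the library name libmy_custom_plugin.so is a placeholder, and dlopen assumes Linux; link against libnvinfer and -ldl):

#include <NvInferRuntime.h>
#include <dlfcn.h>
#include <iostream>

int main()
{
    // Placeholder library name; build it from the OSS plugin tree or your own target.
    void* handle = dlopen("libmy_custom_plugin.so", RTLD_LAZY);
    if (handle == nullptr)
    {
        std::cerr << "failed to load plugin library: " << dlerror() << std::endl;
        return 1;
    }

    // List every creator now visible in the global plugin registry; your custom op
    // should appear here if its creator registered itself when the library loaded.
    int32_t numCreators = 0;
    nvinfer1::IPluginCreator* const* creators =
        getPluginRegistry()->getPluginCreatorList(&numCreators);
    for (int32_t i = 0; i < numCreators; ++i)
    {
        std::cout << creators[i]->getPluginName() << " v"
                  << creators[i]->getPluginVersion() << std::endl;
    }
    return 0;
}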