NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

There is no speed up with trt model compared with pytorch. #1925

Closed · Milesld closed this issue 2 years ago

Milesld commented 2 years ago

Hi, thanks for your great work. I'm a beginner with TensorRT and I've run into a problem I haven't been able to solve.

I convert my pth model to ONNX with Python and then convert it to TRT with trtexec. The conversion command is shown below:

```
./trtexec --tacticSources=-cublasLt,+cublas --verbose --onnx=./model.onnx --explicitBatch --saveEngine=./model.engine --workspace=1000
```

But when I test the TRT model, there is no speedup. So I checked the model with trtexec profiling; the command is shown below:

```
./trtexec --loadEngine=./model.engine --batch=1 --dumpProfile --profilingVerbosity=detailed --dumpLayerInfo
```

The result shows that one node at the end of the model, named `{ForeignNode[(Unnamed Layer* 1000) [LoopOutput][length][Constant]...Concat_326]}`, costs about 85% of the time. But that node should just do a 'concat' operation. Figures are shown below:

[screenshot: trtexec profiling output]

[screenshot: layer-wise timing showing the ForeignNode cost]

And the Netron ONNX graph is shown below:

[screenshot: Netron view of the ONNX graph]
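For reference, the pth-to-ONNX export in Python follows the usual `torch.onnx.export` pattern. Here is a minimal sketch; the stand-in network, input shape, and opset are placeholders, not the exact model from this issue:

```python
import torch
import torch.nn as nn

# Stand-in network; substitute the real model restored from the .pth checkpoint.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # placeholder NCHW input shape

torch.onnx.export(
    model,
    dummy_input,
    "./model.onnx",
    opset_version=11,            # assumption; pick an opset TensorRT supports
    input_names=["input"],
    output_names=["output"],
)
```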

Also, I have tried removing the concat layer from the pth model, but the time-consuming node still exists and just moves to the node before it.

I would really appreciate it if anybody could help...

zerollzeng commented 2 years ago

@ttyio Looks like a myelin bug?

ttyio commented 2 years ago

Yes. @Milesld, could you try 8.4? We have many fixes in Myelin. If it still fails, could you share the ONNX file with us? Thanks!

Milesld commented 2 years ago

> Yes. @Milesld, could you try 8.4? We have many fixes in Myelin. If it still fails, could you share the ONNX file with us? Thanks!

Hello, thanks for your reply. Actually, I'm using 8.4 now... I will upload the ONNX file later today, because it's not easy for me to upload files from the corporate intranet.

Milesld commented 2 years ago

> Yes. @Milesld, could you try 8.4? We have many fixes in Myelin. If it still fails, could you share the ONNX file with us? Thanks!

Hello, here is the link to the ONNX models: https://drive.google.com/file/d/18zcgRQyLhkRCCaYt-wMAqDkJISCuR0pp/view?usp=sharing It contains two models, temp.onnx and temp1.onnx. temp.onnx is the original model, and in temp1.onnx I added a concat layer after the last output layer. Looking forward to your reply!

Milesld commented 2 years ago

Oh, I forgot something... There's one node in the ONNX model, Clip_14, that needs to be modified. Its max value is inf and has to be set to a finite constant; I set it to 10000000000. The code I used is shown below:

```python
import onnx

onnx_model = onnx.load("./temp.onnx")
graph = onnx_model.graph
node = graph.node

# Find the Clip node whose output tensor is "331" and replace its
# max attribute (inf) with a large finite constant.
for i in range(len(node)):
    if node[i].op_type == "Clip":
        node_rise = node[i]
        if node_rise.output[0] == "331":
            node[i].attribute[0].f = 10000000000.

onnx.checker.check_model(onnx_model)
onnx.save(onnx_model, './temp_modify.onnx')
```
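(One caveat worth noting: this attribute-based edit only applies to models exported with an ONNX opset below 11. From opset 11 on, Clip takes min/max as optional inputs rather than attributes, so for newer exports the corresponding constant input would have to be replaced instead.)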

ttyio commented 2 years ago

@Milesld, I checked: the ForeignNode contains not only the concat node but also the GRU nodes, with many nodes surrounding them. The verbose log did not print all of it because there are too many nodes. So I think this is a false alarm; could you also take a look? Thanks!

Milesld commented 2 years ago

> @Milesld, I checked: the ForeignNode contains not only the concat node but also the GRU nodes, with many nodes surrounding them. The verbose log did not print all of it because there are too many nodes. So I think this is a false alarm; could you also take a look? Thanks!

Oh, I have checked that and you are right. It does look like a false alarm. So I wonder, is it common for a TRT model to run only as fast as the pth model, or even slower? And what can I do to make the TRT model run faster?

ttyio commented 2 years ago

@Milesld Just a reminder: did you use torch.cuda.Event to measure the PyTorch perf? See https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution
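Because CUDA kernels launch asynchronously, wall-clock timing without synchronization can make the PyTorch model look faster than it really is. A minimal sketch of event-based timing, with a placeholder model and input shape:

```python
import torch
import torch.nn as nn

# Placeholder network and input; substitute the real model and shape.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    # Warm up so one-time CUDA initialization is excluded from the measurement.
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    start.record()
    for _ in range(100):
        model(x)
    end.record()

# Wait for all queued kernels to finish before reading the timer.
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```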

To unleash the TRT perf, we can also try lower precision with --fp16/--int8, and we can try CUDA graphs with --useCudaGraph. We can get more perf gain if the GPU has Tensor Cores.
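For example, building on the earlier commands in this thread (same paths; note that --int8 would additionally need a calibrator or a QAT model for good accuracy):

```
# Build an FP16 engine from the same ONNX model
./trtexec --onnx=./model.onnx --fp16 --saveEngine=./model_fp16.engine

# Benchmark it with CUDA graphs enabled
./trtexec --loadEngine=./model_fp16.engine --useCudaGraph --dumpProfile
```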

Milesld commented 2 years ago

@ttyio OK, I will try these and close this issue. Thank you so much!

jinec commented 1 year ago

@Milesld Hello, have you solved this problem? I'm having a similar problem.

dizhenx commented 9 months ago

Hello, have you solved this problem? I'm having a similar problem.