NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

TensorRT 10.3 is 3+ times slower than PyTorch when running inference on A30 and 4090 GPUs #4097

Open CallmeZhangChenchen opened 2 weeks ago

CallmeZhangChenchen commented 2 weeks ago

Description

Under the same conditions, my model's inference with TensorRT is several times slower than with PyTorch.

Environment

TensorRT Version: 10.3.0 (trtexec reports [TensorRT v100300])

NVIDIA GPU: A30 & 4090

NVIDIA Driver Version: 535.104.05

CUDA Version: release 12.4, V12.4.131

CUDNN Version: **

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

https://drive.google.com/file/d/1V3wZFEyO6s3szE6tPhofa-bkY0Lqwu8M/view?usp=drive_link

Steps To Reproduce

./TensorRT-10.3.0.26/bin/trtexec --onnx=test_sim.onnx  --fp16 --shapes=phone:1x898x768,phone_lengths:1,pitch:1x898,pitchf:1x898,ds:1,rnd:1x192x898 --saveEngine=test.engine --builderOptimizationLevel=5
[08/26/2024-08:17:24] [I] GPU Compute Time: min = 817.994 ms, max = 820.003 ms, mean = 818.733 ms, median = 818.609 ms, percentile(90%) = 819.845 ms, percentile(95%) = 820.003 ms, percentile(99%) = 820.003 ms

PyTorch, with the same input/output sizes plus pre- and post-processing, needs only about 300 ms.
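For a like-for-like comparison, the PyTorch side should be timed with warm-up iterations and explicit GPU synchronization. A minimal sketch of such a measurement, assuming a hypothetical load_model() and a forward signature matching the ONNX input order (both are placeholders, not the project's actual API):

import torch

device = torch.device("cuda")
model = load_model().to(device).eval()  # hypothetical loader, stands in for the RVC model

# Dummy inputs matching the shapes passed to trtexec.
phone = torch.randn(1, 898, 768, device=device)
phone_lengths = torch.tensor([898], device=device)
pitch = torch.randint(0, 255, (1, 898), device=device)
pitchf = torch.randn(1, 898, device=device)
ds = torch.tensor([0], device=device)
rnd = torch.randn(1, 192, 898, device=device)

with torch.inference_mode():
    for _ in range(5):  # warm-up: exclude CUDA init and autotuning from the measurement
        model(phone, phone_lengths, pitch, pitchf, ds, rnd)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        model(phone, phone_lengths, pitch, pitchf, ds, rnd)
    end.record()
    torch.cuda.synchronize()
    print(f"mean GPU time: {start.elapsed_time(end) / 20:.1f} ms")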

CallmeZhangChenchen commented 2 weeks ago
[08/26/2024-08:05:39] [W] [TRT] Engine generation failed with backend strategy 4.
Error message: [randomFill.cpp::replaceFillNodesForMyelin::89] Error Code 2: Internal Error (Assertion node->backend == Backend::kMYELIN failed. ).
Skipping this backend strategy.

There was a warning when the model was converted.

CallmeZhangChenchen commented 2 weeks ago

[Image: screenshot showing the ForeignNode]

I think I found out why; I'll take the time to study it.

moraxu commented 2 weeks ago

According to the issue, the problem seems to be with a node that was offloaded to one of our backend DL graph compilers, so we can investigate it internally. Can you confirm the source of the screenshot showing the ForeignNode?

CallmeZhangChenchen commented 1 week ago

[Image: Nsight Systems timeline showing the 1.6 s operation]

I think I found out why; I'll take the time to study it. @moraxu Thanks for your attention.

Using nsys profile -o analysis_test trtexec ***, I exported a profile and opened it in Nsight Systems. One operation took 1.6 s.

The main time sink is between the input op pitchf and /dec/m_source/l_tanh/Tanh, so my workaround for now is to move this part out of the network and run it with PyTorch.
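As a complement to nsys, TensorRT's Python API can attribute time to individual engine layers by attaching an IProfiler to the execution context (fused Myelin regions show up as a single ForeignNode entry). A minimal sketch, assuming the engine built above was saved as test.engine and that I/O buffers are set up elsewhere:

import tensorrt as trt

class LayerTimer(trt.IProfiler):
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.times = {}
    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT after each layer executes.
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.WARNING)
with open("test.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
context.profiler = LayerTimer()

# ... allocate device buffers for each I/O tensor, then run the engine
#     synchronously, e.g. context.execute_v2(bindings) ...

# Print the ten most expensive layers.
for name, ms in sorted(context.profiler.times.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{ms:9.3f} ms  {name}")

trtexec can produce a similar per-layer breakdown with --dumpProfile.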

moraxu commented 1 week ago

@CallmeZhangChenchen , sorry for the late follow up - is this on Windows 10 or 11? If not, could you provide a specific OS version for us to reproduce?

CallmeZhangChenchen commented 5 days ago

@moraxu Thanks! OS version: Ubuntu 22.04.4 LTS

moraxu commented 4 days ago

I've filed an internal bug, thank you.

moraxu commented 3 days ago

@CallmeZhangChenchen , could you provide the PyTorch inference script as well? The issue is about a comparison with PyTorch: it could be that TRT has a bug, or that the PyTorch script is not actually doing the same workload.

Could you also provide the full trtexec --verbose log from your end, if possible?

CallmeZhangChenchen commented 3 days ago

@moraxu I may not be able to provide a complete runnable PyTorch script, because I have since optimized the code here; the model has gone from 800 ms down to 27 ms.

The original PyTorch project: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI

The ONNX export may not go smoothly; the export script is https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/infer/modules/onnx/export.py

trtexec --verbose log: https://drive.google.com/file/d/1Uc_m2gP9QhjussV-rkJRsLPp7AdE2XLE/view?usp=drive_link

To strip out the time-consuming code, I modified https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/infer/lib/infer_pack/models_onnx.py:

def forward(self, x, upp=None):
    # Bypass the sine generator, the slow path in the TRT engine:
    # sine_wavs, uv, _ = self.l_sin_gen(x, upp)
    # if self.is_half:
    #     sine_wavs = sine_wavs.half()
    # sine_merge = self.l_tanh(self.l_linear(sine_wavs))
    sine_merge = self.l_tanh(self.l_linear(x))
    return sine_merge, None, None  # noise, uv
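With that path removed from the exported graph, the excised sine-generator computation can be run in PyTorch and its result fed to the engine alongside the other inputs, per the workaround described above.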

This part takes only a few milliseconds in PyTorch.

Cropped ONNX model: https://drive.google.com/file/d/1ucjIDLpJfOMFIWVY8NKav6fa05KF4icd/view?usp=drive_link
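To time the slow region in isolation, the cropped model can also be run through onnxruntime as a cross-check. A minimal sketch, assuming the file is saved locally as test_cropped.onnx (hypothetical name) and that all inputs are float tensors; the real input names and shapes are read from the model itself:

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("test_cropped.onnx",  # hypothetical local filename
                            providers=["CUDAExecutionProvider"])

# Build dummy feeds from the model's declared inputs (assumes float inputs only).
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 898 for d in inp.shape]  # guess 898 for dynamic dims
    dtype = np.float16 if "float16" in inp.type else np.float32
    feeds[inp.name] = np.random.randn(*shape).astype(dtype)

for _ in range(5):  # warm-up
    sess.run(None, feeds)
t0 = time.perf_counter()
for _ in range(20):
    sess.run(None, feeds)
print(f"mean: {(time.perf_counter() - t0) / 20 * 1000:.1f} ms")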

moraxu commented 2 days ago

Thank you, I'll pass the info on