NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Inference failure of TensorRT 10.0.x.x when running my internal model on GPU(T4, A100) #4066

Open kimdwkimdw opened 1 month ago

kimdwkimdw commented 1 month ago

Description

After updating to TensorRT 10.0.1.6, we expected the previously reported issue to be resolved. Unfortunately, not only does the issue persist, but the model’s outputs have deteriorated even further. Specifically, all output values are now NaN, making it impossible to use our models. This affects both fp16 and fp32 precision, leaving the model completely non-functional.

https://github.com/NVIDIA/TensorRT/issues/3292

Environment

TensorRT Version: All versions of 10.0.x.x. NGC Containers 24.05~24.07.

NVIDIA GPU: T4, A100

NVIDIA Driver Version: 550.90.07

CUDA Version: 12.4

CUDNN Version: x

Operating System:

Container (if so, version): NGC Containers 23.03 and 24.07. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

Fails on Polygraphy.
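For reference, the comparison was roughly along these lines (model.onnx stands in for our internal model and its two inputs, source and wav_lens):

polygraphy run model.onnx --trt --onnxrt --input-shapes source:[2,160000] wav_lens:[2,1]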

related to https://github.com/NVIDIA/TensorRT/issues/3292

lix19937 commented 1 month ago

Can you provide the full log from trtexec --verbose?
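For example, something along these lines should capture it (model.onnx is a placeholder for your model; add --fp16 only if that is how you build the engine):

trtexec --onnx=model.onnx --verbose 2>&1 | tee trtexec_verbose.log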

kimdwkimdw commented 1 month ago

@lix19937 Can I send this to you via email? I prefer not to expose my model publicly. The --verbose option also reveals too much information.

lix19937 commented 1 month ago

@kimdwkimdw ok, my email is jevenlee2016@foxmail.com

kimdwkimdw commented 1 month ago

@lix19937 I've sent it gzipped.

lix19937 commented 1 month ago

@kimdwkimdw I didn't receive your e-mail.

kimdwkimdw commented 1 month ago

> @kimdwkimdw I didn't receive your e-mail.

Please check your spam inbox.

I've sent the mail via Gmail.

lix19937 commented 3 weeks ago

Can you upload it to Google Drive?

kimdwkimdw commented 3 weeks ago

> Can you upload it to Google Drive?

OK, I've uploaded the log file to Google Drive and shared it with your email.

kimdwkimdw commented 1 day ago

Any updates?

lix19937 commented 1 day ago

@kimdwkimdw So sorry! Currently I have no environment. Can you upload the full log from trtexec --verbose 2>&1 | tee full_log to Google Drive and share it with me? I will analyze it as soon as possible.

kimdwkimdw commented 1 day ago

@lix19937 Let me know your Google account for Google Drive sharing. Your email address jevenlee2016@foxmail.com doesn't seem to work.

kimdwkimdw commented 1 day ago

This is the same kind of issue as https://github.com/NVIDIA/TensorRT/issues/3292.

TensorRT 10.x has significant errors.

cc. @zerollzeng @ttyio

lix19937 commented 1 day ago

> @lix19937 Let me know your Google account for Google Drive sharing. Your email address jevenlee2016@foxmail.com doesn't seem to work.

Please send the log to hblijinwen@126.com.

kimdwkimdw commented 1 day ago

@lix19937 I've sent it to hblijinwen@126.com

lix19937 commented 22 hours ago

@kimdwkimdw

> but the model’s outputs have deteriorated even further. Specifically, all output values are now NaN, making it impossible to use our models.

From your logs, there are no relevant errors or warnings. Maybe you can use polygraphy, as follows:

polygraphy run model.onnx --trt  --onnxrt --input-shapes source:[2,160000] wav_lens:[2,1]

to check which layer first produces the large NaN/diff, and whether there is a BN right after a conv, etc. You can also check the min-max range of the weights.
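For example, a sketch like the following marks every tensor as an output on both the TensorRT and ONNX-Runtime sides and stops at the first comparison failure, and the inspect command dumps the weights so you can eyeball their range (model.onnx is a placeholder; double-check the flags against your Polygraphy version):

polygraphy run model.onnx --trt --onnxrt --input-shapes source:[2,160000] wav_lens:[2,1] --trt-outputs mark all --onnx-outputs mark all --fail-fast

polygraphy inspect model model.onnx --show layers weights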

Alternatively, you can try the latest version.

kimdwkimdw commented 22 hours ago

@lix19937

Thank you for the suggestion, but I have already tried polygraphy along with other methods. My question goes back to the root of the issue: why do NaN values start appearing in the relative-difference output from polygraphy when using TensorRT 10.x?

I did not encounter this issue with TensorRT 8.5.3 (e.g., NGC containers 23.02 and 23.03), where there were no NaN values. However, starting from TensorRT 8.6.1.6 the errors became more pronounced, and with all versions of TensorRT 10 (e.g., 10.3.0.26, 10.2.0, 10.1.0) the model’s errors blow up dramatically.

lix19937 commented 22 hours ago

In my opinion, from 8.6 TensorRT added more features, such as the builder optimization level (the default optimization level is 3; valid values are integers from 0 up to the maximum level of 5) and important LLM layer fusions (e.g., MHA, LayerNorm). For normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers. The CUDA-X library dependencies also changed.

You can try adding --builderOptimizationLevel=5 --noTF32 and adjusting the size of --memPoolSize. @kimdwkimdw
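For example, a rebuild roughly like this (model.onnx and the 4096 MiB workspace are placeholders; adjust them to your model and GPU):

trtexec --onnx=model.onnx --builderOptimizationLevel=5 --noTF32 --memPoolSize=workspace:4096 --verbose 2>&1 | tee rebuild_verbose.log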