NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

In trt10.0.1, these two APIs: setPrecision and setOutputType do not work #3941

Open 2730gf opened 3 months ago

2730gf commented 3 months ago

Description

We have a model that overflows when using fp16, so we use per-layer precision constraints to keep some layers in fp32. This worked in version 8.6 and inference produced correct results, but after upgrading to 10.0.1 the model output overflows. Using Polygraphy, we found that nan is already generated at the first overflow location. Are setPrecision and setOutputType being ignored?
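For context on why a single fp16 layer can poison the whole output: fp16 saturates above 65504, so an intermediate result that exceeds it becomes inf, and a later subtraction or multiplication turns that into nan. A minimal NumPy sketch (not from the original report) illustrating the mechanism:

```python
import numpy as np

# fp16 overflows above 65504, producing inf;
# inf - inf then yields nan, which propagates to every downstream layer.
x = np.float16(300.0)
y = x * x          # 90000 exceeds the fp16 max of 65504 -> inf
z = y - y          # inf - inf -> nan

print(y)           # inf
print(z)           # nan

# The same computation in fp32 is exact, which is why pinning the
# overflowing layers to fp32 avoids the problem.
print(np.float32(300.0) * np.float32(300.0))  # 90000.0
```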

Environment

TensorRT Version: 10.0.1
NVIDIA GPU: 3090 & 3080
NVIDIA Driver Version: 550
CUDA Version: 12.2

Steps To Reproduce

My code looks like this:

for (int32_t layerIdx = 0; layerIdx < network.getNbLayers(); ++layerIdx) {
    auto *layer = network.getLayer(layerIdx);
    auto const layerName = layer->getName();
    nvinfer1::DataType dataType;
    // matchLayerPrecision() decides whether this layer's precision should be pinned.
    if (matchLayerPrecision(layerPrecisions, layerName, &dataType)) {
        layer->setPrecision(dataType);
        int32_t const layerOutNb = layer->getNbOutputs();
        for (int32_t outputIdx = 0; outputIdx < layerOutNb; ++outputIdx) {
            layer->setOutputType(outputIdx, dataType);
        }
    }
}

By the way, I have already set kOBEY_PRECISION_CONSTRAINTS: `env.config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);`

lix19937 commented 3 months ago

I suggest using trtexec --layerOutputTypes=spec --layerPrecisions=spec --precisionConstraints=spec --fp16 --verbose --onnx=spec

2730gf commented 3 months ago

I have done this; it works on 8.6 but fails on 10.0.1:

export layer_precision="p2o.Pow.0:fp32,p2o.Pow.2:fp32..."
trtexec  --fp16 --onnx=sample.onnx --precisionConstraints="obey" --layerPrecisions=${layer_precision} --layerOutputTypes=${layer_precision}  --saveEngine=sample.trt
trtexec --loadEngine=sample.trt  --dumpOutput --loadInputs=... 
lix19937 commented 3 months ago

On trt10.0.1, try to use

trtexec  --fp16 --onnx=sample.onnx --precisionConstraints="obey" --layerPrecisions=${layer_precision} --layerOutputTypes=${layer_precision}  --saveEngine=sample.trt --builderOptimizationLevel=5
2730gf commented 3 months ago

I have added --builderOptimizationLevel=5, but it still overflows

lix19937 commented 3 months ago

You can compare the tactics chosen by the two versions.

2730gf commented 3 months ago

Thank you very much for your reply. After setting builderOptimizationLevel to 5, the engine cannot be built on TRT 8.6 but can be built on TRT 10. On TRT 10 I can see the tactic name: sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x32x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4; judging from the name, this is already an fp32 kernel? Is there any other way to narrow down the problem?
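One generic way to narrow this down is to dump per-layer outputs (e.g. with Polygraphy, as mentioned earlier in the thread) and scan them in execution order for the first tensor containing nan or inf; everything before that layer is healthy, and that layer is the one whose precision needs pinning. A self-contained sketch of the scan (the layer names and values here are hypothetical, for illustration only):

```python
import math

def first_bad_layer(layer_outputs):
    """Return the name of the first layer whose output contains nan or inf.

    layer_outputs: list of (layer_name, flat list of floats), in execution order.
    """
    for name, values in layer_outputs:
        if any(math.isnan(v) or math.isinf(v) for v in values):
            return name
    return None

# Hypothetical per-layer dump: the Pow layer overflows first,
# and the nan it produces propagates to later layers.
dump = [
    ("p2o.Conv.0", [0.5, 1.2, -3.4]),
    ("p2o.Pow.0", [7.0e4, float("inf")]),
    ("p2o.Add.1", [float("nan")]),
]
print(first_bad_layer(dump))  # p2o.Pow.0
```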

ttyio commented 1 month ago

@2730gf have you also tried a strongly typed network? See

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#strongly-typed-networks

thanks!

2730gf commented 1 month ago

@ttyio Thank you for your reply. I found that after turning on this option, inference runs in bf16 instead of fp16. Compared with fp16 there is no overflow, but latency regresses significantly. Is there a way to pin the precision exactly?
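The bf16-vs-fp16 tradeoff observed here is range versus precision: bf16 keeps fp32's 8-bit exponent (so the overflow disappears) but has only 7 mantissa bits, versus fp16's 5-bit exponent and 10 mantissa bits. A NumPy sketch that emulates bf16 by truncating the low 16 bits of a float32 (an illustration of the number format only; this is not how TensorRT's kernels are implemented):

```python
import numpy as np

def to_bf16(x):
    """Emulate bf16 (truncation rounding) by keeping the top 16 bits of a float32."""
    bits = np.float32(x).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Range: 90000 overflows fp16 (max 65504) but is representable in bf16.
print(np.float16(90000.0))   # inf
print(to_bf16(90000.0))      # 89600.0 -- in range, but coarsely rounded

# Precision: fp16 resolves 1.001 far better than bf16.
print(np.float16(1.001))     # ~1.001 (10 mantissa bits)
print(to_bf16(1.001))        # 1.0    (only 7 mantissa bits)
```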