DingJuPeng1 opened 2 years ago
The model runs correctly in FP32 mode; the only thing I changed was adding the "--fp16" flag when using trtexec to convert the ONNX model to a TRT file.
Can you share the onnx file here?
I'm sorry, the ONNX file can't be shared because of company policy. Could you describe the correct process for running inference with FP16 (ONNX -> TRT -> engine)? Thanks.
@DingJuPeng1 You can try the Polygraphy tool (https://github.com/NVIDIA/TensorRT/blob/main/tools/Polygraphy/how-to/debug_accuracy.md) to see which layer(s) produce wrong results. I would also suggest trying a newer TRT version, such as TRT 8.4 GA (8.4.1).
Another possibility is to make sure the input tensor values and the weights are in "reasonable ranges" (such as [-1.0, 1.0]), because larger values tend to overflow in FP16.
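To make that range check concrete, here is a minimal standalone sketch (plain C++, not from the original thread; function name and example data are hypothetical). It scans a float buffer for values that would overflow FP16 (magnitude above 65504) or land in the FP16 subnormal range (non-zero magnitude below 2^-14, about 6.1e-5), which is the same condition the "Subnormal FP16 values detected" warning quoted later in this thread refers to:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// FP16 (IEEE 754 half) limits: largest normal value ~65504, smallest normal ~6.10e-5.
constexpr float kFp16Max = 65504.0f;
constexpr float kFp16MinNormal = 6.103515625e-05f;  // 2^-14

// Count values in a float buffer that would overflow or become subnormal in FP16.
void checkFp16Range(const std::vector<float>& values, const char* name) {
    std::size_t overflow = 0, subnormal = 0;
    for (float v : values) {
        const float mag = std::fabs(v);
        if (mag > kFp16Max)                            ++overflow;
        else if (mag != 0.0f && mag < kFp16MinNormal)  ++subnormal;
    }
    std::printf("%s: %zu overflow, %zu subnormal out of %zu values\n",
                name, overflow, subnormal, values.size());
}

int main() {
    // Hypothetical example data; in practice pass your weight or input tensors here.
    std::vector<float> weights = {0.5f, -70000.0f, 1e-6f, 3.0f};
    checkFp16Range(weights, "weights");
    return 0;
}
```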
Thank you, I'll try these later.
I ran into a similar problem. Before 8.4 GA, TRT never reported the issue, so I never realized how differently the FP16 model performs: it gives very similar results in most cases, but suddenly fails badly when values hit the FP16 bound and get clamped.
Now, with 8.4 GA, the engine build explicitly reports which layers are problematic:
[07/18/2022-18:35:34] [TRT] [W] - Subnormal FP16 values detected.
[07/18/2022-18:35:34] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
My questions are:
1. I understand that we have the option of manually setting the precision for each layer. But if there are too many layers with this kind of issue, is there a "once-for-all" flag to keep all of these layers in FP32?
2. It seems like we are treating FP16 even more conservatively than INT8: there are multiple reports about FP16 results being inaccurate, yet there isn't any info about "FP16 quantization or calibration", which I imagine should be much easier than INT8 quantization. Or does the current INT8 calibration scheme also work for FP16?
3. As the warning message suggests, what is the recommended way to train a model so that it is strictly FP16-compatible? My model is already trained with mixed precision, but there isn't much information about FP16 training, or about FP32 training with a forced FP16 value range.
Thanks!
Hello deephog, can you share more about "I understand that we have the option of manually setting the precision for each layer. But if there are too many layers with this kind of issue, is there a 'once-for-all' flag to keep all of these layers in FP32?" I added the following code to onnx-tensorrt to set all layers to FP32, but it doesn't seem to work:

```cpp
for (int i = 0; i < trt_network->getNbLayers(); i++) {
    auto layer = trt_network->getLayer(i);
    std::string layerName = layer->getName();
    cout << "process " << layerName << endl;
    auto layer_type = layer->getType();
    auto layer_precision = layer->getPrecision();

    // Skip layer types and layers that should keep their original precision.
    if (layer_type == nvinfer1::LayerType::kSHAPE ||
        layer_type == nvinfer1::LayerType::kIDENTITY ||
        layer_type == nvinfer1::LayerType::kSHUFFLE ||
        layer_type == nvinfer1::LayerType::kSLICE ||
        layer_type == nvinfer1::LayerType::kCONCATENATION) {
        continue;
    }
    if (layer_precision == nvinfer1::DataType::kINT32) {
        continue;
    }
    if (layerName == "Tile") {
        continue;
    }

    // Request FP32 for everything else.
    layer->setPrecision(nvinfer1::DataType::kFLOAT);
    cout << "Set " << layerName << " to FP32 mode" << endl;
}
```
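For reference, layer->setPrecision() is only a preference; the builder is free to ignore it unless type constraints are also enforced on the builder config. A minimal sketch of enabling that, assuming TensorRT 8.2+ headers (the helper function name here is just illustrative):

```cpp
#include <NvInfer.h>

// Sketch: make the builder honor per-layer precision requests made via layer->setPrecision().
void enableStrictPrecision(nvinfer1::IBuilderConfig* config) {
    config->setFlag(nvinfer1::BuilderFlag::kFP16);  // still allow FP16 kernels where permitted
    // TensorRT 8.2+: enforce precisions set via setPrecision()/setOutputType().
    config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);
    // On TensorRT 7.x / early 8.x, the (now deprecated) equivalent is:
    // config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);
}
```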
Thanks!
CUDA: 11.1, TensorRT: 7.2.1.6, cuDNN: 8.0
First I used trtexec to convert an ONNX model to a TRT file with the "--fp16" flag, then I deserialized the TRT file and created the engine and context. The output does not match the FP32 model's inference results, and the final result is wrong as well. What is the right way to use "--fp16" mode in TRT?
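For context, the deserialize-and-run part of that flow typically looks like the sketch below (assuming the TensorRT C++ runtime API; the engine path is a placeholder and the buffer/enqueue details are omitted):

```cpp
#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("[TRT] %s\n", msg);
    }
};

int main() {
    // Load the serialized engine, e.g. one produced by:
    //   trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.trt
    std::ifstream file("model_fp16.trt", std::ios::binary);  // placeholder path
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    Logger logger;
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // ... allocate device buffers, copy FP32 inputs in, run inference via the
    // execution context, and compare the outputs against the FP32 engine ...
    (void)context;
    return 0;
}
```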
Looking forward to your reply.