NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

BERT fp16 accuracy problem #1196

Closed chenzhanyiczy closed 2 years ago

chenzhanyiczy commented 3 years ago

Description

When I use TRT to build an FP16 engine, the inference accuracy differs too much from FP32. The model is BERT-base. Why?

Environment

TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 440.59
CUDA Version: 10.2
CUDNN Version: 8.0.4
Operating System: CentOS 7
Python Version (if applicable): 3.6
Tensorflow Version (if applicable): 1.15.4
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Steps To Reproduce

Proceed as follows:

1. tf (frozen graph) -> onnx (version: 1.8.1) -> trt engine

2. when building the trt engine, set these parameters:

    with builder.create_builder_config() as config:
        config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
        ...

3. at the same time, I also tried to set FP32 precision on some layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu and so on):

    network.get_layer(i).precision = trt.DataType.FLOAT

BUT it has no effect.

I also found something very strange: when I compared layer0 and layer1, the accuracy is not much different, but at layer2 there is a big difference. The model has 12 layers and every layer has the same structure.
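For context, a minimal sketch of the build flow described above (the path, workspace size, and layer-name filters are placeholders, not taken from the original script):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser, \
         builder.create_builder_config() as config:

        with open("bert_base.onnx", "rb") as f:        # placeholder path for the tf->onnx export
            parser.parse(f.read())

        config.max_workspace_size = 1 << 30
        config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)

        # optionally pin accuracy-sensitive layers back to FP32
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            if "LayerNorm/moments/SquaredDifference" in layer.name \
                    or "intermediate/dense/Erf" in layer.name:
                layer.precision = trt.DataType.FLOAT
                for idx in range(layer.num_outputs):
                    layer.set_output_type(idx, trt.DataType.FLOAT)

        with builder.build_engine(network, config) as engine:
            pass  # serialize or run the engine here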

ttyio commented 3 years ago

Hello @chenzhanyiczy, what metric do you use to verify the accuracy? Usually we need some benchmark, e.g. the F1 score on SQuAD. Sometimes a bit-level mismatch won't hurt the final accuracy.

Also, have you set strict types when trying mixed precision?

    config.flags = config.flags | 1<<int(trt.BuilderFlag.STRICT_TYPES)

If we want to experiment with accuracy-sensitive layers, sometimes we also need to set the input (the output of the previous layer) to FP32:

    prev_layer.get_output(0).dtype = trt.DataType.FLOAT

Another experiment worth doing: generate an ONNX model with FP16 weights and try running it with onnxruntime. This is the upper bound you can get if you run your model entirely in FP16. From there, you can focus on running more layers in FP32 precision to meet a higher accuracy requirement.
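Putting those pieces together, a minimal sketch (the index `i` and `prev_layer` are illustrative; they stand for an accuracy-sensitive layer and the producer of its input):

    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES)

    sensitive = network.get_layer(i)                      # e.g. a LayerNorm/moments layer
    sensitive.precision = trt.DataType.FLOAT              # compute in FP32
    sensitive.set_output_type(0, trt.DataType.FLOAT)      # keep its output in FP32
    prev_layer.get_output(0).dtype = trt.DataType.FLOAT   # feed it an FP32 input as well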

ttyio commented 3 years ago

@chenzhanyiczy, could you do another experiment and make the whole GELU expression run in FP32 precision? Thanks!

chenzhanyiczy commented 3 years ago

Hello @chenzhanyiczy, what metric do you use to verify the accuracy? Usually we need some benchmark, e.g. the F1 score on SQuAD. Sometimes a bit-level mismatch won't hurt the final accuracy.

Also, have you set strict types when trying mixed precision?

    config.flags = config.flags | 1<<int(trt.BuilderFlag.STRICT_TYPES)

If we want to experiment with accuracy-sensitive layers, sometimes we also need to set the input (the output of the previous layer) to FP32:

    prev_layer.get_output(0).dtype = trt.DataType.FLOAT

Another experiment worth doing: generate an ONNX model with FP16 weights and try running it with onnxruntime. This is the upper bound you can get if you run your model entirely in FP16. From there, you can focus on running more layers in FP32 precision to meet a higher accuracy requirement.

Yes, I also use this flag (STRICT_TYPES) and set the previous layer's output type to FLOAT, but the accuracy still differs a lot.

The builder code is similar to the following. Suppose I want to check the output of layer_2/output/LayerNorm/moments/variance in layer_2; the previous node of this node is SquaredDifference. The strange thing is that the output of this node (variance) in layer0 and layer1 is fine; in other words, their accuracy is good.

    if network.get_layer(i).name.find("output/LayerNorm/moments/SquaredDifference") != -1 \
            or network.get_layer(i).name.find("intermediate/dense/Erf") != -1:
        for idx in range(network.get_layer(i).num_outputs):
            network.get_layer(i).set_output_type(idx, trt.DataType.FLOAT)
        network.get_layer(i).precision = trt.DataType.FLOAT
    ....
    with builder.create_builder_config() as config:
        config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
        ....
        with builder.build_engine(network, config) as engine:
            ...

The network structure is shown in the attached image (query_emb_model2).

How do I generate FP16 weights in ONNX? I did not find such a function in ONNX; can you provide a reference or a script? Thanks.

chenzhanyiczy commented 3 years ago

@chenzhanyiczy, could you do another experiment and make the whole GELU expression run in FP32 precision? Thanks!

No, because the output of layer2 is already different. Also, the activation function in the pooler layer is TanH, not GELU.

ttyio commented 3 years ago

@chenzhanyiczy the ONNX FP16 generation should look like this PyTorch example: https://github.com/onnx/onnx-tensorrt/issues/235

I see you have Erf; is it for GELU? If it is hard to match the patterns, you could try marking all the tanh, pow, and softmax nodes to run in FP32 precision.
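For the FP16-weights ONNX experiment, one common approach is the onnxconverter-common float16 converter; this is a sketch that assumes the package is installed and is not necessarily what the linked issue uses:

    import onnx
    from onnxconverter_common import float16

    model = onnx.load("bert_base.onnx")                    # placeholder path
    model_fp16 = float16.convert_float_to_float16(model)   # cast FP32 initializers/tensors to FP16
    onnx.save(model_fp16, "bert_base_fp16.onnx")
    # running bert_base_fp16.onnx with onnxruntime then gives the all-FP16 accuracy upper bound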

chenzhanyiczy commented 3 years ago

@chenzhanyiczy the ONNX FP16 generation should look like this PyTorch example: onnx/onnx-tensorrt#235

I see you have Erf; is it for GELU? If it is hard to match the patterns, you could try marking all the tanh, pow, and softmax nodes to run in FP32 precision.

I use TensorFlow. Do you have a TensorFlow example? I have tried things such as setting all the ops of the batchNorm part to FP32, but with no effect. The results of layer0 and layer1 are OK (see above), so why is the result of layer2 different? Their structure is the same.

chenzhanyiczy commented 3 years ago

@ttyio Do you have an example of generating a BERT engine through automatic TRT conversion, rather than the demo BERT?

ttyio commented 3 years ago

@chenzhanyiczy

The results of layer0 and layer1 are OK (see above), so why is the result of layer2 different?

Have you checked the output data range distribution for each layer in each encoder? Is it possible that encoder0 and encoder1 stay within the FP16 range, but we overflow FP16 starting from encoder2?

Do you have a TensorFlow example?

Sorry, no.

I have tried things such as setting all the ops of the batchNorm part to FP32, but with no effect.

Not batchNorm; could you set FP32 for the tanh, pow, and softmax?

Do you have an example of generating a BERT engine through automatic TRT conversion, rather than the demo BERT?

Sorry, no.

chenzhanyiczy commented 3 years ago

@ttyio

Have you checked the output data range distribution for each layer in each encoder? Is it possible that encoder0 and encoder1 stay within the FP16 range, but we overflow FP16 starting from encoder2?

The structure of each layer is attention -> intermediate -> output, just like bert-base. I checked the output of layer_2/output/LayerNorm/moments/SquaredDifference under FP32 and FP16 respectively, and they are basically the same. BUT the outputs of layer_2/output/LayerNorm/moments/variance are totally different (infinitesimal under FP16).
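For what it's worth, the "infinitesimal under FP16" symptom is consistent with the FP16-range concern raised above: squared differences smaller than FP16's tiniest subnormal (about 6e-8) flush to zero, so a variance accumulated in half precision can collapse toward 0. A hypothetical illustration with made-up values, not taken from the model:

    import numpy as np

    x32 = np.random.normal(loc=0.0, scale=1e-4, size=768).astype(np.float32)
    x16 = x32.astype(np.float16)

    var32 = np.mean((x32 - x32.mean()) ** 2)                      # ~1e-8, representable in FP32
    var16 = np.mean((x16 - x16.mean()) ** 2, dtype=np.float16)    # most squares underflow to 0
    print(var32, var16)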

could you set FP32 for the tanh, pow, and softmax?

Yes, with no effect.

ttyio commented 3 years ago

@chenzhanyiczy, do you have the verbose log from when tanh, pow, and softmax are all in FP32? I want to make sure these nodes really run in FP32 precision.

chenzhanyiczy commented 3 years ago

do you have the verbose log from when tanh, pow, and softmax are all in FP32? I want to make sure these nodes really run in FP32 precision.

The verbose file is very large; take softmax as an example:

[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_0/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_1/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_2/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_0/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 212) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 213) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_0/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_0/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_0/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_0/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_1/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 381) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 382) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_1/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_1/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_1/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_1/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_2/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 550) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 551) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_2/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_2/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_2/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_2/attention/self/MatMul_1:0[Half(12,32,64)]

The build code is as follows:

    if network.get_layer(i).name.find("attention/self/Softmax") != -1:
        for idx in range(network.get_layer(i).num_outputs):
            network.get_layer(i).set_output_type(idx, trt.DataType.FLOAT)
        network.get_layer(i).precision = trt.DataType.FLOAT
    ....
    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    ...

After additionally adding this line:

    config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES)

the softmax verbose output is:

[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 212) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 214) [Shuffle] reformatted input 0 ((Unnamed Layer* 213) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 381) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 383) [Shuffle] reformatted input 0 ((Unnamed Layer* 382) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 550) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 552) [Shuffle] reformatted input 0 ((Unnamed Layer* 551) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 212) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_0/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 213) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 214) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 213) [Softmax]_output[Float(32)] -> (Unnamed Layer* 214) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_0/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_0/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_0/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_0/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 381) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_1/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 382) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 383) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 382) [Softmax]_output[Float(32)] -> (Unnamed Layer* 383) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_1/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_1/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_1/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_1/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 550) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_2/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 551) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 552) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 551) [Softmax]_output[Float(32)] -> (Unnamed Layer* 552) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_2/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_2/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_2/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_2/attention/self/MatMul_1:0[Half(12,32,64)]

ttyio commented 3 years ago

@chenzhanyiczy How did you grep all the tanh and pow nodes? A general way is to check

  network.get_layer(i).type

You can first leave only conv/gemm in FP16 precision, and have the rest of the nodes all run in FP32.

chenzhanyiczy commented 3 years ago

@ttyio

How did you grep all the tanh and pow nodes?

These ops are in the pooler layer. The current accuracy difference is in the layer_xxx layers.

You can first leave only conv/gemm in FP16 precision, and have the rest of the nodes all run in FP32.

I tried this:

    if network.get_layer(i).type == trt.LayerType.FULLY_CONNECTED \
            or network.get_layer(i).type == trt.LayerType.MATRIX_MULTIPLY \
            or network.get_layer(i).type == trt.LayerType.SOFTMAX:
        network.get_layer(i).precision = trt.DataType.HALF
    ...

with the other layers in FP32 (the default). When building, I still have to specify the FP16 flag, otherwise it reports this error: [TensorRT] ERROR: fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder. But specifying FP16 causes all layers to run under FP16...

ttyio commented 3 years ago

@chenzhanyiczy here's the correct way to use mixed precision:

  1. add the FP16 flag and STRICT_TYPES in the builder config
  2. set a higher precision for the nodes that you do not want to run in lower precision:

    if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and ....:
        network.get_layer(i).precision = trt.DataType.FLOAT
        network.get_layer(i).get_output(0).dtype = trt.DataType.FLOAT

chenzhanyiczy commented 3 years ago

@ttyio I tried it; now the result of layer0 is not accurate anymore either. strict_type is meant to restrict TRT's kernel selection, and directly forcing the FP16 type is more harmful. Can you try building bert-base automatically with TRT? Our model is also based on bert-base.

ttyio commented 3 years ago

@chenzhanyiczy If you set the FP16 flag in the builder and mark all layers as FP32 using the code in https://github.com/NVIDIA/TensorRT/issues/1196#issuecomment-822331771, the engine should run all layers in FP32, so why is it more harmful?

chenzhanyiczy commented 3 years ago

@ttyio I tried the following (assume the output being checked is still layer_2/output/LayerNorm/moments/variance):

  1. FP16 mode + strict_type: FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have both precision and output type set to FP16, and the remaining ops have precision and output type set to FP32.

  2. FP16 mode only: FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have both precision and output type set to FP16, and the remaining ops have precision and output type set to FP32.

  3. FP32 mode: FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have both precision and output type set to FP16; the builder reports the following error: [TensorRT] ERROR: fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder

Either way, the result is wrong. Case 2 is better than case 1, because case 1 is already wrong at layer_0/output/LayerNorm/moments/variance, while case 2 is wrong at layer_2/output/LayerNorm/moments/variance.

I don't understand which object strict_type acts on. For example:

    config.flags = 1 << int(trt.BuilderFlag.FP16) | 1 << int(trt.BuilderFlag.STRICT_TYPES)
    if network.get_layer(i).type == trt.LayerType.SOFTMAX:
        network.get_layer(i).precision = trt.DataType.FLOAT
        network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)
    ...

Does strict_type here restrict the other layers' precision and output to FP16, or restrict the precision and output of softmax to FP32?

ttyio commented 3 years ago

@chenzhanyiczy let me explain strict_type. When we set a precision flag in the builder config, this tells TRT which precision, besides FP32, is also allowed for running all the nodes in the network, and TRT finally selects the fastest kernels. When a layer has a specified precision and strict_type is not added, this only changes the fusion logic in TRT; TRT ignores the precision and still selects the fastest kernels. When a layer has a specified precision and strict_type is added, it also impacts the final kernel selection: a kernel that matches the user's precision requirement will be selected, even if it is not the fastest one.

Back to your experiments: the precision setting in case 2 is ignored in the final kernel selection; case 3 failed, and the error message already tells us why: some layer has an FP16 requirement, but FP16 is not enabled in the builder config.

chenzhanyiczy commented 3 years ago

@ttyio I'm a bit confused. For example, with the FP16 flag + strict_type, I set the precision of the softmax layer to FP32, like this: softmax(layer).precision = trt.DataType.FLOAT

  1. Will softmax choose an FP32 kernel?
  2. For the other ops whose precision is not set manually (in the network parsed by onnxParse()), what is the behavior? Will they all choose FP16 kernels?

ttyio commented 3 years ago

@chenzhanyiczy the code should look like this:

  softmax(layer).precision = trt.DataType.FLOAT
  softmax(layer).get_output(0).dtype = trt.DataType.FLOAT
  1. yes
  2. they choose the fastest path

chenzhanyiczy commented 3 years ago

@ttyio Thanks. So strict_type only takes effect for layers whose precision and output type have been set manually, right?

And back to the original case (the output of layer_2/output/LayerNorm/moments/variance): what should I do? I have tried almost everything possible.

ttyio commented 3 years ago

Hello @chenzhanyiczy Since FP32 precision works, I suppose that setting both strict_type and FP16 in the builder flags and marking all layers to run in FP32 should also work. Then we can use this as a baseline and move more layers to FP16 precision; finally we get a network with mixed precision, where all the sensitive layers run in FP32 and the remaining layers run in FP16. This is the first step; you can start with this (see the sketch below), thanks!
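A sketch of that first step, under the assumption that layers whose outputs are INT32 (e.g. shuffles that compute indices, as discussed later in this thread) are skipped, and that a hypothetical allow_fp16 set is later grown group by group to find the sensitive layers:

    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES)

    allow_fp16 = set()        # indices of layers already verified to be safe in FP16
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if i in allow_fp16 or layer.get_output(0).dtype == trt.DataType.INT32:
            continue          # leave FP16-safe layers and index-computing layers alone
        layer.precision = trt.DataType.FLOAT
        for idx in range(layer.num_outputs):
            layer.set_output_type(idx, trt.DataType.FLOAT)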

chenzhanyiczy commented 3 years ago

@ttyio

Since FP32 precision works, I suppose that setting both strict_type and FP16 in the builder flags and marking all layers to run in FP32 should also work.

I tried two cases: strict_type + FP16 mode + all layers run in FP32 (layer precision and output type), and plain FP32 mode. The results of the two still differ a lot. This is still the original case (the output of layer_2/output/LayerNorm/moments/variance). Why is that?

ttyio commented 3 years ago

@chenzhanyiczy, could you provide the verbose build logs for the two runs? Thanks.

chenzhanyiczy commented 3 years ago

@chenzhanyiczy, could you provide the verbose build logs for the two runs? Thanks.

@ttyio OK. The following files are for FP32 mode (the default behavior) and for FP16 mode + strict_type + all layers FP32 (precision and output type). Thanks. build_fp32_layer2_output_LayerNorm_moments_variance.tar.gz build_fp16_layer2_output_LayerNorm_moments_variance.tar.gz

ttyio commented 3 years ago

Hello @chenzhanyiczy, check the Engine Layer Information section of the log: there are still layers not in FP32. For some layers like the onehot plugin, I think you only need to set the output type, because float is not acceptable as the layer precision. The MatMul layers before and after the GELU are also in FP16; you can grep for dense/Erf to find the GELU, then check the MatMul layers before and after it, and you will see they run from half to half. Could you make sure they are all correctly set? Thanks!

chenzhanyiczy commented 3 years ago

@ttyio Yes, some are still FP16, because the builder prints these warnings:

[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/intermediate/dense/MatMul + text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + text/bert/encoder/layer_2/intermediate/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation

And some reformats are automatically added:

[TensorRT] VERBOSE: Adding reformat layer: PWN(PWN(PWN(text/bert/encoder/layer_0/intermediate/dense/add/x:0_14 + (Unnamed Layer* 442) [Shuffle], PWN(PWN(PWN(text/bert/encoder/layer_1/intermediate/dense/Sqrt__41_13 + (Unnamed Layer* 438) [Shuffle], text/bert/encoder/layer_1/intermediate/dense/truediv), text/bert/encoder/layer_1/intermediate/dense/Erf), text/bert/encoder/layer_1/intermediate/dense/add)), PWN(text/bert/encoder/layer_0/intermediate/dense/mul/x:0_15 + (Unnamed Layer* 445) [Shuffle], text/bert/encoder/layer_1/intermediate/dense/mul)), text/bert/encoder/layer_1/intermediate/dense/mul_1) reformatted input 0 (text/bert/encoder/layer_1/intermediate/dense/BiasAdd:0) from Half(1,3072) to Float(1,3072)

I don't know why. I have already set the output type and precision of all layers to float, except these:

    if network.get_layer(i).name.find("zeros_like/Const") != -1 \
            or network.get_layer(i).name.find("NotEqual/y") != -1 \
            or network.get_layer(i).name.find("const_fold_opt") != -1 \
            or network.get_layer(i).name.find("Concat__") != -1 \
            or network.get_layer(i).name.find("NotEqual__") != -1 \
            or network.get_layer(i).type == trt.LayerType.CONCATENATION \
            or network.get_layer(i).type == trt.LayerType.SHUFFLE \
            or network.get_layer(i).type == trt.LayerType.IDENTITY:
        continue
    ...

These layers cannot have their output type and precision set to float, because doing so reports errors such as the following:

    INFO:root:layer name = [(Unnamed Layer* 5) [Shuffle]], layer type = [LayerType.SHUFFLE] precision = [DataType.FLOAT]
    ...
    [TensorRT] ERROR: (Unnamed Layer* 5) [Shuffle]: cannot use precision Float for layer that computes indices
    [TensorRT] ERROR: Layer (Unnamed Layer* 5) [Shuffle] failed validation

thanks.

ttyio commented 3 years ago

Hello @chenzhanyiczy

could you use only network.get_layer(i).type as the filter condition? Filtering by network.get_layer(i).name seems risky to me, thanks!

chenzhanyiczy commented 3 years ago

@ttyio

could you use only network.get_layer(i).type as the filter condition?

yes, like this:

    if network.get_layer(i).name.find("NotEqual__") != -1 \
            or network.get_layer(i).type == trt.LayerType.CONSTANT \
            or network.get_layer(i).type == trt.LayerType.CONCATENATION \
            or network.get_layer(i).type == trt.LayerType.SHUFFLE \
            or network.get_layer(i).type == trt.LayerType.IDENTITY:
        continue
    ....

But the 'NotEqual__' filter cannot be replaced with the UNARY layer type, because, for example, the Erf function is also a UNARY layer.

ttyio commented 3 years ago

@chenzhanyiczy Could you elaborate on why we cannot force the unary layers to run in FP32 precision? Thanks

chenzhanyiczy commented 3 years ago

@ttyio

Could you elaborate on why we cannot force the unary layers to run in FP32 precision?

There seems to be no problem... let me take a look. Even with these set, under fp16 + strict_type + all layers (output type + precision), the result is still different from fp32. And when building, there are all these warnings; in other words, might an FP16 kernel still be selected?

[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/attention/output/dense/MatMul + text/bert/encoder/layer_2/attention/output/dense/bias__53 + (Unnamed Layer* 570) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/attention/output/dense/bias__53 + (Unnamed Layer* 570) [Shuffle] + text/bert/encoder/layer_2/attention/output/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/intermediate/dense/MatMul + text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + text/bert/encoder/layer_2/intermediate/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/output/dense/MatMul + text/bert/encoder/layer_2/output/dense/bias__60 + (Unnamed Layer* 630) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/output/dense/bias__60 + (Unnamed Layer* 630) [Shuffle] + text/bert/encoder/layer_2/output/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.

Can you use TRT to automatically build a bert-base model? No matter how I set things, there is always a big difference. Thank you.

ttyio commented 3 years ago

Hello @chenzhanyiczy, what metric do you use to check the accuracy? Do you have data like the SQuAD F1 value? Thanks

chenzhanyiczy commented 3 years ago

@ttyio

What metric do you use to check the accuracy?

If it is the example above, then the following files contain the values of this output:

@ttyio OK. The following files are for FP32 mode (the default behavior) and for FP16 mode + strict_type + all layers FP32 (precision and output type). Thanks. build_fp32_layer2_output_LayerNorm_moments_variance.tar.gz build_fp16_layer2_output_LayerNorm_moments_variance.tar.gz

No matter how I set things, FP16 + strict_type + all layers (output type + precision) vs FP32, the results always differ a lot.

Do you have data like the SQuAD F1 value?

We use BERT to generate embeddings; this involves algorithm-level metrics, so it's not easy to say. :) Have you tried to automatically build a bert-base model with TRT? I think it should be easier to reproduce. Thank you!

ttyio commented 3 years ago

Hello @chenzhanyiczy, I checked the internal TRT tests and found that the tolerance for TF BERT is

          rtol=1e-3, atol=1.5

The model we used in our test has no nodes named zeros_like, so there might be some difference from yours. Have you tried training your model in FP16? Thanks

chenzhanyiczy commented 3 years ago

@ttyio Can you share the code for how TRT builds it automatically? Is any layer changed to FP32? Thanks. Under FP16 (fp16 + strict_type + all layers (output type + precision)) we need a large tolerance, such as rtol=1e-2. The zeros_like layer is only for padding, so it should not affect the accuracy. Training in FP16 is more difficult...

ttyio commented 3 years ago

Hello @chenzhanyiczy The TRT test is simple: just use polygraphy to run the network with TRT FP16 and onnxruntime FP32; it does not cover any strict_type setting.
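For reference, a comparison of that kind can be run from the Polygraphy command line, roughly like this (the path is a placeholder and the tolerances are the ones quoted above):

    polygraphy run bert_base.onnx --trt --fp16 --onnxrt --rtol 1e-3 --atol 1.5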

mdztravelling commented 2 years ago

Has this problem been solved? I have the same problem: the FP16 and FP32 results differ a lot. I use TRT 8.0.3 and a 4-layer BERT. @ttyio @chenzhanyiczy

mdztravelling commented 2 years ago

I modified the skipln layer to use the float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio

def skipln(prefix, config, init_dict, network, input_tensor, skip, is_last_skipln=False):
    """ 
    Add the skip layer
    """
    hidden_size = config.hidden_size
    #dtype = config.get_trt_dtype()
    dtype = trt.float32    # modify here
   ...
yushcs commented 2 years ago

Same problem here, any suggestions?

zhaohb commented 2 years ago

@ttyio Hi, I also want to achieve TRT mixed precision.

I added the following settings:

'strict_types': trt.BuilderFlag.STRICT_TYPES,
'fp16': trt.BuilderFlag.FP16,

And I added the following code; can this realize the mixed precision setting?

        for i in range(network.num_layers):
            if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and network.get_layer(i).type != trt.LayerType.MATRIX_MULTIPLY and network.get_layer(i).type != trt.LayerType.SOFTMAX:
                network.get_layer(i).precision = trt.DataType.FLOAT
                network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)

Unfortunately, I encountered this error:

......
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32
)
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32
......

I think it's because the output of the op is DataType::kINT32, but I force it to DataType::FLOAT. How can this be avoided? Thank you very much.

nvpohanh commented 2 years ago

@zhaohb In your case, don't call set_output_type if layer.get_output_type(0) returns kINT32.
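A sketch of that guard applied to the loop above (variable names are illustrative):

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type in (trt.LayerType.FULLY_CONNECTED,
                          trt.LayerType.MATRIX_MULTIPLY,
                          trt.LayerType.SOFTMAX):
            continue                                   # leave these in FP16
        if layer.get_output_type(0) == trt.DataType.INT32:
            continue                                   # skip index-producing (kINT32) layers
        layer.precision = trt.DataType.FLOAT
        layer.set_output_type(0, trt.DataType.FLOAT)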

@chenzhanyiczy Could you try TRT 8.2/8.4 and see if the issue still exists? If it does, we will debug it. Thanks

nvpohanh commented 2 years ago

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks

ArtemisZGL commented 1 year ago

I modified the skipln layer to use the float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio

def skipln(prefix, config, init_dict, network, input_tensor, skip, is_last_skipln=False):
    """ 
    Add the skip layer
    """
    hidden_size = config.hidden_size
    #dtype = config.get_trt_dtype()
    dtype = trt.float32    # modify here
   ...

Hello, I met the same problem. Could you please explain what skipln is and where to modify this code?