NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Int8 mode is slower than fp16 #993

Closed · ye1024 closed this issue 3 years ago

ye1024 commented 3 years ago

Hi, I took the token embedding layer out of BERT and built a TensorRT engine to test inference in INT8 mode, but found that INT8 mode is slower than FP16. I used nvprof to view the GPU time of the two modes, as follows:

FP16:

    GPU activities:  99.87%  22.158ms  6  3.6930ms  1.7280us  22.148ms  [CUDA memcpy HtoD]
                      0.06%  13.376us  8  1.6720us  1.6000us  1.9520us  [CUDA memset]
                      0.05%  10.688us  1  10.688us  10.688us  10.688us  void cuGatherLayer::gatherGeneric<float, int=32>(void, cuGatherLayer::StrideArray, cuGatherLayer::gatherGeneric<float, int=32>, void, int, void, cuGatherLayer::ShapeArray, int, int, int, cuGatherLayer::ReducedDivisorArray, int, int, int, int, cuGatherLayer::CoefficientData, cuGatherLayer::CoefficientIndices)
                      0.02%  4.1600us  1  4.1600us  4.1600us  4.1600us  [CUDA memcpy DtoH]
                      0.01%  1.6320us  1  1.6320us  1.6320us  1.6320us  [CUDA memcpy DtoD]

INT8:

    GPU activities:  99.84%  20.210ms  6  3.3683ms  1.6950us  20.201ms  [CUDA memcpy HtoD]
                      0.07%  13.536us  8  1.6920us  1.6000us  1.9840us  [CUDA memset]
                      0.07%  13.311us  1  13.311us  13.311us  13.311us  void cuGatherLayer::gatherAxisZeroPartition<float, int=64, int=256>(void, cuGatherLayer::StrideArray, cuGatherLayer::gatherAxisZeroPartition<float, int=64, int=256>, void, int, void, cuGatherLayer::ShapeArray, int, int, int, cuGatherLayer::ReducedDivisorArray, cuGatherLayer::ShapeArray, cuGatherLayer::ShapeArray, int, int, int, int, int, int, nvinfer1::rt::reduced_divisor)
                      0.02%  3.7120us  1  3.7120us  3.7120us  3.7120us  [CUDA memcpy DtoH]
                      0.01%  1.7280us  1  1.7280us  1.7280us  1.7280us  [CUDA memcpy DtoD]

I want to know if there is something wrong with int8 quantization. Thanks!

TensorRT Version: 6.0.1.5
GPU Type: V100
Nvidia Driver Version: 418.39
CUDA Version: 10.1
Operating System: Ubuntu 18.04

ttyio commented 3 years ago

Hello @ye1024, thanks for reporting. V100 has native INT8 and FP16, but the Tensor Cores on this chip only support FP16, so INT8 does not always give a benefit. See the public BERT performance numbers on V100 (https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT#inference-performance-nvidia-v100-16gb), where FP16 is slightly faster.

So I suggest enabling both FP16 and INT8 in the builder config to allow mixed precision on V100, and letting TRT select the best kernel for your model. Also try RTX GPUs with INT8 Tensor Cores if possible.
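
For example, a minimal mixed-precision setup from Python might look like this (API names follow the TRT 7/8 Python API; on TRT 6 the equivalent switches are `builder.fp16_mode` / `builder.int8_mode`, and the calibrator is only needed for the PTQ path):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Enable both precisions so TensorRT can pick the fastest kernel per layer.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # placeholder: needed for PTQ INT8

# ... populate `network` (e.g. via an ONNX parser), then build:
engine = builder.build_engine(network, config)
```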

ye1024 commented 3 years ago

> Hello @ye1024, thanks for reporting. V100 has native INT8 and FP16, but the Tensor Cores on this chip only support FP16, so INT8 does not always give a benefit. See the public BERT performance numbers on V100 (https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT#inference-performance-nvidia-v100-16gb), where FP16 is slightly faster.
>
> So I suggest enabling both FP16 and INT8 in the builder config to allow mixed precision on V100, and letting TRT select the best kernel for your model. Also try RTX GPUs with INT8 Tensor Cores if possible.

Thanks for the reply.

I will try setting mixed precision. In addition, I want to know whether the Tensor Cores on the T4 support INT8.

ttyio commented 3 years ago

@ye1024 Yes, the T4 supports INT8 Tensor Cores.

ye1024 commented 3 years ago

> @ye1024 Yes, the T4 supports INT8 Tensor Cores.

Thank you for your answer; it's very helpful. I'd also like to know: if we skip INT8 quantization-aware training and instead use a calibrator for INT8 quantization, will that slow down inference, or only cost accuracy?

Bonsen commented 3 years ago

@ye1024 I also have this question! I create the calibrator in Python and then set the calibrator in the config in C++. Do you know how this works? @ttyio

ttyio commented 3 years ago

Hello @ye1024 @Bonsen, sorry I missed this thread. For PTQ we quantize all layer outputs by default, and the user needs to set mixed precision manually if they see an accuracy drop. So you can think of PTQ as considering only performance by default.

For QAT, we only quantize the inputs of layers that are known to be quantizable in most cases, so you can think of QAT as considering both accuracy and performance.

However, it is not guaranteed that PTQ has lower accuracy than QAT in all cases. So my suggestion is: if fine-tuning is not hard, try both QAT and PTQ and choose the better one.
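
For reference, a minimal PTQ calibrator sketch in Python (this assumes the `IInt8EntropyCalibrator2` interface and pycuda for the device buffer; `batches` and `input_nbytes` are placeholders you would supply):

```python
import os

import numpy as np
import pycuda.autoinit  # creates a CUDA context (required for mem_alloc)
import pycuda.driver as cuda
import tensorrt as trt


class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches to TensorRT and caches the computed scales."""

    def __init__(self, batches, batch_size, input_nbytes, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)            # iterable of numpy input batches
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(input_nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            data = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                         # no more calibration data
        cuda.memcpy_htod(self.device_input, data)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator instance is then assigned to the builder config (e.g. `config.int8_calibrator = EntropyCalibrator(...)`) before building the engine.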

Bonsen commented 3 years ago

THANKS

twmht commented 3 years ago

@ttyio

I also found that QAT is slower than FP16 and PTQ.

So how do I force QAT to quantize every layer's output? And why do we get better speed when every layer's output is quantized? For example, say we have the following graph:

conv1->relu1->conv2

With QAT we only quantize the inputs of conv1 and conv2, but PTQ also quantizes the input of relu1 (that is, conv1's output)?

ttyio commented 3 years ago

@twmht, for the simple conv->relu->conv pattern there is no difference between PTQ and QAT. The difference comes from more complex patterns, e.g. the residual connections in ResNet; see https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html#further-optimization
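
As a rough illustration of that "further optimization" step (a simplified sketch inspired by the tutorial, not copied from it; the block structure is made up):

```python
import torch
from pytorch_quantization import nn as quant_nn


class QuantResidualBlock(torch.nn.Module):
    """Residual block with an extra quantizer on the skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv = quant_nn.QuantConv2d(channels, channels, 3, padding=1)
        self.relu = torch.nn.ReLU()
        # Quantize the residual input too, so the elementwise add can be fused
        # and run int8-in-int8-out instead of forcing a reformat to FP32.
        self.residual_quantizer = quant_nn.TensorQuantizer(
            quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x):
        out = self.relu(self.conv(x))
        return out + self.residual_quantizer(x)
```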

twmht commented 3 years ago

@ttyio

So if we have conv->relu plus a residual input, i.e. (conv->relu) + residual, then according to the document you referenced, do we need to quantize the output of relu in order to fuse the elementwise operator? That is, quantize(conv)->quantize(relu) + quantize(residual)?

> Here the fused operator's output precision must match the residual input precision.

What if the precisions don't match? What would TensorRT do?

twmht commented 3 years ago

@ttyio I also found that if my graph's activation function is leaky ReLU, e.g. conv1->leaky_relu->conv2, TensorRT says conv2 is missing quantization data.

The TensorRT log prints the following messages:

[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_2 due to missing int8 scales for tensor 427 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_3 due to missing int8 scales for tensor 427 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_6 due to missing int8 scales for tensor 296 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_7 due to missing int8 scales for tensor 430 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_10 due to missing int8 scales for tensor 301 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_11 due to missing int8 scales for tensor 433 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_14 due to missing int8 scales for tensor 436 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_15 due to missing int8 scales for tensor 436 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_18 due to missing int8 scales for tensor 311 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_19 due to missing int8 scales for tensor 439 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_22 due to missing int8 scales for tensor 316 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_23 due to missing int8 scales for tensor 442 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_26 due to missing int8 scales for tensor 445 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_27 due to missing int8 scales for tensor 445 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_30 due to missing int8 scales for tensor 326 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_31 due to missing int8 scales for tensor 448 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_34 due to missing int8 scales for tensor 331 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_35 due to missing int8 scales for tensor 451 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_38 due to missing int8 scales for tensor 454 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_39 due to missing int8 scales for tensor 454 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_42 due to missing int8 scales for tensor 341 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_46 due to missing int8 scales for tensor 346 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_47 due to missing int8 scales for tensor 460 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_50 due to missing int8 scales for tensor 463 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_51 due to missing int8 scales for tensor 463 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_54 due to missing int8 scales for tensor 356 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_55 due to missing int8 scales for tensor 466 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_58 due to missing int8 scales for tensor 361 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_59 due to missing int8 scales for tensor 469 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_62 due to missing int8 scales for tensor 472 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_63 due to missing int8 scales for tensor 472 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_66 due to missing int8 scales for tensor 371 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_67 due to missing int8 scales for tensor 475 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_70 due to missing int8 scales for tensor 376 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_71 due to missing int8 scales for tensor 478 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_74 due to missing int8 scales for tensor 481 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_75 due to missing int8 scales for tensor 481 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_78 due to missing int8 scales for tensor 386 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_79 due to missing int8 scales for tensor 484 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_82 due to missing int8 scales for tensor 391 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_83 due to missing int8 scales for tensor 487 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_86 due to missing int8 scales for tensor 490 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_87 due to missing int8 scales for tensor 490 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_90 due to missing int8 scales for tensor 401 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_94 due to missing int8 scales for tensor 406 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_95 due to missing int8 scales for tensor 496 at input index 0
[TensorRT] WARNING: Rejecting some int8 implementation of layer Conv_98 due to missing int8 scales for tensor 499 at output index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_99 due to missing int8 scales for tensor 499 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_102 due to missing int8 scales for tensor 416 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer LeakyRelu_103 due to missing int8 scales for tensor 502 at input index 0
[TensorRT] VERBOSE: Rejecting some int8 implementation of layer Conv_106 due to missing int8 scales for tensor 421 at input index 0
[TensorRT] VERBOSE: *************** Autotuning Reformat:Float(695040,231680,362,1) -> Float(695040,1,1086,3) ***************

and for the tensors feeding into leaky_relu, TensorRT prints the following messages:

[TensorRT] WARNING: Missing scale and zero-point for tensor 427, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[TensorRT] WARNING: Missing scale and zero-point for tensor 296, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[TensorRT] WARNING: Missing scale and zero-point for tensor 430, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[TensorRT] WARNING: Missing scale and zero-point for tensor 301, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[TensorRT] WARNING: Missing scale and zero-point for tensor 433, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[TensorRT] WARNING: Missing scale and zero-point for tensor 436, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[TensorRT] WARNING: Missing scale and zero-point for tensor 311, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
...

ttyio commented 3 years ago

@twmht for the first question: there is no need to quantize the relu here.

For the second question: when we have quant(conv1) + relu, conv1 and relu are fused together and the fused kernel runs int8-in-fp32-out. Only when there is a successor conv that is also quantized, e.g. quant(conv1) + relu + quant(conv2), does the fused conv1+relu run int8-in-int8-out. That is why the conv2 here cannot run int8-in-int8-out.

twmht commented 3 years ago

@ttyio

For the leaky_relu case I posted, I did not use any residual connection; it's a plain MobileNetV1, just with leaky_relu.

If I change leaky_relu to relu6, the messages go away. Does this mean I have to quantize leaky_relu here, since TensorRT won't fuse them?

ttyio commented 3 years ago

@twmht,

So there are quantization nodes after the leaky_relu node, according to your description.

For conv+relu we have a static fused kernel, while for conv+leaky_relu the fused kernel is generated on the fly. When that fusion fails, the conv can only run int8-in-fp32-out, which is why you see the warning; you can get rid of it by adding quantization to the leaky_relu.

What are the CUDA and TRT versions here?
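
If you do add quantization after the leaky_relu, a rough sketch with the pytorch-quantization toolkit could look like the following (module and parameter names here are illustrative, not taken from your model):

```python
import torch
from pytorch_quantization import nn as quant_nn


class QuantConvLeaky(torch.nn.Module):
    """conv -> leaky_relu with the activation output quantized as well."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = quant_nn.QuantConv2d(c_in, c_out, 3, padding=1)
        self.act = torch.nn.LeakyReLU(0.1)
        # Quantizing the activation output gives the next conv an INT8 input
        # even when the conv+leaky_relu runtime fusion is unavailable.
        self.act_quantizer = quant_nn.TensorQuantizer(
            quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x):
        return self.act_quantizer(self.act(self.conv(x)))
```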

twmht commented 3 years ago

@ttyio

I use JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) on a Jetson NX.

Let me clarify my question: I have quant(conv1)->leaky_relu->quant(conv2), and this graph is even slower than FP16. That does not seem reasonable, since conv1 and conv2 should run with INT8 implementations; yet according to the log, conv2 does not run in INT8. Are there post-training quantization rules we need to follow, or is this a bug?

My expectation is the following:

fp32 input->quant int8->conv1->fp32->leaky_relu->fp32->quant int8->conv2

Which part is wrong?

ttyio commented 3 years ago

@twmht,

So the question is not about the warning after replacing relu with leaky_relu (that is mainly because the runtime fusion is not available in CUDA 10.2); the question is about conv2, right?

Could you enable the verbose log and post it here? I want to check which precision conv2 runs in; it should be either int8-in-fp32-out or int8-in-int8-out depending on the successor nodes. Thanks!

twmht commented 3 years ago

@ttyio

Here (https://forums.developer.nvidia.com/t/post-quantization-aware-training-is-slower-than-fp16-and-post-quantization/190019/6) are the full ONNX model and the full trtexec log.

The network above is from here (https://github.com/biubug6/Pytorch_Retinaface).

I have now tried to narrow down the problem by exporting only the backbone (MobileNet 0.25).

Here (https://drive.google.com/drive/folders/1F6j1CiXDFYS87-7U7cVDz40FzopKnBRV?usp=sharing) are the backbone-only ONNX files and the trtexec logs with verbose output.

You can see that the QPS of FP16 is double that of INT8, where epoch_15_leaky.onnx is the quantized ONNX model and epoch_250.onnx is the non-quantized one.

What confuses me is: what is the performance difference between int8-in-fp32-out and int8-in-int8-out? In my view most of the computation is already saved by the INT8 convolution, so does it matter whether the output is fp32 or int8?

ttyio commented 3 years ago

@twmht, I see: we selected the int8-in-fp32-out kernel. This is expected, because QAT is an explicit-precision mode; when you apply quantization to a conv, it has to run in the required precision even if that is slower.

The main perf gap comes from the reformat layers that convert between fp32 and int8 (you can find many Layer(Reformat) entries in the QAT INT8 log). When a kernel is int8-in-int8-out, the output layout of the previous kernel matches the input layout of the next one, but we have to insert extra reformats when it is int8-in-fp32-out.

You can check the layer-wise time by adding --separateProfileRun --dumpProfile to the trtexec command.
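
If you drive TensorRT from Python instead of trtexec, a small `IProfiler` subclass gives a similar per-layer breakdown (a minimal sketch, assuming an already-built `engine`):

```python
import tensorrt as trt


class LayerTimer(trt.IProfiler):
    """Accumulates the per-layer execution time reported by TensorRT."""

    def __init__(self):
        super().__init__()
        self.times_ms = {}

    def report_layer_time(self, layer_name, ms):
        self.times_ms[layer_name] = self.times_ms.get(layer_name, 0.0) + ms


# Usage sketch:
# context = engine.create_execution_context()
# context.profiler = LayerTimer()
# ... run inference, then inspect context.profiler.times_ms;
# the Reformat layers show up explicitly in this breakdown.
```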

twmht commented 3 years ago

@ttyio

> when you apply quantization to a conv, it has to run in the required precision even if that is slower.

So you mean that sometimes a quantized convolution may be slower than the non-quantized version? From the log, how did you know it is int8-in-fp32-out?

Is it from something like this?

[09/27/2021-14:45:56] [V] [TRT] DequantizeLinear_53 [DequantizeLinear] outputs: [465 -> (1, 16, 160, 91)[FLOAT]], 

ttyio commented 3 years ago

@twmht

Grep for "Engine Layer Information:" in your log; after that you will find:

Layer(CaskConvolution): body.stage1.1.3.weight + QuantizeLinear_31_quantize_scale_node + Conv_33, Tactic: -3908975881807046106, Reformatted Input Tensor 0 to body.stage1.1.3.weight + QuantizeLinear_31_quantize_scale_node + Conv_33[Int8(1,8,320,181)] -> 444[Float(1,16,320,181)]

> So you mean that sometimes a quantized convolution may be slower than the non-quantized version?

Yes, and in your case the limitation mainly comes from the CUDA version: we can do the runtime fusion for conv and leaky_relu only with CUDA 11.0 and later. When they are fused together, it is an int8-in-int8-out kernel with no extra reformat.

twmht commented 3 years ago

@ttyio

Thank you. I tried quantizing every leaky_relu layer through the TensorRT API to see whether any reformat nodes are still inserted. The good news is that most convs are now int8-in and int8-out, except for the last output node. The bad news is that this version is still slower than FP16. Is this due to explicit precision? How can I let TensorRT select the best precision to run in?

Here is how I use the TensorRT API to set the dynamic range for every input tensor. Please point out any mistake I made:

    with open(args.calib_table, 'rb') as f:
        calib_config = pickle.load(f)
        print(calib_config.keys())
        print(len(calib_config.keys()))
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name not in calib_config:
            #  print(f'not found and set !!!!!!!!!!!!!{layer.name}')
            if 'Relu' in layer.name:
                layer.precision = trt.int8
                in_tensor = layer.get_input(0)
                in_tensor.dynamic_range = (-5, 5)
            continue
        #  print(f'amax = {calib_config[layer.name]}, name={layer.name}')
        amax = calib_config[layer.name]
        layer.precision = trt.int8
        in_tensor = layer.get_input(0)
        in_tensor.dynamic_range = (-amax, amax)
        print(f'set {layer.name} to ({-amax}, {amax})')

twmht commented 3 years ago

OK, I may have found something useful: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#optimize-performance

twmht commented 3 years ago

@ttyio By the way, I am wondering why TensorRT provides the pytorch-quantization toolkit (https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html). According to the optimization rules, if the node following a conv cannot be fused into the conv, then the conv outputs fp32. With PTQ, every layer is quantized, so there is no need to worry about reformat nodes.

So why not release PTQ as a toolkit and fine-tune the model based on PTQ's calibration data?

ttyio commented 3 years ago

@twmht, the code is right, but with QAT we change the model in PyTorch so that the quantization effects are also seen during fine-tuning, which gives better accuracy; you can still refer to the ResNet example.

As for why we introduced QAT: it is for accuracy. When PTQ does not meet your accuracy requirement, try QAT. We have a white paper that explains the details: http://arxiv.org/abs/2004.09602
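
For anyone following along, the end-to-end toolkit workflow is roughly the following (a condensed sketch of the user-guide flow; `build_model`, `calibrate`, and `finetune` stand in for your own code):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Replace torch.nn layers with quantized counterparts so fake-quant (Q/DQ)
# nodes are present during fine-tuning.
quant_modules.initialize()

model = build_model().cuda()       # placeholder: construct the network
calibrate(model, calib_loader)     # placeholder: PTQ step that sets amax values
finetune(model, train_loader)      # placeholder: short QAT fine-tuning

# Export with QuantizeLinear/DequantizeLinear nodes that TensorRT can parse.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "model_qat.onnx", opset_version=13)
```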

WeixiangXu commented 1 year ago

@ttyio

> With PTQ, every layer is quantized, so there is no need to worry about reformat nodes. So why not release PTQ as a toolkit and fine-tune the model based on PTQ's calibration data?

Hi ttyio, I have the same question. In PTQ, Q/DQ nodes are inserted at all layers, so we do not have to worry about precision reformats. However, in QAT, Q/DQ nodes are inserted only at specific layers like conv and linear. Can we do QAT by inserting Q/DQ nodes at all layers (just like PTQ)?

ttyio commented 1 year ago

> @ttyio
>
> > With PTQ, every layer is quantized, so there is no need to worry about reformat nodes. So why not release PTQ as a toolkit and fine-tune the model based on PTQ's calibration data?
>
> Hi ttyio, I have the same question. In PTQ, Q/DQ nodes are inserted at all layers, so we do not have to worry about precision reformats. However, in QAT, Q/DQ nodes are inserted only at specific layers like conv and linear. Can we do QAT by inserting Q/DQ nodes at all layers (just like PTQ)?

Hi @WeixiangXu, the answer is that TRT relies on the Q/DQ node placement for explicit-precision networks. To understand this, keep four different concepts apart: PTQ, QAT, implicit precision, and explicit precision.

TRT calibration is actually PTQ + implicit precision. With the pytorch-quantization toolkit, you are using either PTQ + explicit precision or QAT + explicit precision, and we rely on the Q/DQ placement because of the explicit precision.