NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

No speedup in INT8 on 3D architecture with grouped convolutions #1409

Closed: dmenig closed this issue 2 years ago

dmenig commented 3 years ago

Description

I see no speedup between FP16 and INT8 on a grouped-convolution model.

Environment

TensorRT Version: 7.2.2.3
NVIDIA GPU: 2080 Ti (but I haven't seen any GPU that doesn't have the same problem, and I've tested most of the recent GPUs)
NVIDIA Driver Version: 460.37
CUDA Version: 11.2
CUDNN Version: 8.1.0
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.8.1

Relevant Files

You can use an NVCR TensorRT container:

nvidia-docker run -it nvcr.io/nvidia/tensorrt:21.03-py3 /bin/bash

Clone the repository for the grouped-convolution 3D model:

git clone https://github.com/kenshohara/3D-ResNets-PyTorch.git
cd 3D-ResNets-PyTorch 
git checkout 35640f358c05904ab53816d8da0f2d968b4b3038 # current master is broken
pip install torch torchvision sklearn

Export a simple ResNeXt model:

python3
import torch
import torchvision
from models import resnext

dummy_input = torch.randn(8, 3, 35, 224, 224).float().cuda()

## Pick one of the two models below (if both lines are run, the second assignment overwrites the first).

## Regular 3D model
model = torchvision.models.video.r2plus1d_18().eval().cuda()

## ResNeXt 3D model (grouped convolutions)
model = resnext.resnet101(sample_size=224, sample_duration=35).eval().cuda()

with torch.no_grad():
    torch.onnx.export(
        model,
        dummy_input,
        "resnet.onnx",
        verbose=True,
    )
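
If you want to double-check the export before handing it to trtexec, a quick sanity check is sketched below (this assumes the onnx Python package is installed; it is not part of the original repro steps):

import onnx

# Load the exported graph and let the checker raise if it is malformed.
onnx_model = onnx.load("resnet.onnx")
onnx.checker.check_model(onnx_model)

# Confirm the graph input matches the dummy input shape used for the export.
print(onnx_model.graph.input[0])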

Steps To Reproduce

Use the script above to generate an ONNX model, then build optimized engines with these commands:

# FP16 optimization:
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --fp16 --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw

# INT8 (quantization) optimization:
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --allowGPUFallback --outputIOFormats=fp32:chw

The speed test results are as follows:

GPU latencies in ms
"Regular" 3d model :
FP16 : 17.1
INT8 : 9.28

Resnext 3d model :
FP16 : 11.4
INT8 : 11.5

So INT8 quantization works for regular 3D convolutions but not for this specific model.

The expected INT8 speedup on any model should be about 2x, am I right?

I believe that INT8 quantization of this kind of convolution might still not be supported in TensorRT, according to https://forums.developer.nvidia.com/t/does-tensorrt-support-conv3d-with-tensor-core/113193/9. If this is what is causing the problem, could you please open a feature request so that it gets addressed when you have the time?

dmenig commented 2 years ago

On the 21.09 docker container, with a 3060 GPU, I see a small speedup:

Throughputs in qps
"Regular" 3d model :
FP16 : 47.1
INT8 : 56.4 (+19.6%)

This is still much smaller than expected.

Note that I verified that the 3060 GPU showed no speedup on previous images.

dmenig commented 2 years ago

On the 21.12 docker container, with a 3060 GPU, I see a somewhat greater speedup, and an overall faster model:

Throughputs in qps
"Regular" 3d model :
FP16 : 60.8
INT8 : 77.8 (+28.0%)

This is still much smaller than expected (~+90 to +100%). There is still no speedup on GPUs without tensor cores.

nvpohanh commented 2 years ago

@hyperfraise Group convs are not friendly to INT8 tensor cores, so a 28% performance gain looks reasonable to me. For INT8 to be 2x FP16 in performance, you need the number of channels (C and K) to be very large (say, >= 1024). In group convs, the C and K for each group are usually only 32 or 64, which is not good for INT8 tensor cores.
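
As a minimal sketch of what "per-group C and K" means (the channel counts below are illustrative ResNeXt-style numbers with cardinality 32, not values read from the actual engine):

def per_group_channels(c_in, c_out, groups):
    # Channels that each group's convolution actually operates on.
    return c_in // groups, c_out // groups

# (name, total input channels, total output channels, groups)
layers = [
    ("dense 3x3x3 conv",   1024, 1024, 1),   # ordinary conv: large C and K
    ("grouped 3x3x3 conv", 1024, 1024, 32),  # ResNeXt-style block, cardinality 32
]

for name, c_in, c_out, g in layers:
    c, k = per_group_channels(c_in, c_out, g)
    print(f"{name}: total C/K = {c_in}/{c_out}, per-group C/K = {c}/{k}")

The dense conv keeps C = K = 1024 per matrix multiply, while the grouped conv only exposes 32 channels per group, far below the channel counts where INT8 tensor cores clearly outpace FP16.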

If you are still interested in figuring out why the performance gain is so small, please share the verbose logs from when the engines are being built. Thanks.

dmenig commented 2 years ago

Hi. Thank you for your response. You are right that a grouped-conv 2D model seems the better comparison. Still, in 21.12, I see these results for torchvision.models.resnext101_32x8d:

python3
import torch
import torchvision

dummy_input = torch.randn(8, 3, 224, 224).float().cuda()

model = torchvision.models.resnext101_32x8d().eval().cuda()

with torch.no_grad():
    torch.onnx.export(
        model,
        dummy_input,
        "resnet.onnx",
        verbose=True,
    )

On 2080 Ti

Throughputs in qps
"Resnext" 2d model :
FP16 : 1363.064
INT8 : 2498.776 (+83.321%)

On 1080 Ti:

Throughputs in qps
"Resnext" 2d model :
FP16 : 190.4616
INT8 : 374.3784 (+96.564%)

I would argue that this is still suspiciously high compared to what I get with the 3D model.

nvpohanh commented 2 years ago

@hyperfraise Could you share trtexec logs with these flags?

--verbose --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --noDataTransfers --useCudaGraph

This will enable very verbose logs that I can take a look at. Thanks.

Please provide these cases:

dmenig commented 2 years ago

Here you go. Thank you in advance for your help!

I used ResNeXt-101 for 3D and 2D, ResNet-101 for 2D, and ResNet-18 for 3D, because that is all I have access to. Please tell me if that is a problem for your analysis.

fp16_2d_resnet_building_logs.txt fp16_2d_resnext_building_logs.txt fp16_3d_resnet_building_logs.txt fp16_3d_resnext_building_logs.txt int8_2d_resnet_building_logs.txt int8_2d_resnext_building_logs.txt int8_3d_resnet_building_logs.txt int8_3d_resnext_building_logs.txt

nvpohanh commented 2 years ago

In summary:

nvpohanh commented 2 years ago

The reason why 3D ResNeXt doesn't get the same speed-up is that it spends ~69% of the e2e runtime in the first 7x7x7 Conv layer:

[05/23/2022-09:24:29] [I]                                                                        Layer   Time (ms)   Avg. Time (ms)   Time %
[05/23/2022-09:24:29] [I]                  Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1       26.94           0.4644      0.8
[05/23/2022-09:24:29] [I]                                                              Conv_0 + Relu_1     2283.06          39.3630     69.0
[05/23/2022-09:24:29] [I]                                                                    MaxPool_2      104.35           1.7991      3.2

This layer will run in FP16 even when INT8 is enabled, because the INT8 TensorCore kernels would have to pad the input channels from 3 to 32, which makes them slower than the FP16 TensorCore kernels. So you only get a speed-up on the other layers, which account for only ~30% of the e2e runtime.

In comparison, the first conv layer does not take up as large a portion of the e2e runtime in the 2D cases.

I hope this answers your question.
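
As a rough back-of-the-envelope check (assuming, hypothetically, that the layers that do run in INT8 get a clean 2x speedup while the first conv stays in FP16), the ~69% figure above already caps the achievable end-to-end gain well below 2x:

# Amdahl-style estimate based on the layer profile above.
fp16_fraction = 0.69      # first 7x7x7 conv stays in FP16
int8_fraction = 1.0 - fp16_fraction
int8_layer_speedup = 2.0  # assumed best case for the layers that do run in INT8

e2e_speedup = 1.0 / (fp16_fraction + int8_fraction / int8_layer_speedup)
print(f"best-case end-to-end speedup: {e2e_speedup:.2f}x")  # roughly 1.2x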

dmenig commented 2 years ago

Thanks for your response. I don't understand something: don't all the networks listed here have to do this padding in INT8? Why is it only that slow for the 3D ResNeXt?

nvpohanh commented 2 years ago

@hyperfraise Yes, the first layer of the 2D networks also does not run in INT8, but its portion of the e2e runtime is not that large. Consider this: in the 3D case the first layer is a 7x7x7 conv, while in the 2D case it is a 7x7 conv.

(7x7x7) / (3x3x3) = 343 / 27 ≈ 12.7x
(7x7) / (3x3) = 49 / 9 ≈ 5.4x

so you can see that the relative overhead of the first conv layer is much higher in the 3D case than in the 2D case.
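
As a small sketch combining this with the channel-padding point above (the kernel sizes and the 3-to-32 padding are taken from this thread; treating the padding as pure wasted work is a simplifying assumption):

# Relative cost of a 7x7(x7) stem conv versus a 3x3(x3) body conv, per output element.
print(f"3D stem/body kernel volume: {7**3 / 3**3:.1f}x")  # 343 / 27 ~= 12.7x
print(f"2D stem/body kernel volume: {7**2 / 3**2:.1f}x")  # 49 / 9   ~= 5.4x

# Extra work when the INT8 kernels pad the 3 RGB input channels of the stem conv up to 32.
print(f"INT8 stem-conv channel padding overhead: {32 / 3:.1f}x")  # ~= 10.7x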

dmenig commented 2 years ago

Thank you for your help. This wasn't about TRT but about the architecture after all.