Closed. dmenig closed this issue 2 years ago.
On Docker container 21.09, with a 3060 GPU, I see a small speedup:
Throughputs in qps
"Regular" 3d model :
FP16 : 47.1
INT8 : 56.4 (+19.6%)
This is still much smaller than expected.
Note that I verified that the 3060 GPU showed no speedup on previous images.
On Docker container 21.12, with the same 3060 GPU, I see a somewhat greater speedup, and an overall faster model:
Throughputs in qps
"Regular" 3d model :
FP16 : 60.8
INT8 : 77.8 (+28.0%)
This is still much smaller than expected (~+90 to +100%). Still no speedup on GPUs without Tensor Cores.
@hyperfraise Grouped convs are not friendly to INT8 Tensor Cores, so a 28% performance gain looks reasonable to me. For INT8 to be 2x of FP16 in performance, you need the number of channels (C and K) to be very large (say, >=1024). In grouped convs, the C and K for each group are usually only 32 or 64, which is not good for INT8 Tensor Cores.
If you are still interested in figuring out why the performance gain is so small, please share with me the verbose logs from while the engines are being built. Thanks.
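For reference, here is a minimal Python sketch to inspect how small C and K per group actually are in a torchvision ResNeXt (the model choice is just an illustration):

import torch
import torchvision

# List every grouped conv in the model with its per-group channel counts
model = torchvision.models.resnext101_32x8d().eval()
for name, m in model.named_modules():
    if isinstance(m, torch.nn.Conv2d) and m.groups > 1:
        print(f"{name}: groups={m.groups}, "
              f"C/group={m.in_channels // m.groups}, "
              f"K/group={m.out_channels // m.groups}")

For this model it prints per-group channel counts between 8 and 64 across the four stages, far below the >=1024 that INT8 Tensor Cores prefer.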
Hi. Thank you for your response. You are right that a grouped conv 2D model is the better comparison. Still, on 21.12, I see these results for torchvision.models.resnext101_32x8d:
python3
import torch
import torchvision

# Dummy batch of eight 3x224x224 images
dummy_input = torch.randn(8, 3, 224, 224).float().cuda()
model = torchvision.models.resnext101_32x8d().eval().cuda()

with torch.no_grad():
    torch.onnx.export(
        model,
        dummy_input,
        "resnet.onnx",
        verbose=True,
    )
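As an optional sanity check on the exported file (a sketch, assuming the onnx Python package is installed):

import onnx

# Load the exported graph and run the standard ONNX checker on it
m = onnx.load("resnet.onnx")
onnx.checker.check_model(m)
print(len(m.graph.node), "nodes, first op:", m.graph.node[0].op_type)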
On 2080 Ti:
Throughputs in qps
"Resnext" 2d model :
FP16 : 1363.064
INT8 : 2498.776 (+83.321%)
On 1080 Ti:
Throughputs in qps
"Resnext" 2d model :
FP16 : 190.4616
INT8 : 374.3784 (+96.564%)
I would argue this is still suspiciously high compared to what I get.
@hyperfraise Could you share trtexec logs with these flags? --verbose --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --noDataTransfers --useCudaGraph
This will enable very verbose logs that I can take a look at. Thanks.
Please provide these cases:
Here you go. Thank you in advance for your help!
I used ResNeXt-101 for 3D and 2D, ResNet-101 for 2D, and ResNet-18 for 3D, because that is all I have access to. Please tell me if that is a problem for your analysis.
fp16_2d_resnet_building_logs.txt fp16_2d_resnext_building_logs.txt fp16_3d_resnet_building_logs.txt fp16_3d_resnext_building_logs.txt int8_2d_resnet_building_logs.txt int8_2d_resnext_building_logs.txt int8_3d_resnet_building_logs.txt int8_3d_resnext_building_logs.txt
In summary:
The reason why 3D ResNeXt doesn't get the same speed-up is that it spends ~69% of the e2e runtime in the first 7x7x7 conv layer:
[05/23/2022-09:24:29] [I] Layer Time (ms) Avg. Time (ms) Time %
[05/23/2022-09:24:29] [I] Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1 26.94 0.4644 0.8
[05/23/2022-09:24:29] [I] Conv_0 + Relu_1 2283.06 39.3630 69.0
[05/23/2022-09:24:29] [I] MaxPool_2 104.35 1.7991 3.2
This layer will run in FP16 even when INT8 is enabled, because INT8 TensorCore kernels would have to pad the input channels from 3 to 32, which makes them slower than the FP16 TensorCore kernels. So you only get a speed-up on the other layers, which account for only ~30% of the e2e runtime.
In comparison, the first conv layer does not account for such a large portion of the e2e runtime in the 2D cases.
I hope this answers your question.
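For a rough sense of scale, here is a back-of-the-envelope Amdahl's-law sketch (treating the ~69% stem share from the profile above as the FP16 baseline share, and assuming an ideal 2x INT8 speedup on everything else; both are illustrative assumptions):

# Fraction of e2e runtime spent in the FP16-only stem (from the profile above)
stem_share = 0.69
# Assumed ideal INT8-vs-FP16 speedup on all remaining layers
rest_speedup = 2.0

# Amdahl's law: only the non-stem fraction can be accelerated
overall = 1.0 / (stem_share + (1.0 - stem_share) / rest_speedup)
print(f"best-case e2e speedup: {overall:.2f}x")  # -> ~1.18x

So even if everything outside the stem doubled in speed, the e2e gain would be capped far below 2x.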
Thanks for your response. I don't understand something: don't all the networks listed here have to do this padding in INT8? Why is it only that slow for the 3D ResNeXt?
@hyperfraise Yes, the first layer of the 2D networks also does not run in INT8, but its share of the e2e runtime is not that large. Consider this: in the 3D case, the first layer is a 7x7x7 conv, while in the 2D case, it is a 7x7 conv.
(7x7x7) / (3x3x3) = 343 / 27 ≈ 12.7x
(7x7) / (3x3) = 49 / 9 ≈ 5.4x
so you can see that the relative overhead of the first conv layer is much higher in the 3D case than in the 2D case.
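To make that concrete, here is a small sketch comparing stem-conv MAC counts; the input shapes are assumptions for illustration only, not taken from your models:

# 2D stem: Conv2d(3, 64, kernel=7, stride=2) on a 3x224x224 image -> 64x112x112
macs_2d = 7 * 7 * 3 * 64 * 112 * 112
# 3D stem: Conv3d(3, 64, kernel=7, stride=(1, 2, 2)) on an assumed
# 3x16x224x224 clip -> 64x16x112x112
macs_3d = 7 * 7 * 7 * 3 * 64 * 16 * 112 * 112

print(f"2D stem: {macs_2d / 1e9:.2f} GMACs")  # ~0.12 GMACs per image
print(f"3D stem: {macs_3d / 1e9:.2f} GMACs")  # ~13.22 GMACs per clip

All of that stem work stays in FP16 even when the rest of the network runs in INT8.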
Thank you for your help. This wasn't about TRT but about the architecture after all.
Description
I see no speedup between FP16 and INT8 on a grouped convolution model.
Environment
TensorRT Version: 7.2.2.3
NVIDIA GPU: 2080 Ti (but I haven't seen any GPU that doesn't have the same problem, and I have tested most recent GPUs)
NVIDIA Driver Version: 460.37
CUDA Version: 11.2
CUDNN Version: 8.1.0
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.8.1
Relevant Files
You can use an NVCR TensorRT container
Get the repository for the group convolutions 3D model
Export a simple resnext model
Steps To Reproduce
Use the script above to generate an ONNX model, then optimize it with this command:
Speed test results are the following:
So the INT8 quantization optimization works for regular 3D convolutions but not for this specific model.
The expected INT8 speedup on any model should be about 2x, am I right?
I believe that INT8 quantization of this kind of convolution might still not be supported in TensorRT, according to https://forums.developer.nvidia.com/t/does-tensorrt-support-conv3d-with-tensor-core/113193/9. If this is what is causing the problem, could you please open a feature request so that it gets addressed when you have the time?