NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

3D layers show no speedup in INT8 on cuda cores #1176

Closed dmenig closed 2 years ago

dmenig commented 3 years ago

Hi. I'm not able to quantize 3d convolution layers. Is there any plan to add support for 3d layers to TensorRT quantization?

ttyio commented 3 years ago

Hello @hyperfraise, are you calibrating in TRT, or are you using the nvidia pytorch-quantization tools? Could you elaborate on the failure you hit when trying to quantize 3d conv?
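
To clarify what I mean by calibrating in TRT: you implement an INT8 calibrator and attach it to the builder config. A rough sketch with the Python API (the batch source, batch size, and file names here are only illustrative, not a definitive recipe):

import pycuda.autoinit  # noqa: F401, creates a CUDA context
import pycuda.driver as cuda
import numpy as np
import tensorrt as trt

# Illustrative INT8 entropy calibrator: feeds host batches to the builder.
class MyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)   # iterable of contiguous float32 numpy arrays
        self.cache_file = cache_file
        self.device_input = None       # allocated lazily on the first batch

    def get_batch_size(self):
        return 8

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                # signals the end of calibration
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# The calibrator is then attached to the builder config before building:
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = MyCalibrator(my_batches)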

dmenig commented 3 years ago

3d layers are in fact not supported by TensorRT in INT8 precision by design right now. I don't think there is much more to detail for this issue than asking when it will be available: https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#layers-precision-matrix

Or are you saying you guys aren't actually developing TensorRT's quantization tools, but instead nvidia-pytorch's?

ttyio commented 3 years ago

@hyperfraise oops, this doc seems out of date; we support INT8 3d conv kernels in TRT 7.x. Have you hit any issue? I will ask for the documentation to be updated, thanks.

We own and develop both the quantization in TRT and the nvidia-pytorch tools.

dmenig commented 3 years ago

Oh ok. Well then my results are pretty weird.

Nvidia driver: 460.39
OS: Ubuntu 20.04
GPU: 2080 Ti

I'm optimizing 3d and 2d resnets to show you this weird discrepancy:

import torch
import torchvision

## 2d code
dummy_input = torch.randn(8, 3, 224, 224).float().cuda()
model = torchvision.models.resnet101().cuda().eval()

## 3d code
# model = torchvision.models.video.r2plus1d_18().cuda().eval()
# dummy_input = torch.randn(8, 3, 35, 224, 224).float().cuda()

with torch.no_grad():
    torch.onnx.export(
        model,
        dummy_input,
        "resnet.onnx",
        verbose=True,
    )

Then I optimize those models with different versions of TensorRT and measure the speedup. My commands are the following:

# FP32 optimization (FP32 is the default, so no precision flag is needed):
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw

# FP16 optimization:
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --fp16 --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw

# INT8 (quantization) optimization (--best lets TRT choose among FP32/FP16/INT8 tactics):
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --allowGPUFallback --outputIOFormats=fp32:chw

And then I do a speed test in Python. Here are my results (the numbers are samples/s at the input size above) on a 2080 Ti.
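
(For reference, the speed test is roughly the following; a minimal sketch assuming the engine is deserialized with the TensorRT Python API and pycuda, with input buffers left uninitialized since only throughput is measured:)

import time
import pycuda.autoinit  # noqa: F401, creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("resnet.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate host/device buffers for every binding (inputs and outputs).
bindings = []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host_buf = cuda.pagelocked_empty(trt.volume(shape), dtype)
    dev_buf = cuda.mem_alloc(host_buf.nbytes)
    bindings.append(int(dev_buf))

stream = cuda.Stream()
batch = 8  # matches the dummy input used for the ONNX export

# Warm up, then time a fixed number of iterations.
for _ in range(10):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()

n_iters = 100
start = time.time()
for _ in range(n_iters):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()
elapsed = time.time() - start
print(f"{n_iters * batch / elapsed:.1f} samples/s")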

On TensorRT 7.1.2 (docker image 20.06 on nvcr: https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_20-06.html#rel_20-06)

          INT8    FP16    FP32
2d        3610    2050    640
3d        11.0    11.0    7.05

And here are the sizes of the saved .trt models in MB:

          INT8    FP16    FP32
2d        87      86      295
3d        81      81      121

In addition, in the optimization logs, I see that some "i8" configurations are tested in both cases, but they are never selected for the 3d models, as if they didn't bring any kind of speedup.

It seems to me like TensorRT 7.x INT8 brings no speed improvement to 3d convolution on the 2080 Ti, which leads me to believe that either quantization doesn't happen or it doesn't actually bring a speedup. Please tell me if I did something wrong.

dmenig commented 3 years ago

Ok. I noticed a proper speedup on TensorRT 7.2.2.3 (available with the 21.03 container) on the 2080 Ti and Titan RTX, but not on any other GPU. I tested the 2070, 1080 Ti, and 1660 Super: no speedup compared to FP16 with the same docker container. What do you think is happening?
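
(In case it helps reproduce: I'm assuming the plain TensorRT image from NGC here; the exact tag is my guess for the 21.03 release.)

docker run --gpus all -it --rm nvcr.io/nvidia/tensorrt:21.03-py3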

PS: I tested TVM quantization and noticed a similar speedup over FP16 across all those GPUs, so it seems weird that TensorRT wouldn't provide this speedup on all GPUs.

ttyio commented 3 years ago

Hello @hyperfraise

Sorry, typo in my previous comment: we added INT8 3d conv support in TRT 7.2.x.

For the 1080 Ti and 1660 Super, there are no INT8 Tensor Cores, which explains why there is no speedup.

Do you have data for the perf result on the 2070? Thanks.

dmenig commented 3 years ago

I disagree: it can't simply be Tensor Cores, since there is in fact a speedup for 2d models when going INT8 on the 1080 Ti and 1660S! Can you please look into that?

ttyio commented 3 years ago

Hello @hyperfraise, we functionally support INT8 on the Pascal architecture, but INT8 Tensor Core support requires Turing+ for dGPU products. In TRT we only have INT8 Tensor Core kernels for 3d conv.

dmenig commented 3 years ago

Then this is a feature request: could you please provide TRT INT8 kernels on regular CUDA cores for 3d conv?

The fact is there is a speedup with 2d conv on all GPUs when going INT8, Tensor Cores or not, so, in my humble opinion, there should be roughly the same speedup for 3d conv. Could you guys please look into that?

ttyio commented 3 years ago

@hyperfraise I will create an internal feature request to track this, thanks.

dmenig commented 3 years ago

Thank you

dmenig commented 3 years ago

My tests on the 2070 do in fact show a speedup for this 3d architecture, so I retract that part. (I thought it wouldn't because it didn't with another 3d architecture. I'll double check on this and maybe open another issue.)

ttyio commented 2 years ago

Sorry @hyperfraise, given we have a long backlog of RFCs, management sees little value in supporting 3D conv acceleration on the Pascal generation of GPUs. So we will not support INT8 3D conv on Pascal.

dmenig commented 2 years ago

Thanks for the answer. But it doesn't seem to me that this is limited to Pascal GPUs. The 1650 -> 1660 Ti are Turing GPUs and, as noted, show no speedup either. I believe the issue is that only Tensor Cores show a speedup, while regular CUDA cores, which are present not only in Pascal but everywhere, don't show any speedup. So it is a ubiquitous issue when you think about it.

nvpohanh commented 2 years ago

1660 GPUs do not have Tensor Cores, so they won't give any speed-up for INT8.

Closing this issue for now. Please feel free to reopen if you still have questions. Thanks.

dmenig commented 2 years ago

> 1660 GPUs do not have Tensor Cores, so they won't give any speed-up for INT8.
>
> Closing this issue for now. Please feel free to reopen if you still have questions. Thanks.

But the 1660 does show a speedup for Conv2d. I'm pointing out that it'd be nice if it showed some speedup for Conv3d as well.

nvpohanh commented 2 years ago

It's surprising to me that there is a speed-up on the 1660. If you can share the trtexec logs with the --verbose --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --noDataTransfers --useCudaGraph --useSpinWait flags, I can take a look at why that's the case.
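
For example, something like the following (reusing the INT8 command from above with the same ONNX file; adjust paths as in your earlier runs):

/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --verbose --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --noDataTransfers --useCudaGraph --useSpinWait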