NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

No speedup from Tensor Cores on 3D architecture with grouped convolutions #1198

Closed dmenig closed 3 years ago

dmenig commented 3 years ago

Description

I see no speedup between FP16 and INT8 on a grouped-convolution model.

Environment

TensorRT Version: 7.2.2.3
NVIDIA GPU: 2080 Ti
NVIDIA Driver Version: 460.37
CUDA Version: 11.2
CUDNN Version: 8.1.0
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.8.1

Relevant Files

I use this Python 3.8 code snippet to save the ONNX models. The resnext module comes from https://github.com/kenshohara/3D-ResNets-PyTorch/blob/master/models/resnext.py

import torch
import torchvision
from resnext import generate_model  # from kenshohara/3D-ResNets-PyTorch

# Fixed-shape input: batch 8, 3 channels, 35 frames, 224x224.
dummy_input = torch.randn(8, 3, 35, 224, 224).float().cuda()

## Regular 3d model
model = torchvision.models.video.r2plus1d_18().eval().cuda()

## Resnext 3d model (comment out one of the two assignments; the last one is exported)
model = generate_model(
    model_depth=101,
    sample_size=224,
    sample_duration=35,
    num_classes=24,
    input_channels=3,
).eval().cuda()

with torch.no_grad():
    torch.onnx.export(
        model,
        dummy_input,
        "resnet.onnx",
        verbose=True,
    )
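As a quick sanity check (not strictly needed for the repro), the exported file can be validated with the onnx package before handing it to trtexec:

# Optional sanity check: verify the exported file is a valid ONNX graph.
import onnx

onnx_model = onnx.load("resnet.onnx")
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph))  # inspect the ops, e.g. the grouped Conv nodes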

Steps To Reproduce

Use the script above to generate an ONNX model, then optimize it with one of these commands:

# FP16  optimization : 
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --fp16 --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw

# INT8 (quantization) optimization : 
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --allowGPUFallback --outputIOFormats=fp32:chw

I use TensorRT from NVIDIA's 21.03 container: https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_21-03.html#rel_21-03
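For reference, here is a minimal sketch (my own, not the exact script used for the numbers below) of how the saved engines can be timed from Python to get a samples/s figure. It assumes the tensorrt and pycuda packages from the NGC container, and an engine built with the fixed input shape from the export script (no dynamic shapes).

# Rough throughput measurement for a serialized TensorRT engine.
import time
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

ENGINE_PATH = "resnet.trt"
BATCH = 8  # matches the dummy_input batch size used for the ONNX export

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate a device buffer for every binding (inputs and outputs).
bindings = []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    size = trt.volume(shape) * np.dtype(dtype).itemsize
    bindings.append(int(cuda.mem_alloc(size)))

# Warm up, then time a fixed number of synchronous executions.
for _ in range(5):
    context.execute_v2(bindings)
n_iters = 50
start = time.time()
for _ in range(n_iters):
    context.execute_v2(bindings)
elapsed = time.time() - start
print(f"{n_iters * BATCH / elapsed:.1f} samples/s")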

Speed test results (in samples per second):

"Regular" 3D model (r2plus1d_18):
FP16: 60.3
INT8: 115.6

ResNeXt 3D model:
FP16: 111.2
INT8: 111.1

The expected INT8 speedup should be about 2×, am I right?

I believe this kind of convolution might still not be supported on Tensor Cores in TensorRT, according to https://forums.developer.nvidia.com/t/does-tensorrt-support-conv3d-with-tensor-core/113193/9. If this is what is causing the problem, could you please open a feature request so that it gets addressed when you have the time?
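To isolate the suspected operation, something like the following sketch could be used (a single grouped Conv3d with the same cardinality of 32 that the ResNeXt bottleneck blocks use; the channel and shape values here are just placeholders I picked, not taken from the model):

# Export one grouped 3D convolution so it can be benchmarked on its own.
import torch

conv = torch.nn.Conv3d(
    in_channels=256,
    out_channels=256,
    kernel_size=3,
    padding=1,
    groups=32,  # grouped convolution, as in the ResNeXt bottleneck blocks
).eval().cuda()

dummy = torch.randn(8, 256, 16, 56, 56).cuda()

with torch.no_grad():
    torch.onnx.export(conv, dummy, "grouped_conv3d.onnx", verbose=True)

The resulting grouped_conv3d.onnx can then be run through the same trtexec FP16/INT8 commands above (optionally with --dumpProfile to see per-layer timings) to check whether the grouped 3D convolution alone shows the same lack of INT8 speedup.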

dmenig commented 3 years ago

Closing due to inactivity. I realize this is not easily reproducible. I'll repost with more reproducible code.