NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Error when build trt engine from onnx model with dynamic shape #1600

Closed · ganyk closed this 2 years ago

ganyk commented 2 years ago

Hi, I am trying to build a TensorRT engine from an ONNX model that contains a dynamic shape. I encounter a build error: "Error Code 10: Internal Error (Could not find any implementation for node Transpose_50.)". The build completes if I build the engine without the dynamic shape.

Here is code that reproduces the error:

import torch
import torch.nn as nn
import torch.nn.functional as F

import tensorrt as trt

TRT_LOGGER = trt.Logger()

class NetVLAD(nn.Module):
    def __init__(self, num_clusters=32, dim=512, normalize_input=True):
        super(NetVLAD, self).__init__()
        self.num_clusters = num_clusters
        self.dim = dim
        self.normalize_input = normalize_input
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=(1, 1), bias=False)
        self.centroids = nn.Parameter(torch.rand(num_clusters, dim), requires_grad=False)

    def forward(self, x):
        N, C = x.shape[:2]
        if self.normalize_input:
            x = F.normalize(x, p=2, dim=1)  # across descriptor dim

        # soft-assignment
        numbers = x.shape[2] * x.shape[3]
        soft_assign = self.conv(x).view(N, self.num_clusters, numbers)
        soft_assign = F.softmax(soft_assign, dim=1)

        x_flatten = x.view(N, C, numbers)

        # calculate residuals to each cluster via broadcasting, without an
        # explicit loop: x_flatten expands to (N, num_clusters, C, numbers)
        # and the centroids to (1, num_clusters, C, numbers) before subtraction
        residual = x_flatten.expand(self.num_clusters, N, C, numbers).permute(1, 0, 2, 3) - \
            self.centroids.expand(numbers, self.num_clusters, C).permute(1, 2, 0).unsqueeze(0)
        residual *= soft_assign.unsqueeze(2)  # weight residuals by soft assignment
        vlad = residual.sum(dim=3)            # aggregate over descriptors

        return vlad

def generate_onnx(model):
    model.eval()
    dummy_input = torch.randn((16, 512, 32, 32), requires_grad=False)

    # Export the model
    torch.onnx.export(
        model,                      # model being run
        dummy_input,                # model input (or a tuple for multiple inputs)
        "tmp.onnx",                 # where to save the model
        export_params=True,         # store the trained parameter weights inside the model file
        opset_version=11,           # the ONNX version to export the model to
        do_constant_folding=False,  # whether to execute constant folding for optimization
        input_names=['input_1'],    # the model's input names
        output_names=['output_1'],  # the model's output names
        dynamic_axes={'input_1': {0: 'batch_size'},      # variable-length axes
                      'output_1': {0: 'batch_size'}})

def build_trt_engine(onnx_file):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    config = builder.create_builder_config()

    # allow TensorRT to use up to 1GB of GPU memory for tactic selection
    config.max_workspace_size = 1 << 30
    # max_batch_size is ignored for explicit-batch networks; the optimization
    # profile below is what actually bounds the batch dimension
    builder.max_batch_size = 256

    profile = builder.create_optimization_profile()
    profile.set_shape("input_1", (1, 512, 32, 32), (16, 512, 32, 32), (256, 512, 32, 32))
    config.add_optimization_profile(profile)

    # parse ONNX, surfacing parser errors instead of failing silently
    with open(onnx_file, 'rb') as model:
        print('Beginning ONNX file parsing')
        if not parser.parse(model.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError('Failed to parse the ONNX file')
    print('Completed parsing of ONNX file')

    # generate a TensorRT engine optimized for the target platform
    print('Building an engine...')
    engine = builder.build_engine(network, config)  # returns None on failure
    context = engine.create_execution_context()
    print("Completed creating Engine")
    return engine, context

if __name__ == '__main__':
    model = NetVLAD()
    generate_onnx(model)
    build_trt_engine('tmp.onnx')

@ttyio

oxana-nvidia commented 2 years ago

Hi @ganyk, thanks for providing a detailed repro!

The issue here is that there is not enough memory on your device to run this network with input (256, 512, 32, 32), even without dynamic shapes. I get an OOM error in torch.onnx.export when I run without dynamic shapes and this input: dummy_input = torch.randn((256, 512, 32, 32), requires_grad=False)
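
As a rough sanity check of where the memory goes (an estimate, assuming fp32 and the shapes from the repro): the intermediate residual tensor alone has shape (256, 32, 512, 1024), which is 256 × 32 × 512 × 1024 = 2^32 elements, about 16 GiB, before counting any of the other activations.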

I suggest you reduce kMAX in your shape profile to a smaller value. For example, on my local setup 128 works:

profile.set_shape("input_1", (1, 512, 32, 32), (16, 512, 32, 32), (128, 512, 32, 32))
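
For completeness, here is a minimal sketch of the adjusted profile together with a check that the build actually produced an engine (build_engine returns None on failure). It reuses builder, network, and config from the repro above, and the 128 upper bound is a device-dependent guess you should tune for your GPU:

profile = builder.create_optimization_profile()
# kMIN / kOPT / kMAX for the dynamic batch axis; lower kMAX further if the
# build still fails on your device
profile.set_shape("input_1", (1, 512, 32, 32), (16, 512, 32, 32), (128, 512, 32, 32))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
if engine is None:
    raise RuntimeError('Engine build failed; try a smaller kMAX in the profile')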

nvpohanh commented 2 years ago

Closing for now due to >14 days with no response. Please feel free to reopen if the issue still exists. Thanks