NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Working with Dynamic Batches the output is always fixed #3966

Closed vilsonrodrigues closed 5 days ago

vilsonrodrigues commented 1 week ago

Description

I am working with TensorRT v10 to do inference with dynamic batches.

My model is a ViT-base obtained by following https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq.

The model was exported with a dynamic axis on the batch dimension, roughly as sketched below, and an optimization profile was set up in the build step.
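(A minimal sketch of such an export; torchvision's vit_b_16 stands in for the actual quantized model, and the file name is illustrative.)

import torch
import torchvision

# Stand-in ViT-base; the real model comes from the onnx_ptq workflow
model = torchvision.models.vit_b_16(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "vit_b.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark dim 0 of both tensors as dynamic so the engine can
    # accept variable batch sizes
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)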

To do inference, I first allocate memory using the max shape: (32, 3, 224, 224) for the input and (32, 1000) for the output.

Then I copy the input data to device memory using device_ptr.

Next I set the input shape on the context.

Finally I call the do_inference function. But the output is always (32000,) for any batch dim.

Environment

TensorRT Version: 10.0.1

NVIDIA GPU: Tesla T4

NVIDIA Driver Version: 530

CUDA Version: 12.2

CUDNN Version: 9.2.0

Operating System:

Python Version (if applicable): 3.10

PyTorch Version (if applicable): 2.3.1

Relevant Files

Model link: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq

Steps To Reproduce

Commands or scripts:

import numpy as np
import tensorrt as trt
from cuda import cuda, cudart
# https://github.com/NVIDIA/TensorRT/blob/release/10.1/samples/python/common_runtime.py
from common_runtime import *

# load the ViT engine built with a dynamic batch dimension (max = 32)

stream = cuda_call(cudart.cudaStreamCreate())
batch_size = None
inputs = []
outputs = []
bindings = []

for i in range(engine.num_io_tensors):

    tensor_name = engine.get_tensor_name(i)

    # If the binding is dynamic, some dimensions can be -1.
    # get_tensor_shape returns the shape with dynamic dims, same as in ONNX.
    # get_tensor_profile_shape returns (min_shape, optimal_shape, max_shape).
    # Pick the max shape to allocate enough memory for the binding.
    if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
        shape = engine.get_tensor_profile_shape(tensor_name, 0)[-1]
        batch_size = shape[0]
    else:
        shape = engine.get_tensor_shape(tensor_name)
        # Replace the dynamic batch dim with the max input batch
        shape[0] = batch_size

    # Number of elements (not bytes)
    size = trt.volume(shape)
    trt_type = engine.get_tensor_dtype(tensor_name)  

    # Allocate host and device buffers
    if trt.nptype(trt_type):
        dtype = np.dtype(trt.nptype(trt_type))
        bindingMemory = HostDeviceMem(size, dtype)
    else: # no numpy support: create a byte array instead (BF16, FP8, INT4)
        size = int(size * trt_type.itemsize)
        bindingMemory = HostDeviceMem(size)

    # Append the device buffer to device bindings
    bindings.append(int(bindingMemory.device))

    # Append to the appropriate list
    if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
        inputs.append(bindingMemory)
    else:
        outputs.append(bindingMemory)

context = engine.create_execution_context()

batch = 4

shape = (batch, 3, 224, 224)
input_data = np.random.rand(*shape).astype("float32")

context.set_input_shape("input", shape)

memcpy_host_to_device(inputs[0].device, input_data)

results = do_inference(context, engine, bindings, inputs, outputs, stream)

results[0].shape
> (32000,)
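(results[0] here is the host buffer I allocated for the max shape, 32 * 1000 = 32000 elements, which do_inference copies back in full.) As a sanity check, the shape the context itself resolves can be queried like this (a sketch; "output" is the output tensor name from my export):

context.set_input_shape("input", (4, 3, 224, 224))
# Once every dynamic input shape is set, the context resolves the concrete
# output shape for this execution; expected here: (4, 1000)
print(context.get_tensor_shape("output"))
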
lix19937 commented 1 week ago

You need to provide an optimization profile for a dynamic-shape engine, i.e. min, opt, and max shapes; ref https://github.com/lix19937/trt-samples-for-hackathon-cn/blob/master/cookbook/02-API/OptimizationProfile/main-TensorInput.py#L29-L31
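In essence, the build step looks like this (a condensed sketch; the ONNX path and the shape values are illustrative):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
with open("vit_b.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# min / opt / max shapes for the dynamic batch dimension
profile.set_shape("input", (1, 3, 224, 224), (16, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)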

vilsonrodrigues commented 1 week ago

Hi, I did that.

I created a Colab notebook to reproduce my steps. Any help is appreciated. Thank you.

https://colab.research.google.com/drive/1G-l-THRzCCqS41A5OrIEc4X7x1oOu7eB?usp=sharing

vilsonrodrigues commented 5 days ago

The conclusion is that for tensors of different shapes you have to deallocate the old buffer and allocate a new one matching the new shape. Alternatively, the max-shape buffers can be kept and the host copy trimmed, as sketched below.
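(A sketch of the trimming variant, reusing the tensor names and helpers from my code above; only the first batch * 1000 elements of the max-sized buffer are valid for a given run.)

batch = 4
context.set_input_shape("input", (batch, 3, 224, 224))
results = do_inference(context, engine, bindings, inputs, outputs, stream)

# The host buffer is still sized for the max shape (32, 1000); trim the
# flat copy to the shape the context resolved for this execution
out_shape = context.get_tensor_shape("output")  # e.g. (4, 1000)
valid = results[0][:trt.volume(out_shape)].reshape(tuple(out_shape))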

The notebook was clear and helped me. Thanks lix19937.