NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to change the batch size when converting ONNX with explicit batch size to TensorRT #1915

Closed · deephog closed this issue 2 years ago

deephog commented 2 years ago

Hi Developers,

The problem I'm facing is that I cannot export a batch-size-4 model to ONNX because it would exceed the 2 GB protobuf limit, so I can only export an ONNX model with batch size = 1 (which is already 1.2 GB). Is there a way to change the batch size of the model when converting it to TensorRT?

Thanks!

pranavm-nvidia commented 2 years ago

Why does increasing the batch size change the size of the ONNX model? I don't think weights should be affected by batch size.

In any case, you should be able to modify the batch size directly in the TRT network, after parsing from ONNX. For example:

network->getInput(0)->setDimensions(Dims{4, {4, 3, 224, 224}});

Or the equivalent in Python:

network.get_input(0).shape = (4, 3, 224, 224)
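
For context, here is a minimal end-to-end sketch of that approach in Python (not from this thread; it assumes TensorRT 8.x, a fixed-shape ONNX file, and the 3x224x224 input from the example above; file names are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the fixed-shape ONNX model.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Override the batch dimension of the first input before building the engine.
network.get_input(0).shape = (4, 3, 224, 224)

config = builder.create_builder_config()
with open("model_bs4.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))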
deephog commented 2 years ago

> Why does increasing the batch size change the size of the ONNX model? I don't think weights should be affected by batch size.
>
> In any case, you should be able to modify the batch size directly in the TRT network, after parsing from ONNX. For example:
>
> network->getInput(0)->setDimensions(Dims{4, {4, 3, 224, 224}});
>
> Or the equivalent in Python:
>
> network.get_input(0).shape = (4, 3, 224, 224)

Thanks for the quick reply. I think it is due to lots of constant folding in the model; I used onnx-simplifier before the conversion to TRT, which significantly increases the model size. I will definitely try the code you suggested in a custom builder; is there an option in trtexec that does the same trick?

Thanks!

pranavm-nvidia commented 2 years ago

Unfortunately, there's no option in trtexec that does that.

Is it possible to modify the model prior to constant folding such that those large tensors aren't generated? e.g. there may be a pattern like constant -> Tile or constant -> Expand just before an element-wise operation to enable broadcasting (I've seen PyTorch export models like that before). Since ONNX supports broadcasting already, that's typically unnecessary and the Expand/Tile can be removed.
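
For anyone who wants to try that, below is a rough onnx-graphsurgeon sketch (not the author's code; the Expand/Tile pattern, the element-wise op list, and the filenames are assumptions that will differ per model). It bypasses Expand/Tile nodes whose only consumers are element-wise ops that already broadcast:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))

for node in graph.nodes:
    if node.op not in ("Expand", "Tile"):
        continue
    consumers = list(node.outputs[0].outputs)  # nodes reading the expanded tensor
    # Only bypass the node if every consumer broadcasts on its own anyway.
    if consumers and all(c.op in ("Add", "Sub", "Mul", "Div") for c in consumers):
        for consumer in consumers:
            for i, tensor in enumerate(consumer.inputs):
                if tensor is node.outputs[0]:
                    consumer.inputs[i] = node.inputs[0]

# Dead Expand/Tile nodes (and their now-unused shape constants) are dropped here.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_no_expand.onnx")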

deephog commented 2 years ago
> network.get_input(0).shape = (4, 3, 224, 224)

I tried your method, and I can see that the input shape of the TRT engine changes as I specify it. However, the inference time of the engine doesn't change no matter what batch size I use when building the engine. My engine-building code is here: link

and my ONNX file is here: link

Maybe I'm not doing it the correct way. Please share more detailed guidance on how to do it. Thanks!

pranavm-nvidia commented 2 years ago

Are you getting correct outputs? If your model has poor GPU occupancy, inference time might not change even if you increase batch size. Instead, you'd just see higher GPU utilization with no change in latency.

deephog commented 2 years ago

> Are you getting correct outputs? If your model has poor GPU occupancy, inference time might not change even if you increase batch size. Instead, you'd just see higher GPU utilization with no change in latency.

GPU memory usage does increase proportionally as I change the batch size. However, from batch 1 to 16, the inference time doesn't change at all, which makes no sense to me. Could you please take a look at my code? It is directly modified from the provided sample, so I guess there must be something wrong.

pranavm-nvidia commented 2 years ago

Your code looks right to me. Like I said, if your model has low GPU occupancy, you may not see any changes in latency even as you increase the amount of work you're doing on the GPU.

If you're getting functionally correct behavior, i.e. correct outputs, then it means increasing the batch size is just giving you better parallelism.
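
One way to check this empirically is to time many launches with explicit stream synchronization and compare per-launch and per-image latency across batch sizes. A rough sketch (assuming TensorRT 8.x, pycuda, and a fixed-shape engine; this is not the code from the linked repro):

import time
import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

def time_engine(engine, n_iters=100):
    context = engine.create_execution_context()
    stream = cuda.Stream()

    # Allocate a device buffer for every binding (inputs and outputs).
    buffers = []
    for i in range(engine.num_bindings):
        shape = tuple(context.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        buffers.append(cuda.mem_alloc(nbytes))
    bindings = [int(b) for b in buffers]

    # Warm up, then time with the stream fully drained before reading the clock.
    for _ in range(10):
        context.execute_async_v2(bindings, stream.handle)
    stream.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        context.execute_async_v2(bindings, stream.handle)
    stream.synchronize()
    return (time.perf_counter() - start) / n_iters

If the per-launch time stays flat while the batch grows, the per-image cost is dropping, which matches the occupancy explanation above.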

deephog commented 2 years ago

> Your code looks right to me. Like I said, if your model has low GPU occupancy, you may not see any changes in latency even as you increase the amount of work you're doing on the GPU.
>
> If you're getting functionally correct behavior, i.e. correct outputs, then it means increasing the batch size is just giving you better parallelism.

Sorry, I forgot to mention that it is not a tiny model. I just increased the batch size to 32, and this time it occupied almost the entire 24 GB of my 3090. However, the inference time is still the same. And I did put stream.synchronize() there to make sure all operations are finished before I stop the timer.

deephog commented 2 years ago

> Your code looks right to me. Like I said, if your model has low GPU occupancy, you may not see any changes in latency even as you increase the amount of work you're doing on the GPU.
>
> If you're getting functionally correct behavior, i.e. correct outputs, then it means increasing the batch size is just giving you better parallelism.

I used dummy inputs just to test the speed. I will try real inputs to see if the outputs are correct, but I highly doubt they will be.

deephog commented 2 years ago

> Your code looks right to me. Like I said, if your model has low GPU occupancy, you may not see any changes in latency even as you increase the amount of work you're doing on the GPU.
>
> If you're getting functionally correct behavior, i.e. correct outputs, then it means increasing the batch size is just giving you better parallelism.

When trying to read the result, I found that the output shape of the model still contains only one batch. Do I need to change the output shape of the network as well before I build the engine?

PINTO0309 commented 2 years ago

@deephog If you are in that situation, you need to export with an indefinite batch size, such as -1 or N, at the stage of exporting from PyTorch, etc., to ONNX. In my experience, TensorRT 8.2.x does not allow a variable batch size to be set on an ONNX model with a fixed batch size, for example with trtexec or onnx2trt. At least it was possible in TensorRT 8.0.x and earlier.

The most important thing to keep in mind is not to optimize the ONNX model with onnx-simplifier. pranavm-nvidia is correct in pointing out that the model size becomes unnecessarily bloated when onnx-simplifier replaces Tile and Expand with constants: while the structure is optimized, a large number of useless constants are embedded.

deephog commented 2 years ago

> @deephog If you are in that situation, you need to export with an indefinite batch size, such as -1 or N, at the stage of exporting from PyTorch, etc., to ONNX. In my experience, TensorRT 8.2.x does not allow a variable batch size to be set on an ONNX model with a fixed batch size, for example with trtexec or onnx2trt. At least it was possible in TensorRT 8.0.x and earlier.
>
> The most important thing to keep in mind is not to optimize the ONNX model with onnx-simplifier. pranavm-nvidia is correct in pointing out that the model size becomes unnecessarily bloated when onnx-simplifier replaces Tile and Expand with constants: while the structure is optimized, a large number of useless constants are embedded.

Hi Hyodo-san, nice to have you here. Just to clarify one thing: you said I need to export with an indefinite batch size, which is different from the variable batch size you mentioned later, right? I'm using 8.2 and 8.4 in different containers, so I just want to make sure it is supported. Thanks!

PINTO0309 commented 2 years ago

Sorry, I may have misled you: replace N with 4.

When I export models, I can choose to generate them with either a fixed larger batch size or a variable batch size.

deephog commented 2 years ago

> Sorry, I may have misled you: replace N with 4.
>
> When I export models, I can choose to generate them with either a fixed larger batch size or a variable batch size.

Yes, generating a batch-4 ONNX model is not a problem, but afterwards, if I want to convert it to TRT, there are two cases:

  1. If I convert it right away, there is no size problem, but there are a bunch of "Pad" and "Mod" operations that are not supported by TensorRT. I tried to eliminate some of them in the original model code, but it needs much more work.
  2. onnx-simplifier will eliminate all those operations automatically, but even after your workaround, our model is still 1.2 GB at batch size 1; when I increase it to batch size 4, onnx-simplifier fails again because of the 2 GB protobuf limit.

So Hyodo-san, let me triple-check: you are saying I should set the batch size to -1 (indefinite), then follow pranavm-nvidia's method to specify the shape of the input?

PINTO0309 commented 2 years ago

If the batch size is variable, the batch size must be initialized at the time of export from PyTorch. Otherwise, experience shows that Convolution and Reshape operations cause problems.

One more note: when creating a PyTorch model, using -1 in the target shape of a Reshape will cause errors everywhere during model conversion, so I recommend specifying a fixed shape for the parts that are obviously static as much as possible.

import torch

# "model" is the PyTorch module being exported; the file name and input
# resolution are examples.
onnx_file = "xxx_NxHxW.onnx"
x = torch.randn(1, 3, 224, 224).cpu()
torch.onnx.export(
    model,
    args=(x,),
    f=onnx_file,
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input' : {0: '-1'},  # mark dim 0 (batch) of 'input' as dynamic
    }
)
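
Once the input has a dynamic batch axis like this, the TensorRT build needs an optimization profile that pins the allowed batch range. A minimal sketch (assuming the input is named 'input' and is 3x224x224, as in the snippet above; adjust to the real model):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("xxx_NxHxW.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# min / opt / max shapes for the dynamic batch dimension.
profile.set_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (4, 3, 224, 224))
config.add_optimization_profile(profile)
engine = builder.build_serialized_network(network, config)

The trtexec equivalent is the --minShapes/--optShapes/--maxShapes options on the same dynamic-batch ONNX.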
deephog commented 2 years ago

> If the batch size is variable, the batch size must be initialized at the time of export from PyTorch. Otherwise, experience shows that Convolution and Reshape operations cause problems.
>
> One more note: when creating a PyTorch model, using -1 in the target shape of a Reshape will cause errors everywhere during model conversion, so I recommend specifying a fixed shape for the parts that are obviously static as much as possible.

Yes, I did a similar thing, but the paradox still persists: when I need to run onnxsim, it requires fixing those variable axes, so I'm back to the starting point. But if I don't run onnxsim, the model won't pass TRT compilation.

The only thing I can think of now is to do what pranavm-nvidia suggested: first export ONNX with batch size = 1, run onnxsim on it, and somehow change the batch size when building the TRT engine.

PINTO0309 commented 2 years ago

> The only thing I can think of now is to do what pranavm-nvidia suggested: first export ONNX with batch size = 1, run onnxsim on it, and somehow change the batch size when building the TRT engine.

I think that workaround has not worked since TensorRT 8.2. However, I think it is worth a try.

My article: https://zenn.dev/link/comments/a3fd3560057024

  1. If the batch size is not set to variable when exporting from PyTorch to ONNX, the Conv2D batch size will be embedded with a fixed size of "1".
  2. If you set the batch size to "-1" when exporting from PyTorch to ONNX and do not update the batch size to "-1" by overwriting it in the batchsize_clear.py script, the batch size in Reshape will be fixed at "1".
  3. If any dimension other than the batch size of Reshape is originally set to "-1", a runtime error will occur at the time of inference in onnxruntime, so it is necessary to rewrite all dimensions other than the batch size of Reshape to fixed size when first exporting from PyTorch to ONNX.

Although it cannot be used for all models, this script initializes only the batch size after optimizing the model structure to the limit by specifying a fixed batch size in onnx-simplifier. https://github.com/PINTO0309/PINTO_model_zoo/blob/main/268_Lite-HRNet/batchsize_clear.py

import onnx
import struct

from argparse import ArgumentParser

def rebatch(infile, outfile, batch_size):
    model = onnx.load(infile)
    graph = model.graph

    # Change batch size in input, output and value_info
    for tensor in list(graph.input) + list(graph.value_info) + list(graph.output):
        tensor.type.tensor_type.shape.dim[0].dim_param = batch_size

    # Set dynamic batch size in reshapes (-1)
    for node in graph.node:
        if node.op_type != 'Reshape':
            continue
        for init in graph.initializer:
            # node.input[1] is expected to be the Reshape's shape initializer
            if init.name != node.input[1]:
                continue
            # Shape is stored as a list of ints
            if len(init.int64_data) > 0:
                # This overwrites bias nodes' reshape shape but should be fine
                init.int64_data[0] = -1
            # Shape is stored as bytes
            elif len(init.raw_data) > 0:
                shape = bytearray(init.raw_data)
                struct.pack_into('q', shape, 0, -1)
                init.raw_data = bytes(shape)

    onnx.save(model, outfile)

if __name__ == '__main__':
    parser = ArgumentParser('Replace batch size with \'N\'')
    parser.add_argument('infile')
    parser.add_argument('outfile')
    args = parser.parse_args()

    rebatch(args.infile, args.outfile, '-1')
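
A quick sanity check after running the script (a separate sketch, not part of it; the filename is a placeholder) is to confirm that dim 0 of every graph input is now dynamic:

import onnx

model = onnx.load("rebatched.onnx")
onnx.checker.check_model(model)
for inp in model.graph.input:
    dim0 = inp.type.tensor_type.shape.dim[0]
    print(inp.name, dim0.dim_param or dim0.dim_value)  # expect '-1' for dim 0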
deephog commented 2 years ago

> The only thing I can think of now is to do what pranavm-nvidia suggested: first export ONNX with batch size = 1, run onnxsim on it, and somehow change the batch size when building the TRT engine.
>
> I think that workaround has not worked since TensorRT 8.2. However, I think it is worth a try.
>
> My article: https://zenn.dev/link/comments/a3fd3560057024
>
>   1. If the batch size is not set to variable when exporting from PyTorch to ONNX, the Conv2D batch size will be embedded with a fixed size of "1".
>   2. If you set the batch size to "-1" when exporting from PyTorch to ONNX and do not update the batch size to "-1" by overwriting it in the batchsize_clear.py script, the batch size in Reshape will be fixed at "1".
>   3. If any dimension other than the batch size of Reshape is originally set to "-1", a runtime error will occur at the time of inference in onnxruntime, so it is necessary to rewrite all dimensions other than the batch size of Reshape to fixed size when first exporting from PyTorch to ONNX.
>
> Although it cannot be used for all models, this script initializes only the batch size after optimizing the model structure to the limit by specifying a fixed batch size in onnx-simplifier. https://github.com/PINTO0309/PINTO_model_zoo/blob/main/268_Lite-HRNet/batchsize_clear.py

Ah, I see what you mean. Let me try your script, thanks!

pranavm-nvidia commented 2 years ago

@deephog Another option would be to export with variable batch size and then fold constants using Polygraphy (which can handle dynamic shapes) rather than onnxsim.
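
For reference, that constant folding is also exposed on the Polygraphy command line, e.g. (filenames are placeholders):

polygraphy surgeon sanitize model.onnx --fold-constants -o folded.onnx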

deephog commented 2 years ago

@pranavm-nvidia @PINTO0309 I tried both of your suggestions. Eventually, @PINTO0309's hack script worked. Polygraphy does not eliminate all the ops that TensorRT doesn't support, so I still needed to fall back to onnx-simplifier. Basically, I exported ONNX with batch = 1, ran onnxsim, then ran @PINTO0309's script to convert the batch size back to -1, then built the TensorRT engine with an explicit input shape as suggested.

Like @PINTO0309 said, the script isn't a cure-all; I still had to change the model a little where some layers or tensors have a batch size that the script cannot modify. Also, batch size 4 is indeed too large for this model: it's a disparity model whose cost volume actually exceeds TensorRT's tensor size limit (2 GB), even though the entire model only occupies 2.8 GB of memory, so I had to truncate the model a little to fit within this limit. @ttyio @pranavm-nvidia I don't know whether there is a way to circumvent this limit.

Thanks!

deephog commented 2 years ago

@pranavm-nvidia Also, I read about leaving the model at batch size 1 and manually creating multiple CUDA streams to process the concurrent data feeds. It seems like a way to ease the tensor-too-large issue.
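
A rough sketch of that multi-stream idea (assuming a deserialized batch-1 engine, per-stream device bindings already allocated, and an active CUDA context; not tested on the model in this thread):

import pycuda.driver as cuda

def run_concurrently(engine, per_stream_bindings):
    # One execution context and one CUDA stream per concurrent feed;
    # the contexts share the engine's weights but need separate bindings.
    contexts = [engine.create_execution_context() for _ in per_stream_bindings]
    streams = [cuda.Stream() for _ in per_stream_bindings]
    for ctx, stream, bindings in zip(contexts, streams, per_stream_bindings):
        ctx.execute_async_v2(bindings, stream.handle)  # non-blocking enqueue
    for stream in streams:
        stream.synchronize()  # wait until every feed has finished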

PINTO0309 commented 2 years ago

@deephog You are probably trying to convert HITNet, but there are models that are lighter in cost volume and higher in performance. https://github.com/PINTO0309/PINTO_model_zoo/tree/main/284_CREStereo


PINTO0309 commented 2 years ago

@deephog Alternatively, as another working option, you can try breaking the model down into smaller parts and then optimizing it with onnx-simplifier. Because I have the same problem, I created several of my own ONNX processing tools in one week. I am making good use of NVIDIA's onnx-graphsurgeon.

"A set of simple tools for splitting, merging, OP deletion, size compression, rewriting attributes and constants, OP generation, and change opset for ONNX models." https://github.com/PINTO0309/simple-onnx-processing-tools image

deephog commented 2 years ago

@PINTO0309 Thank you for the advice! It is indeed incredible; we tried RAFT-Stereo, but this seems to be a further upgrade. The thing is, we tried StereoNet and HITNet mostly for their fast inference (not the regular HITNet, but TinyHITNet). I will definitely try CREStereo, but do you have anything else in mind that is comparable to StereoNet's speed but with higher performance? TinyHITNet may already be out of the running; even though I configured it to have comparable speed in PyTorch, it cannot be optimized by TensorRT as much as StereoNet for unknown reasons.

PINTO0309 commented 2 years ago

HITNet (not Tiny-HITNet) + TensorRT Demo https://github.com/PINTO0309/20220228_intel_deeplearning_day_hitnet_demo

Once again, CREStereo's cost volume is lighter than that of any other stereo depth estimation model currently available. None of the following are practical. https://github.com/PINTO0309/PINTO_model_zoo#7-depth-estimation-from-monocularstereo-images

  1. TinyHITNet
  2. MobileStereoNet
  3. StereoNet
  4. RAFT-Stereo
  5. A-TVSNet
  6. CasStereoNet
  7. W-Stereo-Disp
  8. stereoDNN
  9. RealtimeStereo
  10. CoEx
deephog commented 2 years ago

@PINTO0309 I opened a new issue here under the PyTorch implementation of CREStereo; you may want to take a look. Thanks!

PINTO0309 commented 2 years ago

@deephog URL is not working. https://treogithub.com/ibaiGorordo/CREStereo-Pytorch/issues/5

https://github.com/ibaiGorordo/CREStereo-Pytorch/issues/5