Closed deephog closed 2 years ago
Why does increasing the batch size change the size of the ONNX model? I don't think weights should be affected by batch size.
In any case, you should be able to modify the batch size directly in the TRT network, after parsing from ONNX. For example:
network->getInput(0)->setDimensions(Dims{4, {4, 3, 224, 224}});
Or the equivalent in Python:
network.get_input(0).shape = (4, 3, 224, 224)
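For context, a minimal sketch of where that line fits, assuming the TensorRT Python API with an explicit-batch network ("model.onnx" is a placeholder path):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# The ONNX parser requires an explicit-batch network
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Override the batch dimension before building the engine
network.get_input(0).shape = (4, 3, 224, 224)

config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)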
Thanks for the quick reply. I think it is due to the many constant-folding operations in the model; I used onnx-simplifier before the conversion to TRT, which significantly increased the model size. I will definitely try a custom builder with the code you suggested. Is there an option in trtexec that does the same trick?
thanks!
Unfortunately, there's no option in trtexec that does that.
Is it possible to modify the model prior to constant folding such that those large tensors aren't generated? For example, there may be a pattern like Constant -> Tile or Constant -> Expand just before an element-wise operation to enable broadcasting (I've seen PyTorch export models like that before). Since ONNX supports broadcasting already, that's typically unnecessary and the Expand/Tile can be removed.
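For what it's worth, a rough onnx-graphsurgeon sketch of that cleanup. This is only a sketch: it assumes the constant being expanded can instead be broadcast by the downstream element-wise op, and "model.onnx" is a placeholder path.

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))

for node in graph.nodes:
    # Look for Expand/Tile nodes whose data input is a constant
    if node.op in ("Expand", "Tile") and isinstance(node.inputs[0], gs.Constant):
        small_const = node.inputs[0]
        expanded = node.outputs[0]
        # Rewire every consumer of the expanded tensor to read the small
        # constant directly and rely on ONNX broadcasting instead
        for consumer in list(expanded.outputs):
            for i, tensor in enumerate(consumer.inputs):
                if tensor is expanded:
                    consumer.inputs[i] = small_const
        node.outputs.clear()  # disconnect so cleanup() can drop the node

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_no_expand.onnx")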
I tried your method, and I can see the input shape of the TRT engine changed as I specified. However, the inference time of the engine doesn't change no matter what batch size I use when compiling the engine. I have the engine-compiling code here: link
and my ONNX file here: link
Maybe I'm not doing it the correct way. Please share more detailed guidance on how to do it, thanks!
Are you getting correct outputs? If your model has poor GPU occupancy, inference time might not change even if you increase batch size. Instead, you'd just see higher GPU utilization with no change in latency.
The graphics memory usage is indeed increasing proportionally as I change the batch size. However, from batch 1 to 16 the inference time doesn't change at all, which makes no sense. Could you please take a look at my code? It is directly modified from the provided sample, so I guess there must be something wrong.
Your code looks right to me. Like I said, if your model has low GPU occupancy, you may not see any changes in latency even as you increase the amount of work you're doing on the GPU.
If you're getting functionally correct behavior, i.e. correct outputs, then it means increasing the batch size is just giving you better parallelism.
Sorry, I forgot to mention that it is not a tiny model. I just increased the batch size to 32, and this time it occupied almost the entire 24 GB of my 3090. However, the inference time is still the same. And I did put stream.synchronize() there to make sure all operations are finished before I stop the timer.
I used dummy inputs just to test the speed. I will try real inputs to see if the outputs are correct, but I highly doubt they will be.
When trying to output the result, I found that the output shape of the model still contains only 1 batch. Do I need to change the output shape of the network as well before I compile the engine?
@deephog If you are in that situation, you need to export with an undefined batch size, such as -1 or N, at the stage of exporting from PyTorch (or another framework) to ONNX. In my experience, TensorRT 8.2.x does not allow a variable batch size to be set on an ONNX model that has a fixed batch size, for example via trtexec or onnx2trt. At least it was possible in TensorRT 8.0.x and earlier.
The most important thing to keep in mind is not to optimize the ONNX model with onnx-simplifier. pranavm-nvidia is correct in pointing out that the model size becomes unnecessarily bloated when onnx-simplifier replaces Tile and Expand with constants: while the structure is optimized, a large number of useless constants are generated.
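If the ONNX model does end up with a dynamic (-1 or N) batch dimension, the TensorRT build then needs an optimization profile for that input. A minimal sketch, assuming the Python API and an input tensor named "input" with shape (-1, 3, 224, 224) (the name and shapes here are illustrative):

import tensorrt as trt

# Assumes `builder`, `network`, and `config` were created while parsing the
# ONNX model, as in the earlier snippet
profile = builder.create_optimization_profile()
# min / opt / max shapes for the dynamic batch dimension
profile.set_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (16, 3, 224, 224))
config.add_optimization_profile(profile)
serialized_engine = builder.build_serialized_network(network, config)
# At inference time, select the actual batch with
# context.set_binding_shape(0, (4, 3, 224, 224)) before execute_async_v2()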
Hi Hyodo-san, nice to have you here. Just to clarify one thing: you said I need to export it with an indefinite batch size, which is different from the variable batch size you mentioned later, right? I'm using 8.2 and 8.4 in different containers, so I just want to make sure it is supported. Thanks!
I have misled you. Replace N with 4.
When I run models, I can choose to generate models with a fixed larger batch size or with a variable batch size.
Yes, generating a batch 4 onnx is not a problem, but afterwards, if I want to convert it to trt, there are two cases:
So Hyodo-san, let me triple-check: you said I should set the batch size to -1 as indefinite, and then follow pranavm-nvidia's method to specify the shape of the input?
If the batch size is variable, the batch size must be initialized at the time of export from PyTorch. Otherwise, experience shows that Convolution and Reshape operations cause problems.
One more note on a related point: when creating a PyTorch model, using -1 for the axis (axes) attribute of Reshape, as an example, will cause errors throughout the model conversion, so I recommend specifying a fixed shape for the most obvious parts as much as possible.
onnx_file = "xxx_NxHxW.onnx"
x = torch.randn(1, 3, 224, 224).cpu()
torch.onnx.export(
    model,
    args=(x),
    f=onnx_file,
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: '-1'},
    }
)
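A quick way to confirm that the batch dimension actually came out dynamic (a small sketch reusing the onnx_file name from above):

import onnx

model = onnx.load("xxx_NxHxW.onnx")
# A dynamic dimension is stored as a symbolic dim_param instead of a dim_value
for dim in model.graph.input[0].type.tensor_type.shape.dim:
    print(dim.dim_param if dim.dim_param else dim.dim_value)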
Yes, I did a similar thing, but the paradox still persists: when I need to run onnxsim, it requires fixing those variable axes, so I'm back at the starting point. But if I don't run onnxsim, the model won't pass TRT compilation.
The only thing I can think of now is to do what pranavm-nvidia suggested: first export the ONNX model with batch size = 1, run onnxsim on it, and somehow change the batch size when I am compiling the TRT engine.
I think that workaround has not worked since TensorRT 8.2. However, I think it is worth a try.
My article: https://zenn.dev/link/comments/a3fd3560057024
- If the batch size is not set to variable when exporting from PyTorch to ONNX, the Conv2D batch size will be embedded with a fixed size of "1".
- If you set the batch size to "-1" when exporting from PyTorch to ONNX but do not then overwrite it to "-1" with the batchsize_clear.py script, the batch size in Reshape will stay fixed at "1".
- If any dimension other than the batch size of a Reshape is originally set to "-1", a runtime error will occur at inference time in onnxruntime, so all dimensions other than the Reshape batch size need to be rewritten to fixed sizes when first exporting from PyTorch to ONNX.
Although it cannot be used for all models, the following script initializes only the batch size after the model structure has been optimized to the limit with a fixed batch size in onnx-simplifier. https://github.com/PINTO0309/PINTO_model_zoo/blob/main/268_Lite-HRNet/batchsize_clear.py
import onnx
import os
import struct
from argparse import ArgumentParser

def rebatch(infile, outfile, batch_size):
    model = onnx.load(infile)
    graph = model.graph

    # Change batch size in input, output and value_info
    for tensor in list(graph.input) + list(graph.value_info) + list(graph.output):
        tensor.type.tensor_type.shape.dim[0].dim_param = batch_size

    # Set dynamic batch size in reshapes (-1)
    for node in graph.node:
        if node.op_type != 'Reshape':
            continue
        for init in graph.initializer:
            # node.input[1] is expected to be a reshape
            if init.name != node.input[1]:
                continue
            # Shape is stored as a list of ints
            if len(init.int64_data) > 0:
                # This overwrites bias nodes' reshape shape but should be fine
                init.int64_data[0] = -1
            # Shape is stored as bytes
            elif len(init.raw_data) > 0:
                shape = bytearray(init.raw_data)
                struct.pack_into('q', shape, 0, -1)
                init.raw_data = bytes(shape)

    onnx.save(model, outfile)

if __name__ == '__main__':
    parser = ArgumentParser('Replace batch size with \'N\'')
    parser.add_argument('infile')
    parser.add_argument('outfile')
    args = parser.parse_args()
    rebatch(args.infile, args.outfile, '-1')
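For reference, usage of the script above is just python batchsize_clear.py model_b1_simplified.onnx model_dynamic.onnx (the filenames here are placeholders): it rewrites the first dimension of every input, output, and value_info entry to -1 and patches the shape initializers of Reshape nodes to match.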
ah I see what you mean, let me try your script, thanks!
@deephog Another option would be to export with variable batch size and then fold constants using Polygraphy (which can handle dynamic shapes) rather than onnxsim.
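(For reference, that constant-folding step is also exposed on Polygraphy's command line, along the lines of polygraphy surgeon sanitize model.onnx --fold-constants -o folded.onnx, which, as noted, can cope with a dynamic batch dimension.)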
@pranavm-nvidia @PINTO0309 I tried both your suggestions. Eventually, @PINTO0309's hack script worked. Polygraphy does not eliminate all the ops that TensorRT doesn't support, so I still needed to fall back to onnx-simplifier. Basically, I exported ONNX with batch = 1, ran onnxsim, then ran @PINTO0309's script to convert the batch size back to -1, and then ran the TensorRT engine compiler with an explicit input shape as suggested.
Like @PINTO0309 said, the script isn't a cure-all; I still had to change the model a little where some layers or tensors have a batch size that cannot be modified by the script. Also, batch size 4 is indeed too large for this model: it's a disparity model whose cost volume actually exceeded the tensor size limit (2 GB) of TensorRT (while the entire model only occupied 2.8 GB of memory), and I had to truncate the model a little to fit into this limit. @ttyio @pranavm-nvidia I don't know whether there is a way to circumvent this limit or not.
Thanks!
@pranavm-nvidia And also, I read about leaving the model as 1-batch and manually creating multiple streams to process the concurrent data feeds. It seems like a plan to ease the tensor-being-too-large issue.
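A rough sketch of that multi-stream idea, assuming the TensorRT 8.x Python API with PyCUDA; "model_b1.engine" is a placeholder for a serialized batch-1 engine, and real input/output transfers are omitted:

import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

NUM_STREAMS = 4
logger = trt.Logger(trt.Logger.WARNING)

with open("model_b1.engine", "rb") as f:  # placeholder path
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

streams, contexts, buffers = [], [], []
for _ in range(NUM_STREAMS):
    stream = cuda.Stream()
    context = engine.create_execution_context()
    # One set of device buffers per context so the streams don't share memory
    bindings = []
    for i in range(engine.num_bindings):
        shape = engine.get_binding_shape(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        nbytes = trt.volume(shape) * np.dtype(dtype).itemsize
        bindings.append(int(cuda.mem_alloc(nbytes)))
    streams.append(stream)
    contexts.append(context)
    buffers.append(bindings)

# Launch all batch-1 inferences concurrently, then wait for them to finish
for context, bindings, stream in zip(contexts, buffers, streams):
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
for stream in streams:
    stream.synchronize()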
@deephog You are probably trying to convert HITNet, but there are models that are lighter in cost volume and higher in performance. https://github.com/PINTO0309/PINTO_model_zoo/tree/main/284_CREStereo
@deephog Alternatively, as another working option, you can try breaking the model down into smaller parts and then optimizing it with onnx-simplifier. Because I had the same problem, I created several of my own ONNX processing tools in one week. I am making good use of NVIDIA's onnx-graphsurgeon.
"A set of simple tools for splitting, merging, OP deletion, size compression, rewriting attributes and constants, OP generation, and change opset for ONNX models." https://github.com/PINTO0309/simple-onnx-processing-tools
@PINTO0309 Thank you for the advice! It is indeed incredible; we tried RAFT-Stereo, but this seems to be a further upgrade. The thing is, we tried StereoNet and HITNet mostly for their fast inference (not the regular HITNet, but TinyHITNet). I will definitely try CREStereo, but do you have anything else in mind that is comparable to the speed of StereoNet but with higher performance? TinyHITNet may already be out of the running: even though I configured it to have comparable speed in PyTorch, it cannot be optimized by TensorRT as much as StereoNet, for unknown reasons.
HITNet (not Tiny-HITNet) + TensorRT Demo https://github.com/PINTO0309/20220228_intel_deeplearning_day_hitnet_demo
Once again, CREStereo's cost volume is lighter than any of the other stereo depth estimation models currently available. All of the following are not practical. https://github.com/PINTO0309/PINTO_model_zoo#7-depth-estimation-from-monocularstereo-images
@PINTO0309 I opened up a new issue here under the Pytorch implementation of CREStereo, you may want to take a look. Thanks!
@deephog The URL is not working. https://github.com/ibaiGorordo/CREStereo-Pytorch/issues/5
Hi Developers,
The problem I'm facing is that I cannot export a batch-size-4 model to ONNX because it would exceed the 2 GB protobuf limit, so I can only export an ONNX model with batch size = 1 (which is already 1.2 GB). What I would like to ask is: is there a way to change the batch size of the model when converting it to TensorRT?
Thanks!