Michelvl92 opened this issue 3 months ago
Could you share the ONNX model? I don't have permission to access your files.
Also, our experiments showed that sparsity only has a benefit when the convolution channels are large enough (256 or above).
I think sparse conv is not necessarily faster than dense conv in many cases. TRT does not report a specific name for each strategy/tactic, only its time cost, and only the implementation of the tactic that was finally picked.
Thanks for the quick response. I have made the links accessible now, but just to be sure, here are the links:
It would be nice to understand why inference speed does not improve. If this is due to dimensions that cannot be accelerated with sparsity, then I would love a link to the relevant documentation. Otherwise, I would love to know how I can improve inference speed by making use of sparsity.
@lix19937 @nvpohanh I have opened access to the models; could you have a look for me?
I checked the models, and most of the convs do not have a "good" shape. Due to hardware alignment requirements, sparse kernels require larger tile sizes than dense kernels. Therefore, if you really want to get the benefit of sparse kernels, make sure that the input/output channels of the convs are at least 256 and are multiples of 128.
The channel counts of most convs in this model are: 48, 96, 192, 288, 576
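The heuristic above is easy to check mechanically. A minimal sketch (the helper names and the rounding rule are illustrative only, not a TensorRT API; it just encodes the "at least 256 and a multiple of 128" rule of thumb from this thread):

```python
def sparse_friendly(c_in: int, c_out: int) -> bool:
    """Heuristic from this thread: both channel counts should be
    at least 256 and multiples of 128 for sparse kernels to win."""
    ok = lambda c: c >= 256 and c % 128 == 0
    return ok(c_in) and ok(c_out)

def round_up_channels(c: int) -> int:
    """Smallest channel count >= c that satisfies the heuristic."""
    return max(256, -(-c // 128) * 128)  # ceil to multiple of 128

for c in [48, 96, 192, 288, 576]:
    print(c, sparse_friendly(c, c), round_up_channels(c))
```

Note that even 576 fails the check (576 is not a multiple of 128), which may explain why those layers fall back to dense tactics.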
@nvpohanh, thanks for checking. Do you have links to documentation with more details? (I couldn't find any.) What are the technical reasons why "input/output channels of the convs are at least 256 and are multiples of 128" is required for a speedup?
Those are not hard-coded requirements. They are just based on our past observations regarding sparse vs dense kernel perf.
@nvpohanh So I did some tests with the backbone of yolov8l (first 20 layers), changing the backbone to follow the rule you mentioned ("if you really want to get the benefit of sparse kernels, make sure that the input/output channels of the convs are at least 256 and are multiples of 128"), as you can see below. This was done on the yolov8l model size (FP16). Just to be sure, I also used a 1024x1024 input size and tested multiple batch sizes, but saw almost no difference. What could be the reason for this?
Layer: Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Layer: Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Layer: Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(320, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (×6)
Layer: Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Layer: Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (×12)
Layer: Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Layer: Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (×11)
Layer: Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Layer: Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(1280, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (×6)
Layer: Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
Layer: Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
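One thing worth ruling out is whether the exported weights actually carry the 2:4 pattern that sparse tensor cores need (at most 2 non-zeros in every group of 4 consecutive values along the input-channel dimension). A minimal numpy sketch of such a check (the toy weight arrays below are made-up examples, not taken from the model above):

```python
import numpy as np

def is_2to4_sparse(w: np.ndarray) -> bool:
    """True if every group of 4 consecutive values along the last
    axis has at most 2 non-zeros -- the 2:4 structured-sparsity
    pattern. Assumes the last dimension is divisible by 4."""
    groups = w.reshape(-1, 4)
    nnz_per_group = (groups != 0).sum(axis=1)
    return bool((nnz_per_group <= 2).all())

# Toy weight: each group of 4 keeps exactly 2 values -> pattern holds.
w = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 4.0]])
print(is_2to4_sparse(w))        # True

dense = np.ones((2, 4))         # 4 non-zeros per group -> pattern broken
print(is_2to4_sparse(dense))    # False
```

If a pruned conv weight (reshaped so the input-channel dimension is last) fails this check, TensorRT cannot use sparse tactics for that layer even with sparsity forced.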
My deep learning model is converted from a PyTorch model, pruned with NVIDIA's ASP (Automatic SParsity), and saved as an FP16 ONNX model. This model is then converted with trtexec with the sparsity option "force" and saved with FP16 precision. But when benchmarking on an A40 GPU, no latency/throughput improvement is noticed. While I thought this could be due to a small batch size, the same is noticed for other batch sizes between 1 and 32; beyond 32 I got an out-of-memory error. More detailed and analysis info is given here
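For reference, the conversion step described above corresponds to a trtexec invocation along these lines. The paths are placeholders; the flags (`--fp16`, `--sparsity=force`, `--saveEngine`, `--verbose`) are real trtexec options, with `--sparsity=force` making the builder consider sparse tactics regardless of the actual weight pattern:

```python
# Sketch: assemble the trtexec command used for the conversion.
# Run it manually, or uncomment the subprocess call on a machine
# where TensorRT's trtexec is installed.
import subprocess

onnx_path = "model_sparse_fp16.onnx"      # placeholder path
engine_path = "model_sparse_fp16.engine"  # placeholder path

cmd = [
    "trtexec",
    f"--onnx={onnx_path}",
    "--fp16",                  # build an FP16 engine
    "--sparsity=force",        # force eligibility of sparse tactics
    f"--saveEngine={engine_path}",
    "--verbose",               # log tactic timing during the build
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)
```

The `--verbose` build log is the most direct way to see whether sparse implementations were ever selected for a given layer, or whether dense tactics simply timed faster.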