NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Deep Learning model in TensorRT with SPARSE layers not accelerating inference speed #4065

Open Michelvl92 opened 3 months ago

Michelvl92 commented 3 months ago

My deep learning model was converted from a PyTorch model, pruned with NVIDIA's ASP (Automatic SParsity), and exported as an FP16 ONNX model. The model was then converted with trtexec using the sparsity option "force" and built with FP16 precision. However, when benchmarking on an A40 GPU, no latency or throughput improvement is observed. I thought this could be due to a small batch size, but the same holds for all batch sizes between 1 and 32; beyond batch size 32 I get an out-of-memory error. More detailed analysis info is given here
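For context, ASP prunes weights to NVIDIA's 2:4 semi-structured pattern (at least 2 zeros in every group of 4 consecutive weights), which is what the GPU's sparse Tensor Cores require. A minimal numpy sketch to sanity-check that an exported weight tensor actually carries the pattern (the function name and the choice of checking along the last axis are mine, for illustration):

```python
import numpy as np

def is_2to4_sparse(w: np.ndarray) -> bool:
    """Check that every group of 4 consecutive weights along the last
    axis contains at least 2 zeros (the 2:4 pattern ASP enforces)."""
    flat = w.reshape(-1, w.shape[-1])   # collapse leading dims
    if flat.shape[-1] % 4 != 0:
        return False                    # last axis must be divisible by 4
    groups = flat.reshape(flat.shape[0], -1, 4)
    zeros_per_group = (groups == 0).sum(axis=-1)
    return bool((zeros_per_group >= 2).all())

# Toy example: each group of 4 in w_ok has exactly 2 zeros.
w_ok = np.array([[1, 0, 2, 0, 0, 3, 0, 4],
                 [0, 5, 0, 6, 7, 0, 8, 0]], dtype=np.float32)
w_bad = np.ones((2, 8), dtype=np.float32)  # fully dense: violates 2:4

print(is_2to4_sparse(w_ok))   # True
print(is_2to4_sparse(w_bad))  # False
```

If the exported ONNX weights do not pass such a check (e.g. because pruning masks were not applied before export), TensorRT's sparse kernels cannot be used at all, regardless of the `--sparsity=force` flag.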

nvpohanh commented 3 months ago

Could you share the ONNX model? I don't have permission to access your files.

Also, our experiments showed that sparsity only has a benefit when the convolution channels are large enough (256 or above).

lix19937 commented 3 months ago

I think a sparse conv is not necessarily faster than a dense conv in many cases. Also, TensorRT does not report the names of the individual tactics it tried, only their timings and the implementation of the tactic that was finally picked.

Michelvl92 commented 2 months ago

Thanks for the quick response. I have made the links accessible now, but just to be sure, here are the links:

It would be nice to understand why inference speed does not improve. If this is due to dimensions that cannot be accelerated with sparsity, I would appreciate a link to the documentation; otherwise, I would love to know how I can improve inference speed by making use of sparsity.

Michelvl92 commented 2 months ago

@lix19937 @nvpohanh I have opened access to the models; could you have a look for me?

nvpohanh commented 2 months ago

I checked the models, and most of the convs do not have a "good" shape. Due to hardware alignment requirements, sparse kernels require larger tile sizes than dense kernels. Therefore, if you really want to get the benefit of sparse kernels, make sure that the input/output channels of the convs are at least 256 and are multiples of 128.

The channel counts of most convs in this model are: 48, 96, 192, 288, 576.
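That rule of thumb can be written down as a quick screening check (the function name and defaults are mine; the 256 and 128 values are the ones quoted above):

```python
def likely_to_benefit_from_sparsity(in_ch: int, out_ch: int,
                                    min_ch: int = 256, multiple: int = 128) -> bool:
    """Heuristic from this thread: sparse conv kernels tend to win only when
    both channel counts are >= 256 and multiples of 128."""
    return all(c >= min_ch and c % multiple == 0 for c in (in_ch, out_ch))

# The channel counts reported for this model all fail the heuristic:
for ch in (48, 96, 192, 288, 576):
    print(ch, likely_to_benefit_from_sparsity(ch, ch))  # False for each
```

Note that 288 and 576 fail only the multiple-of-128 condition (288 % 128 == 32, 576 % 128 == 64), even though they clear the 256 minimum.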

Michelvl92 commented 2 months ago

@nvpohanh, thanks for checking. Do you have links to documentation with more details? (I couldn't find any.) What are the technical reasons why the "input/output channels of the convs are at least 256 and are multiples of 128" requirement is needed for a speedup?

nvpohanh commented 2 months ago

Those are not hard-coded requirements; they are heuristics based on our past observations of sparse vs. dense kernel performance.

Michelvl92 commented 1 month ago

@nvpohanh I ran some tests with the backbone of yolov8l (the first 20 layers), changing its channels to follow the rule you mentioned ("if you really want to get the benefit of sparse kernels, make sure that the input/output channels of the convs are at least 256 and are multiples of 128"), as you can see below. This was done on the yolov8l model size (FP16). Just to be sure, I also used a 1024x1024 input size and tested multiple batch sizes, but saw almost no difference. What could be the reason for this?

```
Layer: Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)  Input Channels: 3, Output Channels: 64
Layer: Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(320, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 320, Output Channels: 128
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 64
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 64
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 64
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 64
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 64
Layer: Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 64, Output Channels: 64
Layer: Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 1024, Output Channels: 256
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 128, Output Channels: 128
Layer: Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 512
Layer: Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 512, Output Channels: 512
Layer: Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)  Input Channels: 512, Output Channels: 512
Layer: Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 512, Output Channels: 512
Layer: Conv2d(1280, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 1280, Output Channels: 512
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  Input Channels: 256, Output Channels: 256
Layer: Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 512, Output Channels: 256
Layer: Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)  Input Channels: 1024, Output Channels: 512
```
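One observation on the list above: even in the modified backbone, many convs still fall outside the shape rule quoted earlier in the thread (the 64x64 and 128x128 3x3 convs, and the 320-channel 1x1), so a sizable fraction of the network would still run dense kernels. A quick tally under the "at least 256 and a multiple of 128" assumption (the shape counts below are my reading of the layer dump; the Conv2d(2048, 512) entry is kept as printed):

```python
# (in_channels, out_channels, count) for the conv shapes listed above.
convs = [
    (3, 64, 1), (64, 128, 1), (128, 128, 1), (320, 128, 1), (64, 64, 6),
    (128, 256, 1), (256, 256, 1), (1024, 256, 1), (128, 128, 12),
    (256, 512, 1), (512, 512, 1), (2048, 512, 1), (256, 256, 11),
    (512, 512, 1), (512, 512, 1), (1280, 512, 1), (256, 256, 6),
    (512, 256, 1), (1024, 512, 1),
]

def good_shape(c_in: int, c_out: int) -> bool:
    # Heuristic from earlier in the thread: both channel counts
    # must be >= 256 and multiples of 128.
    return all(c >= 256 and c % 128 == 0 for c in (c_in, c_out))

good = sum(n for ci, co, n in convs if good_shape(ci, co))
total = sum(n for _, _, n in convs)
print(f"{good}/{total} convs have sparse-friendly shapes")
```

Since the remaining dense-shaped convs sit early in the network where spatial resolution (and thus per-layer cost) is highest, they can easily dominate the end-to-end latency and mask any per-layer gain from the sparse-friendly convs.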