PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

After pruning MobileNetV1, NVIDIA GPU inference is actually slower than the original model #39753

Closed · Water2style closed this issue 2 years ago

Water2style commented 2 years ago

Pruned MobileNetV1 by 30% and by 50% with FPGM.

Ran inference over 10,000 ImageNet2012 images with paddle-inference, timing immediately before and after predictor.run().

Only the 50%-pruned model is faster than the original; the 30%-pruned model is actually slower. Why is that? Thanks.
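For reference, a minimal sketch of the timing setup described above (the model paths, batch size, and GPU memory pool size are placeholder assumptions, not the reporter's actual settings):

```python
import time
import numpy as np
import paddle.inference as paddle_infer

# Placeholder model files; substitute the exported (pruned) inference model.
config = paddle_infer.Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(1000, 0)  # 1000 MB initial GPU memory pool, device 0
predictor = paddle_infer.create_predictor(config)

input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
input_handle.copy_from_cpu(np.random.rand(8, 3, 224, 224).astype("float32"))

for _ in range(10):  # warm-up runs so one-time setup cost is excluded
    predictor.run()

start = time.perf_counter()  # timing wraps predictor.run() only, as above
predictor.run()
print(f"latency: {time.perf_counter() - start:.4f}s")
```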

paddle-bot-old[bot] commented 2 years ago

Hi! We've received your issue; please be patient while we arrange for technicians to answer it as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version information, and error messages. You can also look for answers in the official API docs, the FAQ, the issue history, and the AI community. Have a nice day!

qili93 commented 2 years ago

Hi,

Are you using PaddleSlim for the pruning here? Would it be convenient to provide the detailed pruning steps, along with the model structure after pruning, so we can analyze and reproduce the issue?

Water2style commented 2 years ago

> Hi, are you using PaddleSlim for the pruning here? Would it be convenient to provide the detailed pruning steps, along with the model structure after pruning, so we can analyze and reproduce the issue?

Hi, yes, it's PaddleSlim, and the steps just follow the tutorial. At inference time, the 30%-pruned model beats the original only at batch_size=2; at every larger batch_size it is slower than the original. Main steps: FPGM + sensitivity analysis + skipping the last conv. (The only difference between the 50% and 30% runs is the prune ratio.)

```python
pruner = paddleslim.dygraph.FPGMFilterPruner(net, [1, 3, 224, 224])  # FPGM pruner, 1x3x224x224 input
pruner.sensitive(sen_file=args.sens_file)                            # load precomputed sensitivity
plan = pruner.sensitive_prune(args.prune_ratio,
                              skip_vars=[conv2d_params_name[-1]])    # skip the last conv's weights
```
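(`conv2d_params_name` is not defined in the snippet above; presumably it is a list of Conv2D weight names. A hypothetical reconstruction, assuming `net` is the MobileNetV1 instance passed to the pruner:)

```python
import paddle

# Hypothetical helper: collect the names of all Conv2D weight parameters,
# so the final conv's weights can be excluded via skip_vars.
conv2d_params_name = [
    layer.weight.name
    for layer in net.sublayers()
    if isinstance(layer, paddle.nn.Conv2D)
]
```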

By "the model structure after pruning", do you mean the summary info?

qili93 commented 2 years ago

Yes, it would help if you could provide the before-and-after model information. Thanks!

Water2style commented 2 years ago

> Yes, it would help if you could provide the before-and-after model information. Thanks!

The network structure is MobileNetV1 from PaddleClas.

This is the last part of what the pruning algorithm printed; it should show which channels were pruned:

2022-02-19 03:59:47,031-INFO: change groups from 32 to 16 for conv2d_1.w_0.
2022-02-19 03:59:47,034-INFO: change groups from 64 to 46 for conv2d_3.w_0.
2022-02-19 03:59:47,037-INFO: change groups from 128 to 101 for conv2d_5.w_0.
2022-02-19 03:59:47,041-INFO: change groups from 128 to 99 for conv2d_7.w_0.
2022-02-19 03:59:47,045-INFO: change groups from 256 to 224 for conv2d_9.w_0.
2022-02-19 03:59:47,070-INFO: change groups from 256 to 174 for conv2d_11.w_0.
2022-02-19 03:59:47,076-INFO: change groups from 512 to 432 for conv2d_13.w_0.
2022-02-19 03:59:47,085-INFO: change groups from 512 to 417 for conv2d_15.w_0.
2022-02-19 03:59:47,093-INFO: change groups from 512 to 440 for conv2d_17.w_0.
2022-02-19 03:59:47,103-INFO: change groups from 512 to 438 for conv2d_23.w_0.
2022-02-19 03:59:47,116-INFO: change groups from 1024 to 693 for conv2d_25.w_0.
FLOPs after pruning: 8253581.0
Pruned FLOPs: 30.01%
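(As a sanity check, the FLOPs of the pruned network can be double-checked with Paddle's built-in counter; a sketch, assuming `net` is the pruned model:)

```python
import paddle

# Count the operations of the pruned network for a single 224x224 image;
# compare against the value logged above.
flops = paddle.flops(net, [1, 3, 224, 224], print_detail=False)
print(f"FLOPs after pruning: {flops}")
```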

Here is the summary after pruning:

    Layer (type)          Input Shape          Output Shape         Param #    
---------------------------------------------------------------------------------

      Conv2D-1         [[1, 3, 224, 224]]   [1, 16, 112, 112]         432      
     BatchNorm-1      [[1, 16, 112, 112]]   [1, 16, 112, 112]         64       
       ReLU-5         [[1, 16, 112, 112]]   [1, 16, 112, 112]          0       
    ConvBNLayer-1      [[1, 3, 224, 224]]   [1, 16, 112, 112]          0       
      Conv2D-2        [[1, 16, 112, 112]]   [1, 16, 112, 112]         144      
     BatchNorm-2      [[1, 16, 112, 112]]   [1, 16, 112, 112]         64       
       ReLU-6         [[1, 16, 112, 112]]   [1, 16, 112, 112]          0       
    ConvBNLayer-2     [[1, 16, 112, 112]]   [1, 16, 112, 112]          0       
      Conv2D-3        [[1, 16, 112, 112]]   [1, 46, 112, 112]         736      
     BatchNorm-3      [[1, 46, 112, 112]]   [1, 46, 112, 112]         184      
       ReLU-7         [[1, 46, 112, 112]]   [1, 46, 112, 112]          0       
    ConvBNLayer-3     [[1, 16, 112, 112]]   [1, 46, 112, 112]          0
DepthwiseSeparable-1  [[1, 16, 112, 112]]   [1, 46, 112, 112]          0       
      Conv2D-4        [[1, 46, 112, 112]]    [1, 46, 56, 56]          414      
     BatchNorm-4       [[1, 46, 56, 56]]     [1, 46, 56, 56]          184      
       ReLU-8          [[1, 46, 56, 56]]     [1, 46, 56, 56]           0       
    ConvBNLayer-4     [[1, 46, 112, 112]]    [1, 46, 56, 56]           0       
      Conv2D-5         [[1, 46, 56, 56]]     [1, 101, 56, 56]        4,646     
     BatchNorm-5       [[1, 101, 56, 56]]    [1, 101, 56, 56]         404      
       ReLU-9          [[1, 101, 56, 56]]    [1, 101, 56, 56]          0       
    ConvBNLayer-5      [[1, 46, 56, 56]]     [1, 101, 56, 56]          0       
DepthwiseSeparable-2  [[1, 46, 112, 112]]    [1, 101, 56, 56]          0       
      Conv2D-6         [[1, 101, 56, 56]]    [1, 101, 56, 56]         909      
     BatchNorm-6       [[1, 101, 56, 56]]    [1, 101, 56, 56]         404      
       ReLU-10         [[1, 101, 56, 56]]    [1, 101, 56, 56]          0       
    ConvBNLayer-6      [[1, 101, 56, 56]]    [1, 101, 56, 56]          0       
      Conv2D-7         [[1, 101, 56, 56]]    [1, 99, 56, 56]         9,999     
     BatchNorm-7       [[1, 99, 56, 56]]     [1, 99, 56, 56]          396      
       ReLU-11         [[1, 99, 56, 56]]     [1, 99, 56, 56]           0       
    ConvBNLayer-7      [[1, 101, 56, 56]]    [1, 99, 56, 56]           0       
DepthwiseSeparable-3   [[1, 101, 56, 56]]    [1, 99, 56, 56]           0       
      Conv2D-8         [[1, 99, 56, 56]]     [1, 99, 28, 28]          891      
     BatchNorm-8       [[1, 99, 28, 28]]     [1, 99, 28, 28]          396      
       ReLU-12         [[1, 99, 28, 28]]     [1, 99, 28, 28]           0       
    ConvBNLayer-8      [[1, 99, 56, 56]]     [1, 99, 28, 28]           0       
      Conv2D-9         [[1, 99, 28, 28]]     [1, 224, 28, 28]       22,176     
     BatchNorm-9       [[1, 224, 28, 28]]    [1, 224, 28, 28]         896      
       ReLU-13         [[1, 224, 28, 28]]    [1, 224, 28, 28]          0       
    ConvBNLayer-9      [[1, 99, 28, 28]]     [1, 224, 28, 28]          0       
DepthwiseSeparable-4   [[1, 99, 56, 56]]     [1, 224, 28, 28]          0       
      Conv2D-10        [[1, 224, 28, 28]]    [1, 224, 28, 28]        2,016     
    BatchNorm-10       [[1, 224, 28, 28]]    [1, 224, 28, 28]         896      
       ReLU-14         [[1, 224, 28, 28]]    [1, 224, 28, 28]          0       
   ConvBNLayer-10      [[1, 224, 28, 28]]    [1, 224, 28, 28]          0       
      Conv2D-11        [[1, 224, 28, 28]]    [1, 174, 28, 28]       38,976     
    BatchNorm-11       [[1, 174, 28, 28]]    [1, 174, 28, 28]         696      
       ReLU-15         [[1, 174, 28, 28]]    [1, 174, 28, 28]          0       
   ConvBNLayer-11      [[1, 224, 28, 28]]    [1, 174, 28, 28]          0       
DepthwiseSeparable-5   [[1, 224, 28, 28]]    [1, 174, 28, 28]          0       
      Conv2D-12        [[1, 174, 28, 28]]    [1, 174, 14, 14]        1,566     
    BatchNorm-12       [[1, 174, 14, 14]]    [1, 174, 14, 14]         696      
       ReLU-16         [[1, 174, 14, 14]]    [1, 174, 14, 14]          0       
   ConvBNLayer-12      [[1, 174, 28, 28]]    [1, 174, 14, 14]          0       
      Conv2D-13        [[1, 174, 14, 14]]    [1, 432, 14, 14]       75,168     
    BatchNorm-13       [[1, 432, 14, 14]]    [1, 432, 14, 14]        1,728     
       ReLU-17         [[1, 432, 14, 14]]    [1, 432, 14, 14]          0       
   ConvBNLayer-13      [[1, 174, 14, 14]]    [1, 432, 14, 14]          0       
DepthwiseSeparable-6   [[1, 174, 28, 28]]    [1, 432, 14, 14]          0       
      Conv2D-14        [[1, 432, 14, 14]]    [1, 432, 14, 14]        3,888     
    BatchNorm-14       [[1, 432, 14, 14]]    [1, 432, 14, 14]        1,728     
       ReLU-18         [[1, 432, 14, 14]]    [1, 432, 14, 14]          0       
   ConvBNLayer-14      [[1, 432, 14, 14]]    [1, 432, 14, 14]          0       
      Conv2D-15        [[1, 432, 14, 14]]    [1, 417, 14, 14]       180,144    
    BatchNorm-15       [[1, 417, 14, 14]]    [1, 417, 14, 14]        1,668     
       ReLU-19         [[1, 417, 14, 14]]    [1, 417, 14, 14]          0       
   ConvBNLayer-15      [[1, 432, 14, 14]]    [1, 417, 14, 14]          0       
DepthwiseSeparable-7   [[1, 432, 14, 14]]    [1, 417, 14, 14]          0       
      Conv2D-16        [[1, 417, 14, 14]]    [1, 417, 14, 14]        3,753     
    BatchNorm-16       [[1, 417, 14, 14]]    [1, 417, 14, 14]        1,668     
       ReLU-20         [[1, 417, 14, 14]]    [1, 417, 14, 14]          0       
   ConvBNLayer-16      [[1, 417, 14, 14]]    [1, 417, 14, 14]          0       
      Conv2D-17        [[1, 417, 14, 14]]    [1, 440, 14, 14]       183,480    
    BatchNorm-17       [[1, 440, 14, 14]]    [1, 440, 14, 14]        1,760     
       ReLU-21         [[1, 440, 14, 14]]    [1, 440, 14, 14]          0       
   ConvBNLayer-17      [[1, 417, 14, 14]]    [1, 440, 14, 14]          0       
DepthwiseSeparable-8   [[1, 417, 14, 14]]    [1, 440, 14, 14]          0       
      Conv2D-18        [[1, 440, 14, 14]]    [1, 440, 14, 14]        3,960     
    BatchNorm-18       [[1, 440, 14, 14]]    [1, 440, 14, 14]        1,760     
       ReLU-22         [[1, 440, 14, 14]]    [1, 440, 14, 14]          0       
   ConvBNLayer-18      [[1, 440, 14, 14]]    [1, 440, 14, 14]          0       
      Conv2D-19        [[1, 440, 14, 14]]    [1, 512, 14, 14]       225,280    
    BatchNorm-19       [[1, 512, 14, 14]]    [1, 512, 14, 14]        2,048     
       ReLU-23         [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
   ConvBNLayer-19      [[1, 440, 14, 14]]    [1, 512, 14, 14]          0       
DepthwiseSeparable-9   [[1, 440, 14, 14]]    [1, 512, 14, 14]          0       
      Conv2D-20        [[1, 512, 14, 14]]    [1, 512, 14, 14]        4,608     
    BatchNorm-20       [[1, 512, 14, 14]]    [1, 512, 14, 14]        2,048     
       ReLU-24         [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
   ConvBNLayer-20      [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
      Conv2D-21        [[1, 512, 14, 14]]    [1, 512, 14, 14]       262,144    
    BatchNorm-21       [[1, 512, 14, 14]]    [1, 512, 14, 14]        2,048     
       ReLU-25         [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
   ConvBNLayer-21      [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
DepthwiseSeparable-10  [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
      Conv2D-22        [[1, 512, 14, 14]]    [1, 512, 14, 14]        4,608     
    BatchNorm-22       [[1, 512, 14, 14]]    [1, 512, 14, 14]        2,048     
       ReLU-26         [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
   ConvBNLayer-22      [[1, 512, 14, 14]]    [1, 512, 14, 14]          0       
      Conv2D-23        [[1, 512, 14, 14]]    [1, 438, 14, 14]       224,256    
    BatchNorm-23       [[1, 438, 14, 14]]    [1, 438, 14, 14]        1,752     
       ReLU-27         [[1, 438, 14, 14]]    [1, 438, 14, 14]          0       
   ConvBNLayer-23      [[1, 512, 14, 14]]    [1, 438, 14, 14]          0       
DepthwiseSeparable-11  [[1, 512, 14, 14]]    [1, 438, 14, 14]          0       
      Conv2D-24        [[1, 438, 14, 14]]     [1, 438, 7, 7]         3,942     
    BatchNorm-24        [[1, 438, 7, 7]]      [1, 438, 7, 7]         1,752     
       ReLU-28          [[1, 438, 7, 7]]      [1, 438, 7, 7]           0       
   ConvBNLayer-24      [[1, 438, 14, 14]]     [1, 438, 7, 7]           0       
      Conv2D-25         [[1, 438, 7, 7]]      [1, 693, 7, 7]        303,534    
    BatchNorm-25        [[1, 693, 7, 7]]      [1, 693, 7, 7]         2,772     
       ReLU-29          [[1, 693, 7, 7]]      [1, 693, 7, 7]           0       
   ConvBNLayer-25       [[1, 438, 7, 7]]      [1, 693, 7, 7]           0       
DepthwiseSeparable-12  [[1, 438, 14, 14]]     [1, 693, 7, 7]           0       
      Conv2D-26         [[1, 693, 7, 7]]      [1, 693, 7, 7]         6,237     
    BatchNorm-26        [[1, 693, 7, 7]]      [1, 693, 7, 7]         2,772     
       ReLU-30          [[1, 693, 7, 7]]      [1, 693, 7, 7]           0       
   ConvBNLayer-26       [[1, 693, 7, 7]]      [1, 693, 7, 7]           0       
      Conv2D-27         [[1, 693, 7, 7]]     [1, 1024, 7, 7]        709,632    
    BatchNorm-27       [[1, 1024, 7, 7]]     [1, 1024, 7, 7]         4,096     
       ReLU-31         [[1, 1024, 7, 7]]     [1, 1024, 7, 7]           0       
   ConvBNLayer-27       [[1, 693, 7, 7]]     [1, 1024, 7, 7]           0       
DepthwiseSeparable-13   [[1, 693, 7, 7]]     [1, 1024, 7, 7]           0       
 AdaptiveAvgPool2D-1   [[1, 1024, 7, 7]]     [1, 1024, 1, 1]           0       
      Flatten-1        [[1, 1024, 1, 1]]        [1, 1024]              0       
      Linear-1            [[1, 1024]]           [1, 1000]          1,025,000   

Total params: 3,339,467
Trainable params: 3,302,539
Non-trainable params: 36,928
---------------------------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 132.26
Params size (MB): 12.74
Estimated Total Size (MB): 145.57
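
(A per-layer table like the one above is what `paddle.summary` prints; a minimal sketch, assuming `net` is the pruned model:)

```python
import paddle

# Prints per-layer input/output shapes and parameter counts in the
# same format as the table above.
paddle.summary(net, (1, 3, 224, 224))
```
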
wanghaoshuang commented 2 years ago

> Only the 50%-pruned model is faster than the original; the 30%-pruned model is actually slower. Why is that? Thanks.

This is because cuBLAS (or cuBLASLt), the NVIDIA GPU compute library that Paddle Inference calls into, applies deep, shape-specific optimizations to matrix multiplication and convolution. For example, if there is a special optimization O for A=[1024, 1024], B=[1024, 256], then after pruning A to [700, 1024] the computation may no longer hit optimization O, and performance can end up worse than before pruning.

You can try the following two approaches to largely avoid this problem:

  1. During sensitivity analysis, enable the `align` option of `sensitive_prune` so that the pruned channel counts are multiples of 8 or 16;
  2. Starting from the set of prune ratios you currently have, fine-tune the ratios so that the pruned channel counts are multiples of 8 or 16 (see the sketch after this comment).

In short, getting inference speedups from pruning on NVIDIA GPUs is indeed harder than on Intel CPUs and ARM CPUs, and takes some extra work.
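A sketch of the second approach: given a layer's original channel count and a target prune ratio, nudge the ratio so the kept channel count lands on a multiple of 8 (the function and variable names here are illustrative, not PaddleSlim API):

```python
def aligned_prune_ratio(orig_channels, target_ratio, align=8):
    """Adjust a per-layer prune ratio so the kept channel count is a
    multiple of `align` (8 or 16 gives GPU-friendlier GEMM shapes)."""
    kept = orig_channels * (1.0 - target_ratio)
    kept = max(align, int(round(kept / align)) * align)  # snap to grid
    kept = min(kept, orig_channels)                      # never grow the layer
    return 1.0 - kept / orig_channels

# Example: the log above cut conv2d_25 from 1024 to 693 channels;
# aligning keeps 696 instead (a multiple of 8).
print(aligned_prune_ratio(1024, 1.0 - 693 / 1024))  # 0.3203125 -> keeps 696
```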

Water2style commented 2 years ago

> > Only the 50%-pruned model is faster than the original; the 30%-pruned model is actually slower. Why is that? Thanks.
>
> This is because cuBLAS (or cuBLASLt), the NVIDIA GPU compute library that Paddle Inference calls into, applies deep, shape-specific optimizations to matrix multiplication and convolution. For example, if there is a special optimization O for A=[1024, 1024], B=[1024, 256], then after pruning A to [700, 1024] the computation may no longer hit optimization O, and performance can end up worse than before pruning.
>
> You can try the following two approaches to largely avoid this problem:
>
>   1. During sensitivity analysis, enable the `align` option of `sensitive_prune` so that the pruned channel counts are multiples of 8 or 16;
>   2. Starting from the set of prune ratios you currently have, fine-tune the ratios so that the pruned channel counts are multiples of 8 or 16.
>
> In short, getting inference speedups from pruning on NVIDIA GPUs is indeed harder than on Intel CPUs and ARM CPUs, and takes some extra work.

Got it, thanks! I'll keep experimenting with this.