Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Vulkan + Raspberry Pi 4 disappointing #2435

Open Qengineering opened 3 years ago

Qengineering commented 3 years ago

I have done some testing with the latest Vulkan drivers on a Raspberry Pi 4 (64-bit OS). Even knowing the driver is still under construction, the results were a great disappointment. There is no acceleration at all; it is even 5 times slower than without Vulkan support. Just to let you know.

Native build on Raspberry Pi 4 64-bit OS, 1500 MHz, 2 GB RAM, 128 MB GPU RAM.

Without Vulkan

cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake -DCMAKE_BUILD_TYPE=Release ..

loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =   63.55  max =   70.05  avg =   65.99
     squeezenet_int8  min =   65.72  max =   66.05  avg =   65.84
           mobilenet  min =   71.39  max =   72.78  avg =   71.86
      mobilenet_int8  min =   97.65  max =  129.53  avg =  109.87
        mobilenet_v2  min =   71.24  max =   73.68  avg =   72.20
        mobilenet_v3  min =   55.79  max =   56.13  avg =   55.93
          shufflenet  min =   39.25  max =   40.74  avg =   40.06
       shufflenet_v2  min =   28.75  max =   29.28  avg =   29.06
             mnasnet  min =   60.31  max =   61.11  avg =   60.74
     proxylessnasnet  min =   62.64  max =   77.77  avg =   69.12
     efficientnet_b0  min =   93.49  max =   94.29  avg =   93.88
        regnety_400m  min =   76.97  max =   78.11  avg =   77.55
           blazeface  min =   13.02  max =   13.26  avg =   13.17
           googlenet  min =  168.00  max =  190.48  avg =  174.87
      googlenet_int8  min =  147.13  max =  207.46  avg =  162.40
            resnet18  min =  222.98  max =  231.52  avg =  225.69
       resnet18_int8  min =  133.61  max =  145.16  avg =  136.70
             alexnet  min =  169.34  max =  174.96  avg =  171.05
               vgg16  min =  910.35  max =  956.36  avg =  930.93
          vgg16_int8  min = 1242.82  max = 1309.72  avg = 1282.35
            resnet50  min =  408.64  max =  425.08  avg =  414.09
       resnet50_int8  min =  288.59  max =  291.54  avg =  290.26
      squeezenet_ssd  min =  181.44  max =  182.54  avg =  182.12
 squeezenet_ssd_int8  min =  185.94  max =  187.68  avg =  186.83
       mobilenet_ssd  min =  143.34  max =  143.58  avg =  143.43
  mobilenet_ssd_int8  min =  156.11  max =  157.47  avg =  156.51
      mobilenet_yolo  min =  322.27  max =  351.88  avg =  331.43
  mobilenetv2_yolov3  min =  218.17  max =  219.72  avg =  218.87
         yolov4-tiny  min =  313.92  max =  326.21  avg =  317.19

With Vulkan

cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake -DNCNN_VULKAN=ON -DCMAKE_BUILD_TYPE=Release ..

pi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 4 4 0 0
[0 V3D 4.2] queueC=0[1] queueG=0[1] queueT=0[1]
[0 V3D 4.2] bugsbn1=0 bugcopc=0 bugihfa=0
[0 V3D 4.2] fp16p=1 fp16s=0 fp16a=0 int8s=0 int8a=0
[0 V3D 4.2] subgroup=3291716400 basic=0 vote=0 ballot=1 shuffle=0
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =  346.48  max =  347.38  avg =  346.78
     squeezenet_int8  min =   64.67  max =   65.59  avg =   65.18
           mobilenet  min =  556.58  max =  559.89  avg =  558.20
      mobilenet_int8  min =   91.91  max =   93.89  avg =   92.65
        mobilenet_v2  min =  381.82  max =  382.65  avg =  382.22
        mobilenet_v3  min =  342.35  max =  343.42  avg =  342.73
          shufflenet  min =  409.33  max =  410.00  avg =  409.59
       shufflenet_v2  min =  302.26  max =  305.00  avg =  304.08
             mnasnet  min =  397.13  max =  397.64  avg =  397.31
     proxylessnasnet  min =  413.21  max =  413.79  avg =  413.57
     efficientnet_b0  min =  559.96  max =  560.99  avg =  560.32
        regnety_400m  min =  482.12  max =  483.13  avg =  482.59
           blazeface  min =   76.94  max =   77.10  avg =   77.01
           googlenet  min = 1121.36  max = 1126.17  avg = 1124.12
      googlenet_int8  min =  150.09  max =  150.63  avg =  150.30
            resnet18  min = 1084.91  max = 1086.17  avg = 1085.51
       resnet18_int8  min =  143.80  max =  152.30  avg =  146.06
             alexnet  min = 2002.00  max = 2121.92  avg = 2059.23
               vgg16  min = 7205.38  max = 7257.74  avg = 7226.90
          vgg16_int8  min = 1245.08  max = 1273.66  avg = 1263.44
            resnet50  min = 3306.48  max = 3322.29  avg = 3311.11
       resnet50_int8  min =  296.10  max =  297.80  avg =  296.92
      squeezenet_ssd  min = 1717.27  max = 1721.34  avg = 1719.36
 squeezenet_ssd_int8  min =  197.46  max =  205.67  avg =  202.49
       mobilenet_ssd  min = 1396.28  max = 1401.41  avg = 1399.47
  mobilenet_ssd_int8  min =  152.84  max =  153.95  avg =  153.55
      mobilenet_yolo  min = 3071.84  max = 3073.80  avg = 3072.84
  mobilenetv2_yolov3  min = 1370.07  max = 1370.98  avg = 1370.47
         yolov4-tiny  min = 2241.63  max = 2242.32  avg = 2241.93
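To put a number on "5 times slower", the two Raspberry Pi 4 runs can be compared model by model. The following is a minimal sketch, not part of ncnn: `parse_avg` and `slowdown` are hypothetical helper names, and the embedded logs are a few lines copied from the runs above. It assumes every result line has the `min = … max = … avg = …` shape shown in this thread.

```python
import re

# One benchncnn result line: "<model>  min = a  max = b  avg = c" (times in ms).
LINE_RE = re.compile(
    r"^\s*(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)\s*$"
)

def parse_avg(log: str) -> dict[str, float]:
    """Return {model: avg_ms}; lines that are not result lines are skipped."""
    result = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            result[match.group(1)] = float(match.group(4))
    return result

def slowdown(cpu_log: str, vulkan_log: str) -> dict[str, float]:
    """Ratio vulkan_avg / cpu_avg per model; values above 1 mean Vulkan is slower."""
    cpu = parse_avg(cpu_log)
    gpu = parse_avg(vulkan_log)
    return {name: gpu[name] / cpu[name] for name in cpu if name in gpu}

# Excerpts copied from the Raspberry Pi 4 runs above.
CPU_LOG = """\
gpu_device = -1
          squeezenet  min =   63.55  max =   70.05  avg =   65.99
     squeezenet_int8  min =   65.72  max =   66.05  avg =   65.84
           mobilenet  min =   71.39  max =   72.78  avg =   71.86
            resnet18  min =  222.98  max =  231.52  avg =  225.69
"""
VULKAN_LOG = """\
gpu_device = 0
          squeezenet  min =  346.48  max =  347.38  avg =  346.78
     squeezenet_int8  min =   64.67  max =   65.59  avg =   65.18
           mobilenet  min =  556.58  max =  559.89  avg =  558.20
            resnet18  min = 1084.91  max = 1086.17  avg = 1085.51
"""

if __name__ == "__main__":
    for name, ratio in slowdown(CPU_LOG, VULKAN_LOG).items():
        print(f"{name:>20}  {ratio:5.2f}x")
```

On these excerpts the fp32 models come out roughly 5 to 8 times slower with Vulkan, while `squeezenet_int8` is essentially unchanged. That would be consistent with the int8 models not taking the GPU path in either run, though nothing in the thread confirms it.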

I did the same test with a Jetson Nano and, surprise surprise, the Vulkan acceleration works excellently!

Native build on Jetson Nano, 2014 MHz, 4 GB RAM.

Without Vulkan

cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/jetson.toolchain.cmake -DNCNN_VULKAN=ON -DCMAKE_BUILD_TYPE=Release ..

loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =   53.27  max =   65.17  avg =   57.00
     squeezenet_int8  min =   28.36  max =   29.96  avg =   28.95
           mobilenet  min =   32.56  max =   32.71  avg =   32.63
      mobilenet_int8  min =   44.90  max =   45.64  avg =   45.30
        mobilenet_v2  min =   26.76  max =   26.94  avg =   26.85
        mobilenet_v3  min =   24.14  max =   27.46  avg =   25.31
          shufflenet  min =   19.61  max =   35.16  avg =   27.86
       shufflenet_v2  min =   17.97  max =   99.59  avg =   44.58
             mnasnet  min =   25.50  max =   43.91  avg =   34.78
     proxylessnasnet  min =   29.56  max =   36.27  avg =   32.65
     efficientnet_b0  min =   54.38  max =  182.29  avg =   90.53
        regnety_400m  min =   43.64  max =   46.23  avg =   45.26
           blazeface  min =    6.11  max =    6.46  avg =    6.28
           googlenet  min =   83.42  max =   88.92  avg =   85.36
      googlenet_int8  min =   94.54  max =  123.76  avg =  102.77
            resnet18  min =   92.82  max =  166.32  avg =  128.70
       resnet18_int8  min =   90.29  max =  100.16  avg =   94.18
             alexnet  min =  139.70  max =  160.68  avg =  147.90
               vgg16  min =  464.18  max =  687.42  avg =  548.92
          vgg16_int8  min =  715.58  max =  809.26  avg =  748.51
            resnet50  min =  192.21  max =  311.36  avg =  226.71
       resnet50_int8  min =  181.12  max =  235.10  avg =  206.01
      squeezenet_ssd  min =   77.15  max =  103.62  avg =   85.95
 squeezenet_ssd_int8  min =   88.66  max =  157.42  avg =  118.41
       mobilenet_ssd  min =   73.25  max =  162.26  avg =  103.62
  mobilenet_ssd_int8  min =   81.04  max =  186.65  avg =  126.86
      mobilenet_yolo  min =  161.90  max =  255.14  avg =  199.35
  mobilenetv2_yolov3  min =   96.22  max =  166.11  avg =  130.65
         yolov4-tiny  min =  140.02  max =  235.53  avg =  169.60

With Vulkan

jetson@nano:~/ncnn/benchmark $ ./benchncnn 4 4 0 0
[0 NVIDIA Tegra X1 (nvgpu)] queueC=0[16] queueG=0[16] queueT=0[16]
[0 NVIDIA Tegra X1 (nvgpu)] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 NVIDIA Tegra X1 (nvgpu)] fp16p=1 fp16s=1 fp16a=1 int8s=1 int8a=1
[0 NVIDIA Tegra X1 (nvgpu)] subgroup=32 basic=1 vote=1 ballot=1 shuffle=1
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =   13.92  max =   14.07  avg =   14.01
     squeezenet_int8  min =   28.52  max =  104.22  avg =   60.80
           mobilenet  min =   15.84  max =   16.12  avg =   15.95
      mobilenet_int8  min =   46.22  max =  136.99  avg =   70.47
        mobilenet_v2  min =   24.48  max =   30.38  avg =   27.81
        mobilenet_v3  min =   14.29  max =   22.23  avg =   19.77
          shufflenet  min =   13.96  max =   14.79  avg =   14.40
       shufflenet_v2  min =   23.82  max =   24.62  avg =   24.12
             mnasnet  min =   18.31  max =   22.95  avg =   19.70
     proxylessnasnet  min =   14.27  max =   14.87  avg =   14.52
     efficientnet_b0  min =   31.57  max =   33.03  avg =   32.29
        regnety_400m  min =   17.01  max =   26.45  avg =   22.24
           blazeface  min =    7.38  max =    9.52  avg =    8.47
           googlenet  min =   41.30  max =   46.55  avg =   43.79
      googlenet_int8  min =   95.72  max =  191.92  avg =  120.60
            resnet18  min =   44.35  max =   46.20  avg =   45.00
       resnet18_int8  min =   90.24  max =  112.45  avg =   96.23
             alexnet  min =   72.30  max =   74.79  avg =   73.68
               vgg16  min =  295.22  max =  298.62  avg =  296.83
          vgg16_int8  min =  727.67  max =  762.81  avg =  739.81
            resnet50  min =   88.11  max =   94.38  avg =   92.03
       resnet50_int8  min =  183.78  max =  288.11  avg =  217.60
      squeezenet_ssd  min =   53.65  max =   63.92  avg =   57.87
 squeezenet_ssd_int8  min =   88.88  max =  193.78  avg =  120.06
       mobilenet_ssd  min =   36.03  max =   40.12  avg =   37.50
  mobilenet_ssd_int8  min =   78.67  max =  188.17  avg =  106.49
      mobilenet_yolo  min =   74.52  max =   80.38  avg =   76.88
  mobilenetv2_yolov3  min =   48.19  max =   51.39  avg =   49.80
         yolov4-tiny  min =   88.64  max =   96.50  avg =   92.92
nvdc: start nvdcEventThread
nvdc: exit nvdcEventThread

nihui commented 3 years ago

Thanks! I think it is because the Vulkan driver is still not mature enough and does not do a good job of shader optimization.

zylo117 commented 3 years ago

The Raspberry Pi 4B GPU (VideoCore VI) has only 32 GFLOPS, while the Jetson Nano reaches 472 GFLOPS, way better than the Raspberry Pi 4B. Of course it's disappointing. https://www.cpu-monkey.com/en/igpu-broadcom_videocore_vi-221 https://developer.nvidia.com/embedded/jetson-modules

Qengineering commented 3 years ago

@zylo117 You're missing the point here. It is not a comparison between RPi and Nano. The RPi has a lower FPS with Vulkan than without. To make sure that the Vulkan mechanism is working properly, I ran the same test on the Nano. As you would expect, the FPS there is much higher with Vulkan than without. So the algorithm works well, but the RPi lacks good drivers so far, as @nihui also indicates.

zylo117 commented 3 years ago

But it's hard to tell whether it's because of the immature driver or due to the poor performance of the GPU.

If it's the latter, which can be observed from the results you posted and their FLOPS gap, Vulkan can't help make the inference any faster.
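The FLOPS argument can be checked against the numbers already posted. The sketch below is only a rough comparison: it takes the 32 and 472 GFLOPS figures quoted in this thread at face value and sets their ratio against the observed Vulkan timings of the two boards. The function names are hypothetical, and the timings are the `avg` values from the benchmarks above.

```python
# Peak GPU throughput as quoted in the thread (GFLOPS).
PI4_GFLOPS = 32.0
NANO_GFLOPS = 472.0

# avg inference time in ms with Vulkan, copied from the benchmarks above:
# model -> (Raspberry Pi 4, Jetson Nano)
VULKAN_AVG_MS = {
    "squeezenet": (346.78, 14.01),
    "mobilenet": (558.20, 15.95),
    "resnet18": (1085.51, 45.00),
    "vgg16": (7226.90, 296.83),
    "resnet50": (3311.11, 92.03),
}

def peak_ratio() -> float:
    """How many times more peak FLOPS the Nano has than the Pi 4."""
    return NANO_GFLOPS / PI4_GFLOPS

def observed_ratios(table: dict[str, tuple[float, float]]) -> dict[str, float]:
    """How many times longer the Pi 4 takes than the Nano, per model."""
    return {name: pi4_ms / nano_ms for name, (pi4_ms, nano_ms) in table.items()}

if __name__ == "__main__":
    print(f"peak FLOPS ratio: {peak_ratio():.2f}x")
    for name, ratio in observed_ratios(VULKAN_AVG_MS).items():
        print(f"{name:>12}: Pi 4 is {ratio:5.1f}x slower than the Nano")
```

For these five models the Pi 4 is roughly 24 to 36 times slower than the Nano, which is more than the 14.75x gap in peak FLOPS. That hints that raw throughput is not the whole story, but it does not settle the question: the Nano log reports fp16 arithmetic support while the Pi 4 log does not, and memory bandwidth is not accounted for at all.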

Mek101 commented 11 months ago

Has the situation changed?

Qengineering commented 11 months ago

No, not yet. As long as the Vulkan drivers for the Raspberry Pi lack support for 16-bit floating point or 8-bit integers, it won't be faster than a CPU-only version.
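The capability lines that benchncnn prints at startup show what the driver reports. As a sketch, the hypothetical helper `parse_flags` below reads such a line in either of the two formats that appear in this thread (`fp16p=1 fp16s=0 …` in the older logs, `fp16-p/s/u/a=1/1/1/0` in the Raspberry Pi 5 log further down). Reading the `a` suffix as "arithmetic" and `s` as "storage" is my understanding of ncnn's naming, not something stated in the thread.

```python
import re

# Newer format: "fp16-p/s/u/a=1/1/1/0" groups several flags under one prefix.
GROUPED_RE = re.compile(r"(\w+)-([a-z](?:/[a-z])*)=(\d(?:/\d)*)")
# Older format: "fp16p=1". The lookbehind skips the tail of a grouped entry.
PLAIN_RE = re.compile(r"(?<![\w/-])(\w+)=(\d+)")

def parse_flags(line: str) -> dict[str, bool]:
    """Parse the name=value flags from one benchncnn device line."""
    body = line.split("]", 1)[-1]  # drop the "[index device-name]" prefix
    flags = {}
    for prefix, names, values in GROUPED_RE.findall(body):
        for name, value in zip(names.split("/"), values.split("/")):
            flags[prefix + name] = value != "0"
    for name, value in PLAIN_RE.findall(body):
        flags[name] = value != "0"
    return flags

def arithmetic_support(line: str) -> dict[str, bool]:
    """Report only the fp16/int8 arithmetic flags (False when absent)."""
    flags = parse_flags(line)
    return {key: flags.get(key, False) for key in ("fp16a", "int8a")}

# Lines copied from the logs in this thread.
PI4_LINE = "[0 V3D 4.2] fp16p=1 fp16s=0 fp16a=0 int8s=0 int8a=0"
NANO_LINE = "[0 NVIDIA Tegra X1 (nvgpu)] fp16p=1 fp16s=1 fp16a=1 int8s=1 int8a=1"
PI5_LINE = "[0 V3D 7.1.7]  fp16-p/s/u/a=1/1/1/0  int8-p/s/u/a=1/1/1/0"

if __name__ == "__main__":
    for label, line in (("Pi 4", PI4_LINE), ("Nano", NANO_LINE), ("Pi 5", PI5_LINE)):
        print(label, arithmetic_support(line))
```

Run on the three lines from this thread, only the Jetson Nano reports both arithmetic flags as set; the Pi 4 and Pi 5 lines both report them as 0.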

zhengpeirong commented 1 month ago

Raspberry Pi OS Now Shipping With Vulkan Support By Default.

Has the situation changed?

Qengineering commented 1 month ago

Sadly not. Despite the incorporated Vulkan engine, you still get poor results. See for yourself.

pi@raspberrypi:~/ncnn/benchmark $ hostnamectl
 Static hostname: raspberrypi
       Icon name: computer
      Machine ID: 072da82a1b314b32824f766429af0208
         Boot ID: 9f0761b989fb405099fa9c28c8443253
Operating System: Debian GNU/Linux 12 (bookworm)  
          Kernel: Linux 6.6.28+rpt-rpi-2712
    Architecture: arm64

pi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 10 4 0 0 -1 >> text.out
[0 V3D 7.1.7]  queueC=0[1]  queueG=0[1]  queueT=0[1]
[0 V3D 7.1.7]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 V3D 7.1.7]  fp16-p/s/u/a=1/1/1/0  int8-p/s/u/a=1/1/1/0
[0 V3D 7.1.7]  subgroup=16  basic/vote/ballot/shuffle=1/0/0/0
[0 V3D 7.1.7]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0
[1 llvmpipe (LLVM 15.0.6, 128 bits)]  queueC=0[1]  queueG=0[1]  queueT=0[1]
[1 llvmpipe (LLVM 15.0.6, 128 bits)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[1 llvmpipe (LLVM 15.0.6, 128 bits)]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1
[1 llvmpipe (LLVM 15.0.6, 128 bits)]  subgroup=4  basic/vote/ballot/shuffle=1/1/1/1
[1 llvmpipe (LLVM 15.0.6, 128 bits)]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0
loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =  123.29  max =  123.66  avg =  123.38
     squeezenet_int8  min =    8.95  max =   10.09  avg =    9.26
           mobilenet  min =  169.60  max =  169.98  avg =  169.70
      mobilenet_int8  min =   10.11  max =   10.51  avg =   10.33
        mobilenet_v2  min =  126.81  max =  127.42  avg =  126.98
        mobilenet_v3  min =  118.35  max =  118.57  avg =  118.44
          shufflenet  min =   69.42  max =   70.19  avg =   69.73
       shufflenet_v2  min =   92.57  max =   92.76  avg =   92.63
             mnasnet  min =  122.23  max =  122.64  avg =  122.38
     proxylessnasnet  min =  124.49  max =  139.24  avg =  126.68
     efficientnet_b0  min =  195.96  max =  196.58  avg =  196.14
   efficientnetv2_b0  min =  269.41  max =  282.63  avg =  270.95
        regnety_400m  min =  148.02  max =  148.56  avg =  148.22
           blazeface  min =   25.97  max =   26.13  avg =   26.02
           googlenet  min =  344.31  max =  344.91  avg =  344.65
      googlenet_int8  min =   29.68  max =   30.26  avg =   30.04
            resnet18  min =  349.19  max =  349.74  avg =  349.42
       resnet18_int8  min =   20.66  max =   21.09  avg =   20.91
             alexnet  min =  231.89  max =  232.68  avg =  232.37
               vgg16  min = 1797.39  max = 1797.89  avg = 1797.62
          vgg16_int8  min =  117.45  max =  132.17  avg =  120.69
            resnet50  min =  866.06  max =  866.79  avg =  866.48
       resnet50_int8  min =   52.63  max =   66.31  avg =   54.28
      squeezenet_ssd  min =  454.37  max =  458.77  avg =  457.84
 squeezenet_ssd_int8  min =   32.36  max =   33.49  avg =   32.89
       mobilenet_ssd  min =  395.43  max =  398.47  avg =  397.07
  mobilenet_ssd_int8  min =   24.80  max =   25.68  avg =   25.26
      mobilenet_yolo  min =  814.49  max =  815.71  avg =  815.46
  mobilenetv2_yolov3  min =  417.61  max =  419.13  avg =  418.37
         yolov4-tiny  min =  679.58  max =  680.38  avg =  680.02
           nanodet_m  min =  203.55  max =  206.27  avg =  205.37
    yolo-fastest-1.1  min =  107.43  max =  108.05  avg =  107.62
      yolo-fastestv2  min =   80.27  max =   80.81  avg =   80.40
  vision_transformer  min = 21354.49  max = 21358.72  avg = 21355.78
          FastestDet  min =   84.86  max =   85.31  avg =   84.98

Measured on a Raspberry Pi 5.

Mek101 commented 1 month ago

This isn't GPU accelerated. llvmpipe is a software renderer.

Qengineering commented 1 month ago

This is the ncnn output when you build it with the flag -D NCNN_VULKAN=ON and the submodules loaded with git submodule update --depth=1 --init.