Open Qengineering opened 3 years ago
Thanks! I think it is because the Vulkan driver is still not mature enough and does not do a good job of shader optimization.
The Raspberry Pi 4B GPU (VideoCore VI) reaches only 32 GFLOPS, while the Jetson Nano reaches 472 GFLOPS, far more than the Raspberry Pi 4B. Of course it's disappointing. https://www.cpu-monkey.com/en/igpu-broadcom_videocore_vi-221 https://developer.nvidia.com/embedded/jetson-modules
@zylo117 You're missing the point here. It is not a comparison between RPi and Nano. The RPi has a lower FPS with Vulkan than without. To make sure that the Vulkan mechanism is working properly, the same test is performed with the Nano. As you expected, the FPS are now much higher with Vulkan than without. So the algorithm works well, but the RPi lacks good drivers so far, as @nihui also indicates.
But it's hard to tell whether it's because of an immature driver or due to the poor performance of the GPU.
If it's the latter, which can be observed from the results you posted and the FLOPS gap between the two GPUs, Vulkan can't make the inference any faster.
Has the situation changed?
No, not yet. As long as Vulkan drivers for the Raspberry Pi lack the 16-bit floating point or 8-bit integers, it won't be faster than a CPU-only version.
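The capability lines that benchncnn prints for each device can be checked programmatically. The following is a minimal sketch, assuming the two log layouts that appear in this thread (`fp16p=1 fp16s=0 ...` and `fp16-p/s/u/a=1/1/1/0`). The function name `parse_caps` is hypothetical, and reading the `p`/`s`/`a` suffixes as packed/storage/arithmetic is an assumption, not something stated in the thread.

```python
import re

def parse_caps(line):
    """Turn one benchncnn capability line into a {flag: bool} dict."""
    caps = {}
    # Drop the leading "[index device-name]" prefix.
    body = re.sub(r"^\[[^\]]*\]\s*", "", line.strip())
    for token in body.split():
        key, _, value = token.partition("=")
        if "/" in value:
            # Newer layout: "fp16-p/s/u/a=1/1/1/0" -> fp16p, fp16s, fp16u, fp16a
            prefix, _, suffixes = key.partition("-")
            for suffix, bit in zip(suffixes.split("/"), value.split("/")):
                caps[prefix + suffix] = bit == "1"
        else:
            # Older layout: "fp16p=1"
            caps[key] = value == "1"
    return caps

# Capability lines copied from the logs in this thread.
rpi4 = parse_caps("[0 V3D 4.2] fp16p=1 fp16s=0 fp16a=0 int8s=0 int8a=0")
nano = parse_caps("[0 NVIDIA Tegra X1 (nvgpu)] fp16p=1 fp16s=1 fp16a=1 int8s=1 int8a=1")
rpi5 = parse_caps("[0 V3D 7.1.7] fp16-p/s/u/a=1/1/1/0 int8-p/s/u/a=1/1/1/0")

for name, caps in (("rpi4", rpi4), ("nano", nano), ("rpi5", rpi5)):
    missing = sorted(flag for flag, ok in caps.items() if not ok)
    print(name, "missing:", missing)
```

On the lines quoted here, the Jetson Nano reports every flag as available, while both Raspberry Pi drivers report at least one fp16 or int8 flag as 0.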
Raspberry Pi OS Now Shipping With Vulkan Support By Default.
Has the situation changed?
Sadly not. Despite the incorporated Vulkan engine, you still get poor results. See for yourself.
pi@raspberrypi:~/ncnn/benchmark $ hostnamectl
Static hostname: raspberrypi
Icon name: computer
Machine ID: 072da82a1b314b32824f766429af0208
Boot ID: 9f0761b989fb405099fa9c28c8443253
Operating System: Debian GNU/Linux 12 (bookworm)
Kernel: Linux 6.6.28+rpt-rpi-2712
Architecture: arm64
pi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 10 4 0 0 -1 >> text.out
[0 V3D 7.1.7] queueC=0[1] queueG=0[1] queueT=0[1]
[0 V3D 7.1.7] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0
[0 V3D 7.1.7] fp16-p/s/u/a=1/1/1/0 int8-p/s/u/a=1/1/1/0
[0 V3D 7.1.7] subgroup=16 basic/vote/ballot/shuffle=1/0/0/0
[0 V3D 7.1.7] fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0
[1 llvmpipe (LLVM 15.0.6, 128 bits)] queueC=0[1] queueG=0[1] queueT=0[1]
[1 llvmpipe (LLVM 15.0.6, 128 bits)] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0
[1 llvmpipe (LLVM 15.0.6, 128 bits)] fp16-p/s/u/a=1/1/1/1 int8-p/s/u/a=1/1/1/1
[1 llvmpipe (LLVM 15.0.6, 128 bits)] subgroup=4 basic/vote/ballot/shuffle=1/1/1/1
[1 llvmpipe (LLVM 15.0.6, 128 bits)] fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0
loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
squeezenet min = 123.29 max = 123.66 avg = 123.38
squeezenet_int8 min = 8.95 max = 10.09 avg = 9.26
mobilenet min = 169.60 max = 169.98 avg = 169.70
mobilenet_int8 min = 10.11 max = 10.51 avg = 10.33
mobilenet_v2 min = 126.81 max = 127.42 avg = 126.98
mobilenet_v3 min = 118.35 max = 118.57 avg = 118.44
shufflenet min = 69.42 max = 70.19 avg = 69.73
shufflenet_v2 min = 92.57 max = 92.76 avg = 92.63
mnasnet min = 122.23 max = 122.64 avg = 122.38
proxylessnasnet min = 124.49 max = 139.24 avg = 126.68
efficientnet_b0 min = 195.96 max = 196.58 avg = 196.14
efficientnetv2_b0 min = 269.41 max = 282.63 avg = 270.95
regnety_400m min = 148.02 max = 148.56 avg = 148.22
blazeface min = 25.97 max = 26.13 avg = 26.02
googlenet min = 344.31 max = 344.91 avg = 344.65
googlenet_int8 min = 29.68 max = 30.26 avg = 30.04
resnet18 min = 349.19 max = 349.74 avg = 349.42
resnet18_int8 min = 20.66 max = 21.09 avg = 20.91
alexnet min = 231.89 max = 232.68 avg = 232.37
vgg16 min = 1797.39 max = 1797.89 avg = 1797.62
vgg16_int8 min = 117.45 max = 132.17 avg = 120.69
resnet50 min = 866.06 max = 866.79 avg = 866.48
resnet50_int8 min = 52.63 max = 66.31 avg = 54.28
squeezenet_ssd min = 454.37 max = 458.77 avg = 457.84
squeezenet_ssd_int8 min = 32.36 max = 33.49 avg = 32.89
mobilenet_ssd min = 395.43 max = 398.47 avg = 397.07
mobilenet_ssd_int8 min = 24.80 max = 25.68 avg = 25.26
mobilenet_yolo min = 814.49 max = 815.71 avg = 815.46
mobilenetv2_yolov3 min = 417.61 max = 419.13 avg = 418.37
yolov4-tiny min = 679.58 max = 680.38 avg = 680.02
nanodet_m min = 203.55 max = 206.27 avg = 205.37
yolo-fastest-1.1 min = 107.43 max = 108.05 avg = 107.62
yolo-fastestv2 min = 80.27 max = 80.81 avg = 80.40
vision_transformer min = 21354.49 max = 21358.72 avg = 21355.78
FastestDet min = 84.86 max = 85.31 avg = 84.98
Measured on a Raspberry Pi 5.
This isn't GPU accelerated. llvmpipe is a software renderer.
This is the ncnn output when you build it with the flag -D NCNN_VULKAN=ON and the submodules loaded with git submodule update --depth=1 --init.
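To see which Vulkan devices a benchncnn run reports and which index it selected, the device lines can be paired with the `gpu_device` value. This is a minimal sketch, assuming the log layout shown in this thread. The helper names are hypothetical, and treating any device whose name contains "llvmpipe" as a software renderer is an assumption.

```python
import re

# Matches the first line printed per device, e.g. "[0 V3D 7.1.7] queueC=0[1] ..."
DEVICE_LINE = re.compile(r"^\[(\d+) (.+?)\] queueC=")

def list_devices(log_text):
    """Map device index -> device name from the queue lines of a log."""
    devices = {}
    for line in log_text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            devices[int(match.group(1))] = match.group(2)
    return devices

def selected_device(log_text):
    """Return the integer after "gpu_device =", or None if the line is absent."""
    match = re.search(r"^gpu_device = (-?\d+)\s*$", log_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def is_software_renderer(name):
    # Assumption: llvmpipe is the only software device that shows up here.
    return "llvmpipe" in name.lower()

# Lines copied from the Raspberry Pi 5 log in this thread.
LOG = """\
[0 V3D 7.1.7] queueC=0[1] queueG=0[1] queueT=0[1]
[1 llvmpipe (LLVM 15.0.6, 128 bits)] queueC=0[1] queueG=0[1] queueT=0[1]
loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
"""

devices = list_devices(LOG)
index = selected_device(LOG)
for i, name in sorted(devices.items()):
    marker = "*" if i == index else " "
    kind = "software" if is_software_renderer(name) else "hardware"
    print(f"{marker} [{i}] {name} ({kind})")
```

The asterisk marks the device index named on the `gpu_device` line, which makes it easy to confirm whether a given run went to the hardware driver or to llvmpipe.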
I have done some testing with the latest Vulkan drivers on a Raspberry Pi 4 (64-OS). Knowing the driver is still under construction, the results were a great disappointment. No acceleration at all, it was even 5 times slower than without the Vulkan support. Just to let you know.
Native build on Raspberry Pi 4 64-OS, 1500 MHz, 2 GB RAM, 128 MB GPU RAM.
Without Vulkan
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake -DCMAKE_BUILD_TYPE=Release ..
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
squeezenet min = 63.55 max = 70.05 avg = 65.99
squeezenet_int8 min = 65.72 max = 66.05 avg = 65.84
mobilenet min = 71.39 max = 72.78 avg = 71.86
mobilenet_int8 min = 97.65 max = 129.53 avg = 109.87
mobilenet_v2 min = 71.24 max = 73.68 avg = 72.20
mobilenet_v3 min = 55.79 max = 56.13 avg = 55.93
shufflenet min = 39.25 max = 40.74 avg = 40.06
shufflenet_v2 min = 28.75 max = 29.28 avg = 29.06
mnasnet min = 60.31 max = 61.11 avg = 60.74
proxylessnasnet min = 62.64 max = 77.77 avg = 69.12
efficientnet_b0 min = 93.49 max = 94.29 avg = 93.88
regnety_400m min = 76.97 max = 78.11 avg = 77.55
blazeface min = 13.02 max = 13.26 avg = 13.17
googlenet min = 168.00 max = 190.48 avg = 174.87
googlenet_int8 min = 147.13 max = 207.46 avg = 162.40
resnet18 min = 222.98 max = 231.52 avg = 225.69
resnet18_int8 min = 133.61 max = 145.16 avg = 136.70
alexnet min = 169.34 max = 174.96 avg = 171.05
vgg16 min = 910.35 max = 956.36 avg = 930.93
vgg16_int8 min = 1242.82 max = 1309.72 avg = 1282.35
resnet50 min = 408.64 max = 425.08 avg = 414.09
resnet50_int8 min = 288.59 max = 291.54 avg = 290.26
squeezenet_ssd min = 181.44 max = 182.54 avg = 182.12
squeezenet_ssd_int8 min = 185.94 max = 187.68 avg = 186.83
mobilenet_ssd min = 143.34 max = 143.58 avg = 143.43
mobilenet_ssd_int8 min = 156.11 max = 157.47 avg = 156.51
mobilenet_yolo min = 322.27 max = 351.88 avg = 331.43
mobilenetv2_yolov3 min = 218.17 max = 219.72 avg = 218.87
yolov4-tiny min = 313.92 max = 326.21 avg = 317.19
With Vulkan
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake -DNCNN_VULKAN=ON -DCMAKE_BUILD_TYPE=Release ..
pi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 4 4 0 0
[0 V3D 4.2] queueC=0[1] queueG=0[1] queueT=0[1]
[0 V3D 4.2] bugsbn1=0 bugcopc=0 bugihfa=0
[0 V3D 4.2] fp16p=1 fp16s=0 fp16a=0 int8s=0 int8a=0
[0 V3D 4.2] subgroup=3291716400 basic=0 vote=0 ballot=1 shuffle=0
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
squeezenet min = 346.48 max = 347.38 avg = 346.78
squeezenet_int8 min = 64.67 max = 65.59 avg = 65.18
mobilenet min = 556.58 max = 559.89 avg = 558.20
mobilenet_int8 min = 91.91 max = 93.89 avg = 92.65
mobilenet_v2 min = 381.82 max = 382.65 avg = 382.22
mobilenet_v3 min = 342.35 max = 343.42 avg = 342.73
shufflenet min = 409.33 max = 410.00 avg = 409.59
shufflenet_v2 min = 302.26 max = 305.00 avg = 304.08
mnasnet min = 397.13 max = 397.64 avg = 397.31
proxylessnasnet min = 413.21 max = 413.79 avg = 413.57
efficientnet_b0 min = 559.96 max = 560.99 avg = 560.32
regnety_400m min = 482.12 max = 483.13 avg = 482.59
blazeface min = 76.94 max = 77.10 avg = 77.01
googlenet min = 1121.36 max = 1126.17 avg = 1124.12
googlenet_int8 min = 150.09 max = 150.63 avg = 150.30
resnet18 min = 1084.91 max = 1086.17 avg = 1085.51
resnet18_int8 min = 143.80 max = 152.30 avg = 146.06
alexnet min = 2002.00 max = 2121.92 avg = 2059.23
vgg16 min = 7205.38 max = 7257.74 avg = 7226.90
vgg16_int8 min = 1245.08 max = 1273.66 avg = 1263.44
resnet50 min = 3306.48 max = 3322.29 avg = 3311.11
resnet50_int8 min = 296.10 max = 297.80 avg = 296.92
squeezenet_ssd min = 1717.27 max = 1721.34 avg = 1719.36
squeezenet_ssd_int8 min = 197.46 max = 205.67 avg = 202.49
mobilenet_ssd min = 1396.28 max = 1401.41 avg = 1399.47
mobilenet_ssd_int8 min = 152.84 max = 153.95 avg = 153.55
mobilenet_yolo min = 3071.84 max = 3073.80 avg = 3072.84
mobilenetv2_yolov3 min = 1370.07 max = 1370.98 avg = 1370.47
yolov4-tiny min = 2241.63 max = 2242.32 avg = 2241.93
I did the same test with a Jetson Nano and, surprise surprise, the Vulkan acceleration works excellently!
Native build on Jetson Nano, 2014 MHz, 4 GB RAM.
Without Vulkan
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/jetson.toolchain.cmake -DNCNN_VULKAN=ON -DCMAKE_BUILD_TYPE=Release ..
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
squeezenet min = 53.27 max = 65.17 avg = 57.00
squeezenet_int8 min = 28.36 max = 29.96 avg = 28.95
mobilenet min = 32.56 max = 32.71 avg = 32.63
mobilenet_int8 min = 44.90 max = 45.64 avg = 45.30
mobilenet_v2 min = 26.76 max = 26.94 avg = 26.85
mobilenet_v3 min = 24.14 max = 27.46 avg = 25.31
shufflenet min = 19.61 max = 35.16 avg = 27.86
shufflenet_v2 min = 17.97 max = 99.59 avg = 44.58
mnasnet min = 25.50 max = 43.91 avg = 34.78
proxylessnasnet min = 29.56 max = 36.27 avg = 32.65
efficientnet_b0 min = 54.38 max = 182.29 avg = 90.53
regnety_400m min = 43.64 max = 46.23 avg = 45.26
blazeface min = 6.11 max = 6.46 avg = 6.28
googlenet min = 83.42 max = 88.92 avg = 85.36
googlenet_int8 min = 94.54 max = 123.76 avg = 102.77
resnet18 min = 92.82 max = 166.32 avg = 128.70
resnet18_int8 min = 90.29 max = 100.16 avg = 94.18
alexnet min = 139.70 max = 160.68 avg = 147.90
vgg16 min = 464.18 max = 687.42 avg = 548.92
vgg16_int8 min = 715.58 max = 809.26 avg = 748.51
resnet50 min = 192.21 max = 311.36 avg = 226.71
resnet50_int8 min = 181.12 max = 235.10 avg = 206.01
squeezenet_ssd min = 77.15 max = 103.62 avg = 85.95
squeezenet_ssd_int8 min = 88.66 max = 157.42 avg = 118.41
mobilenet_ssd min = 73.25 max = 162.26 avg = 103.62
mobilenet_ssd_int8 min = 81.04 max = 186.65 avg = 126.86
mobilenet_yolo min = 161.90 max = 255.14 avg = 199.35
mobilenetv2_yolov3 min = 96.22 max = 166.11 avg = 130.65
yolov4-tiny min = 140.02 max = 235.53 avg = 169.60
With Vulkan
jetson@nano:~/ncnn/benchmark $ ./benchncnn 4 4 0 0
[0 NVIDIA Tegra X1 (nvgpu)] queueC=0[16] queueG=0[16] queueT=0[16]
[0 NVIDIA Tegra X1 (nvgpu)] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 NVIDIA Tegra X1 (nvgpu)] fp16p=1 fp16s=1 fp16a=1 int8s=1 int8a=1
[0 NVIDIA Tegra X1 (nvgpu)] subgroup=32 basic=1 vote=1 ballot=1 shuffle=1
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
squeezenet min = 13.92 max = 14.07 avg = 14.01
squeezenet_int8 min = 28.52 max = 104.22 avg = 60.80
mobilenet min = 15.84 max = 16.12 avg = 15.95
mobilenet_int8 min = 46.22 max = 136.99 avg = 70.47
mobilenet_v2 min = 24.48 max = 30.38 avg = 27.81
mobilenet_v3 min = 14.29 max = 22.23 avg = 19.77
shufflenet min = 13.96 max = 14.79 avg = 14.40
shufflenet_v2 min = 23.82 max = 24.62 avg = 24.12
mnasnet min = 18.31 max = 22.95 avg = 19.70
proxylessnasnet min = 14.27 max = 14.87 avg = 14.52
efficientnet_b0 min = 31.57 max = 33.03 avg = 32.29
regnety_400m min = 17.01 max = 26.45 avg = 22.24
blazeface min = 7.38 max = 9.52 avg = 8.47
googlenet min = 41.30 max = 46.55 avg = 43.79
googlenet_int8 min = 95.72 max = 191.92 avg = 120.60
resnet18 min = 44.35 max = 46.20 avg = 45.00
resnet18_int8 min = 90.24 max = 112.45 avg = 96.23
alexnet min = 72.30 max = 74.79 avg = 73.68
vgg16 min = 295.22 max = 298.62 avg = 296.83
vgg16_int8 min = 727.67 max = 762.81 avg = 739.81
resnet50 min = 88.11 max = 94.38 avg = 92.03
resnet50_int8 min = 183.78 max = 288.11 avg = 217.60
squeezenet_ssd min = 53.65 max = 63.92 avg = 57.87
squeezenet_ssd_int8 min = 88.88 max = 193.78 avg = 120.06
mobilenet_ssd min = 36.03 max = 40.12 avg = 37.50
mobilenet_ssd_int8 min = 78.67 max = 188.17 avg = 106.49
mobilenet_yolo min = 74.52 max = 80.38 avg = 76.88
mobilenetv2_yolov3 min = 48.19 max = 51.39 avg = 49.80
yolov4-tiny min = 88.64 max = 96.50 avg = 92.92
nvdc: start nvdcEventThread
nvdc: exit nvdcEventThread
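The CPU and Vulkan runs above can be compared per model by dividing the Vulkan average by the CPU average. This is a minimal sketch, assuming the `name min = … max = … avg = …` row layout printed by benchncnn in this thread. The function names are hypothetical, and only a few rows are copied here to keep the example short.

```python
import re

# One benchmark row: "<model> min = <ms> max = <ms> avg = <ms>"
ROW = re.compile(
    r"(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)"
)

def parse_avg(text):
    """Return {model: average time in ms}; works on wrapped or flattened logs."""
    return {name: float(avg) for name, _min, _max, avg in ROW.findall(text)}

def vulkan_over_cpu(cpu, vulkan):
    """Ratio > 1 means the Vulkan run was slower than the CPU run."""
    return {name: vulkan[name] / cpu[name] for name in cpu if name in vulkan}

# Rows copied from the Raspberry Pi 4 logs in this thread.
rpi4_cpu = parse_avg("""
squeezenet min = 63.55 max = 70.05 avg = 65.99
mobilenet min = 71.39 max = 72.78 avg = 71.86
resnet50 min = 408.64 max = 425.08 avg = 414.09
""")
rpi4_vulkan = parse_avg("""
squeezenet min = 346.48 max = 347.38 avg = 346.78
mobilenet min = 556.58 max = 559.89 avg = 558.20
resnet50 min = 3306.48 max = 3322.29 avg = 3311.11
""")

# Rows copied from the Jetson Nano logs in this thread.
nano_cpu = parse_avg("squeezenet min = 53.27 max = 65.17 avg = 57.00")
nano_vulkan = parse_avg("squeezenet min = 13.92 max = 14.07 avg = 14.01")

rpi4_ratio = vulkan_over_cpu(rpi4_cpu, rpi4_vulkan)
nano_ratio = vulkan_over_cpu(nano_cpu, nano_vulkan)

for name, ratio in rpi4_ratio.items():
    print(f"RPi 4  {name:12s} Vulkan/CPU = {ratio:.2f}")
for name, ratio in nano_ratio.items():
    print(f"Nano   {name:12s} Vulkan/CPU = {ratio:.2f}")
```

On these rows the Raspberry Pi 4 ratios all come out well above 1 (Vulkan slower), while the Jetson Nano ratio is below 1 (Vulkan faster), which matches the conclusion drawn in the post.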