On some ARM CPUs with three different core types (e.g. Snapdragon 8 Gen 1, Kryo 1*Cortex-X2 @3.0 GHz + 3*Cortex-A710 @2.5 GHz + 4*Cortex-A510 @1.8 GHz), small models such as NanoDet and YOLO-fastest may run faster when the number of threads is set to the number of super big cores (1) rather than the number of all big cores (4).
Here are the benchmark results for the Snapdragon 8 Gen 1 on a Xiaomi 12 (not rooted).
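To make the idea concrete, here is a minimal sketch of how the thread count could be derived from the core layout. It is not part of ncnn. It assumes that cores can be grouped by the Linux sysfs value `cpuinfo_max_freq` and that this file is readable on the device; the function names and the `small_model` flag are hypothetical.

```python
from collections import Counter
from pathlib import Path


def read_max_freqs(sysfs_root="/sys/devices/system/cpu"):
    """Read cpuinfo_max_freq (in kHz) for every core that exposes it.

    Assumption: the sysfs files are readable without root on the target device.
    """
    freqs = []
    for path in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq/cpuinfo_max_freq")):
        try:
            freqs.append(int(path.read_text().strip()))
        except (OSError, ValueError):
            continue  # skip cores that are offline or unreadable
    return freqs


def cluster_sizes(max_freqs_khz):
    """Group cores by max frequency: [(freq_khz, core_count), ...], fastest first."""
    return sorted(Counter(max_freqs_khz).items(), reverse=True)


def suggested_threads(max_freqs_khz, small_model=True):
    """Heuristic from this post: on a 3-cluster CPU, use only the fastest
    cluster for small models, otherwise the two fastest clusters."""
    clusters = cluster_sizes(max_freqs_khz)
    if not clusters:
        return 1
    if len(clusters) < 3:
        return clusters[0][1]  # 1 or 2 clusters: use the fastest cluster
    super_big = clusters[0][1]
    all_big = super_big + clusters[1][1]
    return super_big if small_model else all_big


# Core layout quoted above: 1 x 3.0 GHz + 3 x 2.5 GHz + 4 x 1.8 GHz
SD8GEN1_KHZ = [1_800_000] * 4 + [2_500_000] * 3 + [3_000_000]

if __name__ == "__main__":
    print(cluster_sizes(SD8GEN1_KHZ))
    print(suggested_threads(SD8GEN1_KHZ, small_model=True))
    print(suggested_threads(SD8GEN1_KHZ, small_model=False))
```

Whether 1 or 4 threads is actually faster still has to be measured per model.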
cupid:/data/local/tmp $ ./benchncnn 8 4 2 -1 1
loop_count = 8
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
squeezenet min = 8.60 max = 11.39 avg = 10.71
squeezenet_int8 min = 8.54 max = 14.51 avg = 11.96
mobilenet min = 12.05 max = 12.24 avg = 12.12
mobilenet_int8 min = 8.03 max = 13.20 avg = 11.25
mobilenet_v2 min = 11.54 max = 11.86 avg = 11.69
mobilenet_v3 min = 11.42 max = 12.11 avg = 11.59
shufflenet min = 14.66 max = 15.24 avg = 14.84
shufflenet_v2 min = 8.79 max = 14.11 avg = 11.75
mnasnet min = 11.02 max = 17.97 avg = 12.74
proxylessnasnet min = 13.34 max = 14.14 avg = 13.72
efficientnet_b0 min = 19.76 max = 20.66 avg = 20.15
efficientnetv2_b0 min = 30.68 max = 31.13 avg = 30.90
regnety_400m min = 33.89 max = 38.96 avg = 37.00
blazeface min = 5.06 max = 5.26 avg = 5.13
googlenet min = 31.46 max = 32.76 avg = 31.98
googlenet_int8 min = 29.89 max = 30.45 avg = 30.13
resnet18 min = 17.83 max = 18.86 avg = 18.11
resnet18_int8 min = 27.98 max = 28.55 avg = 28.30
alexnet min = 22.82 max = 23.20 avg = 22.99
vgg16 min = 83.71 max = 84.35 avg = 84.05
vgg16_int8 min = 201.70 max = 202.66 avg = 202.14
resnet50 min = 51.33 max = 52.64 avg = 52.13
resnet50_int8 min = 52.42 max = 53.59 avg = 53.12
squeezenet_ssd min = 26.72 max = 27.73 avg = 27.24
squeezenet_ssd_int8 min = 32.83 max = 34.18 avg = 33.50
mobilenet_ssd min = 27.73 max = 28.50 avg = 28.26
mobilenet_ssd_int8 min = 20.95 max = 21.37 avg = 21.13
mobilenet_yolo min = 58.39 max = 59.12 avg = 58.64
mobilenetv2_yolov3 min = 33.55 max = 34.18 avg = 33.92
yolov4-tiny min = 37.59 max = 46.55 avg = 43.01
nanodet_m min = 18.85 max = 19.75 avg = 19.37
yolo-fastest-1.1 min = 12.65 max = 13.53 avg = 13.07
yolo-fastestv2 min = 11.87 max = 13.04 avg = 12.20
vision_transformer min = 942.61 max = 948.94 avg = 945.03
cupid:/data/local/tmp $ ./benchncnn 8 1 2 -1 1
loop_count = 8
num_threads = 1
powersave = 2
gpu_device = -1
cooling_down = 1
squeezenet min = 8.74 max = 8.85 avg = 8.79
squeezenet_int8 min = 6.98 max = 7.44 avg = 7.13
mobilenet min = 14.70 max = 14.91 avg = 14.77
mobilenet_int8 min = 10.94 max = 11.09 avg = 11.00
mobilenet_v2 min = 12.03 max = 12.37 avg = 12.18
mobilenet_v3 min = 10.09 max = 10.34 avg = 10.19
shufflenet min = 7.09 max = 7.31 avg = 7.20
shufflenet_v2 min = 6.83 max = 6.93 avg = 6.88
mnasnet min = 11.90 max = 12.16 avg = 11.98
proxylessnasnet min = 13.85 max = 14.20 avg = 14.08
efficientnet_b0 min = 22.13 max = 22.55 avg = 22.35
efficientnetv2_b0 min = 33.78 max = 34.25 avg = 34.09
regnety_400m min = 15.48 max = 15.67 avg = 15.58
blazeface min = 3.49 max = 3.72 avg = 3.60
googlenet min = 46.09 max = 46.59 avg = 46.39
googlenet_int8 min = 35.67 max = 35.85 avg = 35.76
resnet18 min = 26.71 max = 27.07 avg = 26.85
resnet18_int8 min = 44.50 max = 44.88 avg = 44.63
alexnet min = 40.76 max = 42.18 avg = 41.22
vgg16 min = 152.98 max = 154.04 avg = 153.54
vgg16_int8 min = 388.94 max = 389.80 avg = 389.34
resnet50 min = 83.96 max = 84.88 avg = 84.22
resnet50_int8 min = 81.42 max = 82.11 avg = 81.65
squeezenet_ssd min = 30.36 max = 30.65 avg = 30.45
squeezenet_ssd_int8 min = 36.38 max = 37.78 avg = 36.91
mobilenet_ssd min = 40.50 max = 40.87 avg = 40.66
mobilenet_ssd_int8 min = 23.74 max = 23.97 avg = 23.85
mobilenet_yolo min = 85.26 max = 86.34 avg = 85.56
mobilenetv2_yolov3 min = 44.22 max = 44.67 avg = 44.41
yolov4-tiny min = 55.02 max = 55.75 avg = 55.29
nanodet_m min = 17.00 max = 17.30 avg = 17.11
yolo-fastest-1.1 min = 6.84 max = 7.07 avg = 6.95
yolo-fastestv2 min = 6.25 max = 6.51 avg = 6.40
vision_transformer min = 1696.52 max = 1706.67 avg = 1701.45
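The two runs above can be compared per model with a short script. This is only a sketch: it assumes the `name min = … max = … avg = …` line format shown in this output, and the helper names are hypothetical. The sample data is a handful of lines copied from the two runs.

```python
import re

# Matches benchmark lines such as: "squeezenet min = 8.60 max = 11.39 avg = 10.71"
LINE_RE = re.compile(
    r"^\s*(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)\s*$"
)


def parse_benchncnn(text):
    """Return {model_name: avg_ms} for every benchmark line in the text."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            results[match.group(1)] = float(match.group(4))
    return results


def faster_with_one_thread(run_1t, run_4t):
    """Sorted names of models whose 1-thread average beats the 4-thread average."""
    return sorted(
        name for name in run_1t
        if name in run_4t and run_1t[name] < run_4t[name]
    )


# A few lines copied from the 4-thread run
SAMPLE_4T = """\
squeezenet min = 8.60 max = 11.39 avg = 10.71
resnet18 min = 17.83 max = 18.86 avg = 18.11
nanodet_m min = 18.85 max = 19.75 avg = 19.37
yolo-fastest-1.1 min = 12.65 max = 13.53 avg = 13.07
"""

# The same models from the 1-thread run
SAMPLE_1T = """\
squeezenet min = 8.74 max = 8.85 avg = 8.79
resnet18 min = 26.71 max = 27.07 avg = 26.85
nanodet_m min = 17.00 max = 17.30 avg = 17.11
yolo-fastest-1.1 min = 6.84 max = 7.07 avg = 6.95
"""

if __name__ == "__main__":
    one = parse_benchncnn(SAMPLE_1T)
    four = parse_benchncnn(SAMPLE_4T)
    for name in faster_with_one_thread(one, four):
        print(f"{name}: 1 thread {one[name]:.2f} ms vs 4 threads {four[name]:.2f} ms")
```

On this sample, squeezenet, nanodet_m and yolo-fastest-1.1 are faster with 1 thread, while resnet18 is faster with 4 threads.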
Overall, these results are slower than those of the Snapdragon 870 (Kryo 1*Cortex-A77 @3.19 GHz + 3*Cortex-A77 @2.42 GHz + 4*Cortex-A55 @1.8 GHz) with 4 threads.
Build details:
NCNN version: 20220720
Build environment: Windows 11, Visual Studio 2022
NDK version: 25.0.8775105 (similar to r24)
Build options: -DANDROID_ABI="arm64-v8a" -DANDROID_PLATFORM=android-24 -DNCNN_VULKAN=ON -DANDROID_USE_LEGACY_TOOLCHAIN_FILE=False ..
I also tried my own repo (https://github.com/Galasnow/ObjDetection), which showed similar results. Are my build options wrong, or can this be optimized?