Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Performance on ARM CPU with 3 different architectures #4055

Open Galasnow opened 1 year ago

Galasnow commented 1 year ago

Detail

For some ARM CPUs with 3 different architectures (e.g. Snapdragon 8 Gen 1, Kryo 1*Cortex-X2 @3.0 GHz + 3*Cortex-A710 @2.5 GHz + 4*Cortex-A510 @1.8 GHz) and some small models such as NanoDet and YOLO-fastest, it may be better to set the number of threads to the number of super big cores (1) rather than the number of all big cores (4).
Here are the benchmark results for the Snapdragon 8 Gen 1 on a Xiaomi 12 (not rooted).
cupid:/data/local/tmp $ ./benchncnn 8 4 2 -1 1
loop_count = 8
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
          squeezenet  min =    8.60  max =   11.39  avg =   10.71
     squeezenet_int8  min =    8.54  max =   14.51  avg =   11.96
           mobilenet  min =   12.05  max =   12.24  avg =   12.12
      mobilenet_int8  min =    8.03  max =   13.20  avg =   11.25
        mobilenet_v2  min =   11.54  max =   11.86  avg =   11.69
        mobilenet_v3  min =   11.42  max =   12.11  avg =   11.59
          shufflenet  min =   14.66  max =   15.24  avg =   14.84
       shufflenet_v2  min =    8.79  max =   14.11  avg =   11.75
             mnasnet  min =   11.02  max =   17.97  avg =   12.74
     proxylessnasnet  min =   13.34  max =   14.14  avg =   13.72
     efficientnet_b0  min =   19.76  max =   20.66  avg =   20.15
   efficientnetv2_b0  min =   30.68  max =   31.13  avg =   30.90
        regnety_400m  min =   33.89  max =   38.96  avg =   37.00
           blazeface  min =    5.06  max =    5.26  avg =    5.13
           googlenet  min =   31.46  max =   32.76  avg =   31.98
      googlenet_int8  min =   29.89  max =   30.45  avg =   30.13
            resnet18  min =   17.83  max =   18.86  avg =   18.11
       resnet18_int8  min =   27.98  max =   28.55  avg =   28.30
             alexnet  min =   22.82  max =   23.20  avg =   22.99
               vgg16  min =   83.71  max =   84.35  avg =   84.05
          vgg16_int8  min =  201.70  max =  202.66  avg =  202.14
            resnet50  min =   51.33  max =   52.64  avg =   52.13
       resnet50_int8  min =   52.42  max =   53.59  avg =   53.12
      squeezenet_ssd  min =   26.72  max =   27.73  avg =   27.24
 squeezenet_ssd_int8  min =   32.83  max =   34.18  avg =   33.50
       mobilenet_ssd  min =   27.73  max =   28.50  avg =   28.26
  mobilenet_ssd_int8  min =   20.95  max =   21.37  avg =   21.13
      mobilenet_yolo  min =   58.39  max =   59.12  avg =   58.64
  mobilenetv2_yolov3  min =   33.55  max =   34.18  avg =   33.92
         yolov4-tiny  min =   37.59  max =   46.55  avg =   43.01
           nanodet_m  min =   18.85  max =   19.75  avg =   19.37
    yolo-fastest-1.1  min =   12.65  max =   13.53  avg =   13.07
      yolo-fastestv2  min =   11.87  max =   13.04  avg =   12.20
  vision_transformer  min =  942.61  max =  948.94  avg =  945.03

cupid:/data/local/tmp $ ./benchncnn 8 1 2 -1 1
loop_count = 8
num_threads = 1
powersave = 2
gpu_device = -1
cooling_down = 1
          squeezenet  min =    8.74  max =    8.85  avg =    8.79
     squeezenet_int8  min =    6.98  max =    7.44  avg =    7.13
           mobilenet  min =   14.70  max =   14.91  avg =   14.77
      mobilenet_int8  min =   10.94  max =   11.09  avg =   11.00
        mobilenet_v2  min =   12.03  max =   12.37  avg =   12.18
        mobilenet_v3  min =   10.09  max =   10.34  avg =   10.19
          shufflenet  min =    7.09  max =    7.31  avg =    7.20
       shufflenet_v2  min =    6.83  max =    6.93  avg =    6.88
             mnasnet  min =   11.90  max =   12.16  avg =   11.98
     proxylessnasnet  min =   13.85  max =   14.20  avg =   14.08
     efficientnet_b0  min =   22.13  max =   22.55  avg =   22.35
   efficientnetv2_b0  min =   33.78  max =   34.25  avg =   34.09
        regnety_400m  min =   15.48  max =   15.67  avg =   15.58
           blazeface  min =    3.49  max =    3.72  avg =    3.60
           googlenet  min =   46.09  max =   46.59  avg =   46.39
      googlenet_int8  min =   35.67  max =   35.85  avg =   35.76
            resnet18  min =   26.71  max =   27.07  avg =   26.85
       resnet18_int8  min =   44.50  max =   44.88  avg =   44.63
             alexnet  min =   40.76  max =   42.18  avg =   41.22
               vgg16  min =  152.98  max =  154.04  avg =  153.54
          vgg16_int8  min =  388.94  max =  389.80  avg =  389.34
            resnet50  min =   83.96  max =   84.88  avg =   84.22
       resnet50_int8  min =   81.42  max =   82.11  avg =   81.65
      squeezenet_ssd  min =   30.36  max =   30.65  avg =   30.45
 squeezenet_ssd_int8  min =   36.38  max =   37.78  avg =   36.91
       mobilenet_ssd  min =   40.50  max =   40.87  avg =   40.66
  mobilenet_ssd_int8  min =   23.74  max =   23.97  avg =   23.85
      mobilenet_yolo  min =   85.26  max =   86.34  avg =   85.56
  mobilenetv2_yolov3  min =   44.22  max =   44.67  avg =   44.41
         yolov4-tiny  min =   55.02  max =   55.75  avg =   55.29
           nanodet_m  min =   17.00  max =   17.30  avg =   17.11
    yolo-fastest-1.1  min =    6.84  max =    7.07  avg =    6.95
      yolo-fastestv2  min =    6.25  max =    6.51  avg =    6.40
  vision_transformer  min = 1696.52  max = 1706.67  avg = 1701.45
Overall, it is slower than the Snapdragon 870 (Kryo 1*Cortex-A77 @3.19 GHz + 3*Cortex-A77 @2.42 GHz + 4*Cortex-A55 @1.8 GHz) with 4 threads.
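
To make the comparison between the two runs above easier to read, here is a minimal sketch that computes the 4-thread / 1-thread latency ratio for a few models. The `avg` values are copied from the two benchncnn logs in this post; the struct and function names are made up for illustration and are not part of ncnn.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Average latency in ms, copied from the two benchncnn runs in this issue
// (num_threads = 4 and num_threads = 1, both with powersave = 2).
struct BenchResult {
    std::string model;
    double avg_4_threads;
    double avg_1_thread;
};

// Ratio > 1.0 means the 4-thread run was slower than the 1-thread run.
double ratio_4t_over_1t(const BenchResult& r) {
    assert(r.avg_1_thread > 0.0);
    return r.avg_4_threads / r.avg_1_thread;
}

// A handful of rows from the logs: small models first, larger ones last.
std::vector<BenchResult> sample_results() {
    return {
        {"shufflenet",       14.84,   7.20},
        {"regnety_400m",     37.00,  15.58},
        {"nanodet_m",        19.37,  17.11},
        {"yolo-fastest-1.1", 13.07,   6.95},
        {"yolo-fastestv2",   12.20,   6.40},
        {"resnet50",         52.13,  84.22},
        {"vgg16",            84.05, 153.54},
    };
}

// Names of the sampled models that ran faster with a single thread.
std::vector<std::string> faster_with_one_thread(const std::vector<BenchResult>& rows) {
    std::vector<std::string> names;
    for (const BenchResult& r : rows) {
        if (ratio_4t_over_1t(r) > 1.0) names.push_back(r.model);
    }
    return names;
}
```

For the sampled rows, the five small models come back from `faster_with_one_thread`, while resnet50 and vgg16 do not, which matches the observation that only the small models benefit from a single thread.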

build options:
NCNN version: 20220720
build environment: Windows 11, Visual Studio 2022
NDK version: 25.0.8775105 (similar to r24)
build options: -DANDROID_ABI="arm64-v8a" -DANDROID_PLATFORM=android-24 -DNCNN_VULKAN=ON -DANDROID_USE_LEGACY_TOOLCHAIN_FILE=False ..

I also tried my own repo (https://github.com/Galasnow/ObjDetection), which showed similar results. Are my options incorrect, or can this be optimized?

Galasnow commented 1 year ago

Thanks a lot QAQ!

nihui commented 1 year ago

we may need ncnn::get_very_big_cpu_count()
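
As a rough illustration of what such a query could be based on, here is a sketch that counts the cores sharing the highest maximum frequency. This is only an assumption about one possible approach, not ncnn's implementation: `count_fastest_cores` and `read_max_freqs_khz` are hypothetical names, and the sysfs read stops at the first core whose `cpuinfo_max_freq` file cannot be read.

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of ncnn: count the cores whose maximum
// frequency equals the highest maximum frequency in the list.
int count_fastest_cores(const std::vector<long>& max_freq_khz) {
    if (max_freq_khz.empty()) return 0;
    const long top = *std::max_element(max_freq_khz.begin(), max_freq_khz.end());
    return static_cast<int>(
        std::count(max_freq_khz.begin(), max_freq_khz.end(), top));
}

// Hypothetical helper: read each core's maximum frequency (kHz) from the
// Linux/Android cpufreq sysfs interface. Simplification: stops at the first
// core whose file is missing or unreadable (e.g. an offline core).
std::vector<long> read_max_freqs_khz() {
    std::vector<long> freqs;
    for (int cpu = 0;; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                        "/cpufreq/cpuinfo_max_freq");
        long khz = 0;
        if (!(f >> khz)) break;
        freqs.push_back(khz);
    }
    return freqs;
}
```

With the Snapdragon 8 Gen 1 layout from the first post (one core at 3.0 GHz, three at 2.5 GHz, four at 1.8 GHz), `count_fastest_cores` returns 1, which could then be assigned to `ncnn::Option::num_threads` for small models. A frequency-only rule cannot tell a different core architecture from a higher-clocked core of the same type, so it also returns 1 for the Snapdragon 870 layout.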