ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Performance is slow compared to other inference frameworks. #1057

Closed Piorosen closed 1 year ago

Piorosen commented 1 year ago

I found ArmCL (Arm NEON) to be slower than ONNX Runtime and TFLite on certain embedded boards.

First of all, ONNX Runtime and TFLite were tested with the CPUExecutionProvider and XNNPACK, respectively.

On the following embedded boards, ONNX Runtime, TFLite, and ArmCL performed similarly:

  1. RK3399 in Asus Tinker Edge R
  2. S922X on Odroid N2+

However, on the Qualcomm Snapdragon 865 there is a large performance gap. The ArmCL version was v22.05, and I tried adjusting all the build options provided by ArmCL, but the performance difference remained severe.

Benchmark on the SD865. Each value is a single-run inference time in ms (10 runs per model), as printed by the harnesses below.

ONNX (CPUExecutionProvider)

| AlexNet | AlexNet QINT8 | VGG16 | VGG16 QINT8 | GoogLeNet | GoogLeNet QINT8 | MobileNetV2 | MobileNetV2 QINT8 | ResNet 50 | ResNet 50 QINT8 | ResNet 101 | ResNet 101 QINT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 38.3345 | 12.1691 | 281.7168 | 94.4056 | 45.8758 | 21.4943 | 26.7831 | 7.5031 | 102.8331 | 36.9762 | 193.1677 | 55.336 |
| 39.1297 | 8.9317 | 289.4517 | 80.5732 | 55.8126 | 16.936 | 29.0688 | 4.8021 | 116.8999 | 37.6365 | 179.9723 | 52.4999 |
| 37.8358 | 7.1217 | 272.7608 | 80.3214 | 42.2797 | 16.0477 | 22.6141 | 4.7789 | 106.9133 | 26.4037 | 167.1386 | 49.0422 |
| 28.9498 | 7.3354 | 279.3631 | 79.5836 | 92.309 | 15.505 | 23.8399 | 4.7935 | 98.129 | 27.0844 | 173.0357 | 72.0859 |
| 35.5161 | 7.3081 | 292.1322 | 83.3638 | 51.8395 | 16.3906 | 25.3586 | 4.5688 | 91.717 | 26.7634 | 235.0388 | 48.0027 |
| 37.5535 | 7.1328 | 288.3501 | 81.7173 | 46.4616 | 15.2209 | 25.7717 | 4.5273 | 82.9437 | 24.764 | 168.3184 | 51.0249 |
| 40.2629 | 6.9721 | 278.0248 | 78.941 | 50.9661 | 15.7568 | 22.319 | 4.7039 | 85.1937 | 37.1524 | 178.0507 | 44.9377 |
| 38.3167 | 7.1217 | 293.1878 | 79.5413 | 38.1839 | 15.9927 | 23.0271 | 5.806 | 86.8026 | 38.7036 | 169.3151 | 44.1762 |
| 40.5191 | 7.1071 | 320.1782 | 80.041 | 48.3489 | 15.4871 | 25.7002 | 4.5956 | 126.057 | 30.7258 | 173.5559 | 44.3704 |
| 39.7134 | 7.2459 | 279.2085 | 85.0172 | 42.3679 | 17.6157 | 22.3541 | 4.5471 | 97.731 | 28.9419 | 168.0849 | 43.7054 |
ArmCL v22.05 (Arm NEON)

| AlexNet | AlexNet QINT8 | VGG16 | VGG16 QINT8 | GoogLeNet | GoogLeNet QINT8 | MobileNetV2 | MobileNetV2 QINT8 | ResNet 50 | ResNet 50 QINT8 | ResNet 101 | ResNet 101 QINT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 58.06925 | 37.2176 | 289.629 | 200.957 | 240.905 | 193.058 | 117.482 | 108.732 | 268.032 | 240.661 | 576.126 | 362.058 |
| 63.046375 | 35.6677 | 300.466 | 159.684 | 242.869 | 202.032 | 115.116 | 121.453 | 293.095 | 194.97 | 543.541 | 373.276 |
| 56.26025 | 30.3409 | 319.452 | 158.902 | 243.346 | 200.536 | 121.989 | 122.098 | 325.789 | 214.423 | 590.161 | 361.561 |
| 58.05075 | 31.3844 | 306.13 | 194.298 | 254.771 | 199.482 | 116.153 | 115.621 | 302.584 | 203.908 | 654.428 | 365.632 |
| 55.13575 | 29.6613 | 287.006 | 202.352 | 244.735 | 199.572 | 117.204 | 116.405 | 305.219 | 200.815 | 566.976 | 372.79 |
| 56.55675 | 36.9412 | 253.977 | 214.631 | 244.002 | 195.876 | 118.052 | 117.617 | 302.376 | 199.354 | 571.369 | 368.249 |
| 60.6535 | 32.8701 | 305.209 | 181.447 | 259.037 | 205.209 | 123.382 | 117.075 | 306.66 | 200.787 | 535.525 | 363.008 |
| 62.070125 | 93.5256 | 321.244 | 216.368 | 203.619 | 207.432 | 112.584 | 117.425 | 314.503 | 199.152 | 658.316 | 386.758 |
| 60.016625 | 94.8739 | 278.612 | 213.177 | 229.707 | 202.618 | 121.173 | 118.833 | 272.14 | 196.816 | 572.022 | 356.283 |
| 60.450625 | 28.6663 | 309.026 | 192.158 | 235.997 | 204.26 | 113.729 | 115.064 | 299.979 | 218.205 | 551.699 | 376.76 |
TFLite (XNNPACK), 4 threads

| AlexNet | AlexNet QINT8 | VGG16 | VGG16 QINT8 | GoogLeNet | GoogLeNet QINT8 | MobileNetV2 | MobileNetV2 QINT8 | ResNet 50 | ResNet 50 QINT8 | ResNet 101 | ResNet 101 QINT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 67.95396 | 35.01802 | 450.18432 | 157.06177 | 57.67469 | 17.05917 | 45.47724 | 13.88859 | 131.51734 | 41.43734 | 229.63667 | 66.05859 |
| 32.17708 | 13.35260 | 420.85891 | 103.10005 | 42.92714 | 13.94734 | 29.06016 | 11.91286 | 106.25135 | 31.13818 | 204.75693 | 79.66724 |
| 35.58182 | 10.06089 | 431.62682 | 98.99927 | 48.75568 | 13.50031 | 28.68505 | 12.12484 | 107.80568 | 31.15880 | 224.97391 | 86.48990 |
| 32.26401 | 8.06141 | 444.58401 | 100.20031 | 42.48453 | 13.26958 | 32.29094 | 11.91156 | 105.93490 | 31.10813 | 248.10292 | 59.50656 |
| 33.55641 | 7.80188 | 437.85505 | 99.94177 | 43.13401 | 13.24510 | 32.44620 | 12.11938 | 110.83026 | 31.16141 | 220.18510 | 52.49328 |
| 32.53490 | 7.60229 | 481.61323 | 100.35781 | 44.86297 | 13.40844 | 31.96354 | 13.15448 | 106.71531 | 34.20901 | 215.51766 | 52.45620 |
| 40.11854 | 7.56542 | 447.29802 | 98.84641 | 43.21141 | 13.60635 | 32.11599 | 11.85708 | 107.05620 | 33.80010 | 217.48245 | 52.31333 |
| 32.30703 | 7.61193 | 456.78573 | 100.30104 | 43.09865 | 13.34370 | 33.06484 | 11.80552 | 108.13130 | 35.27177 | 210.23427 | 52.51896 |
| 33.05042 | 7.66818 | 442.34990 | 128.54818 | 48.62234 | 13.28115 | 48.47500 | 11.77193 | 108.37094 | 31.63438 | 212.91656 | 52.50328 |
| 31.73922 | 7.95047 | 464.20146 | 120.10677 | 44.24594 | 13.22453 | 25.66807 | 11.84547 | 108.36964 | 31.55859 | 210.44651 | 52.47479 |

morgolock commented 1 year ago

Hi @Piorosen

Could you please share the commands used to run the models?

Piorosen commented 1 year ago

All experiments (ONNX, TFLite, ArmCL) were run in the adb environment on Android (via Termux). Termux is a terminal environment for Android, so it is effectively the same environment as adb.

ArmCL was built with NDK r21b, following the build instructions at the address below. https://arm-software.github.io/ComputeLibrary/v22.05/how_to_build.xhtml

Among the arch build options below, the Qualcomm Snapdragon 865 corresponds to armv8.2-a, but without SVE; when I build with SVE, I get an Illegal Instruction error. armv7a|x86_32|x86_64|armv8a|armv8.2-a|armv8.2-a-sve|armv8.2-a-sve2|armv8.6-a|armv8.6-a-sve|armv8.6-a-sve2|armv8r64|x86

ONNX

import time

import numpy as np
import onnxruntime as ort


def inference(model_name):
    model_file = model_name + ".onnx"

    # Run on the CPU execution provider only.
    sess_options = ort.SessionOptions()
    ort_sess = ort.InferenceSession(model_file, sess_options, providers=["CPUExecutionProvider"])

    # Random NCHW input matching the model's input shape.
    x = np.random.rand(ort_sess.get_inputs()[0].shape[0],
                       ort_sess.get_inputs()[0].shape[1],
                       ort_sess.get_inputs()[0].shape[2],
                       ort_sess.get_inputs()[0].shape[3]).astype(np.float32)

    ort_inputs = {ort_sess.get_inputs()[0].name: x}

    # Time each of the 10 runs individually and print the latency in ms.
    repeat = 10
    for idx in range(0, repeat):
        start = time.perf_counter_ns()
        ort_sess.run(None, ort_inputs)
        end = time.perf_counter_ns()
        print("[ " + str(idx + 1) + " / " + str(repeat) + " ]\t" + str((end - start) / 1000 / 1000) + " ms")


if __name__ == "__main__":
    inference("./resnet101")

TFLite (4 threads; reason: ArmCL's internal CPPThread creates as many minimum-frequency threads as there are cores)

import time

import numpy as np
import tflite_runtime.interpreter as tflite  # (or: from tensorflow import lite as tflite)


def inference(model_name):
    # 4 threads, as described above; XNNPACK is the CPU backend.
    interpreter = tflite.Interpreter(model_path=model_name + ".tflite", num_threads=4)
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Random input matching the model's input shape; batch dimension added afterwards.
    np_features = np.random.rand(input_details[0]['shape'][1],
                                 input_details[0]['shape'][2],
                                 input_details[0]['shape'][3]).astype(input_details[0]['dtype'])
    np_features = np.expand_dims(np_features, axis=0)

    for _ in range(10):
        interpreter.allocate_tensors()
        interpreter.set_tensor(input_details[0]['index'], np_features)

        # Time only the invoke() call and print the latency in ms.
        start = time.perf_counter_ns()
        interpreter.invoke()
        end = time.perf_counter_ns()
        print((end - start) / 1000.0 / 1000.0)


if __name__ == "__main__":
    inference("./resnet101")
Piorosen commented 1 year ago

@morgolock The ArmCL-specific code was implemented in the same/similar way as the examples in https://github.com/ARM-software/ComputeLibrary/tree/main/examples.

morgolock commented 1 year ago

Hi @Piorosen

I think there is a problem with the way the measurements are being taken, especially if the test harnesses are different. The figures for each framework are produced by different tools, which can cause problems when assessing the performance.

There is no need to reimplement the models using ACL. If you wish to run any tflite model through ACL, you can just use the tflite benchmark tool along with the ArmNN delegate for tflite. Using this tool you can run any tflite model and get performance measurements for both XNNPACK and ArmNN+ACL.

You can get ArmNN prebuilt binaries with the delegate from https://github.com/ARM-software/armnn/releases/tag/v23.05

See below

XNNPACK

./android_aarch64_benchmark_model --graph=../mobilenet_v2_1.0_224_1_default_1.tflite --num_threads=4 --num_runs=120 --warmup_runs=1
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [120]
INFO: Num threads: [4]
INFO: Min warmup runs: [1]
INFO: Graph: [../mobilenet_v2_1.0_224_1_default_1.tflite]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model ../mobilenet_v2_1.0_224_1_default_1.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 66 out of 66 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for the whole graph.
INFO: The input model file size (MB): 13.9786
INFO: Initialized session in 18.388ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=25 first=24235 curr=180548 min=16508 max=180548 avg=24709.8 std=31892

INFO: Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=120 first=48530 curr=17102 min=16575 max=73521 avg=23906.4 std=10379

INFO: Inference timings in us: Init: 18388, First inference: 24235, Warmup (avg): 24709.8, Inference (avg): 23906.4
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=29.9531 overall=38.0703

ArmNN Delegate

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./android_aarch64_benchmark_model  --graph=../mobilenet_v2_1.0_224_1_default_1.tflite --num_threads=4 --num_runs=120 --warmup_runs=1 --external_delegate_path="./libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc"
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [120]
INFO: Num threads: [4]
INFO: Min warmup runs: [1]
INFO: Graph: [../mobilenet_v2_1.0_224_1_default_1.tflite]
INFO: #threads used for CPU inference: [4]
INFO: External delegate path: [./libarmnnDelegate.so]
INFO: External delegate options: [backends:CpuAcc]
INFO: Loaded model ../mobilenet_v2_1.0_224_1_default_1.tflite
INFO: Initialized TensorFlow Lite runtime.
Couldn't find any of the following OpenCL library: libOpenCL.so libGLES_mali.so libmali.so libOpenCL-pixel.so libOpenCL-car.so 
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
INFO: EXTERNAL delegate created.
VERBOSE: Replacing 66 out of 66 node(s) with delegate (TfLiteArmNnDelegate) node, yielding 1 partitions for the whole graph.
INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 13.9786
INFO: Initialized session in 90.935ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=20 first=114287 curr=22206 min=19009 max=114287 avg=25709.2 std=20497

INFO: Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=120 first=22213 curr=23318 min=18924 max=71473 avg=25735.2 std=9032

INFO: Inference timings in us: Init: 90935, First inference: 114287, Warmup (avg): 25709.2, Inference (avg): 25735.2
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=115.688 overall=134.383

Piorosen commented 1 year ago

Thank you, @morgolock.

I was able to check it, compare the performance, and see what the problem was. Everything is resolved, so I am closing this issue.