ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

Performance is slow compared to other inference frameworks. #1057

Closed: Piorosen closed this issue 10 months ago

Piorosen commented 1 year ago

I found ArmCL (Arm NEON) to be slower than ONNX Runtime and TFLite on certain embedded boards.

First of all, ONNX Runtime and TFLite were tested with the CPUExecutionProvider and XNNPACK, respectively.

Comparing the performance of ONNX Runtime, TFLite, and ArmCL, the embedded boards on which all three frameworks gave similar results are:

  1. RK3399 in Asus Tinker Edge R
  2. S922X on Odroid N2+

However, on the Qualcomm Snapdragon 865 there is a large performance gap. The ArmCL version was v22.05, and I tried adjusting every build option ArmCL provides, but the gap remained severe (for example, MobileNetV2 QINT8 averages roughly 5 ms with ONNX Runtime but well over 100 ms with ArmCL; see the tables below).

Benchmark on the SD865. All values are per-inference latencies in ms; each row is one of 10 timed runs.

ONNX Runtime (CPUExecutionProvider)

| AlexNet | AlexNet QINT8 | VGG16 | VGG16 QINT8 | GoogLeNet | GoogLeNet QINT8 | MobileNetV2 | MobileNetV2 QINT8 | ResNet 50 | ResNet 50 QINT8 | ResNet 101 | ResNet 101 QINT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 38.3345 | 12.1691 | 281.7168 | 94.4056 | 45.8758 | 21.4943 | 26.7831 | 7.5031 | 102.8331 | 36.9762 | 193.1677 | 55.336 |
| 39.1297 | 8.9317 | 289.4517 | 80.5732 | 55.8126 | 16.936 | 29.0688 | 4.8021 | 116.8999 | 37.6365 | 179.9723 | 52.4999 |
| 37.8358 | 7.1217 | 272.7608 | 80.3214 | 42.2797 | 16.0477 | 22.6141 | 4.7789 | 106.9133 | 26.4037 | 167.1386 | 49.0422 |
| 28.9498 | 7.3354 | 279.3631 | 79.5836 | 92.309 | 15.505 | 23.8399 | 4.7935 | 98.129 | 27.0844 | 173.0357 | 72.0859 |
| 35.5161 | 7.3081 | 292.1322 | 83.3638 | 51.8395 | 16.3906 | 25.3586 | 4.5688 | 91.717 | 26.7634 | 235.0388 | 48.0027 |
| 37.5535 | 7.1328 | 288.3501 | 81.7173 | 46.4616 | 15.2209 | 25.7717 | 4.5273 | 82.9437 | 24.764 | 168.3184 | 51.0249 |
| 40.2629 | 6.9721 | 278.0248 | 78.941 | 50.9661 | 15.7568 | 22.319 | 4.7039 | 85.1937 | 37.1524 | 178.0507 | 44.9377 |
| 38.3167 | 7.1217 | 293.1878 | 79.5413 | 38.1839 | 15.9927 | 23.0271 | 5.806 | 86.8026 | 38.7036 | 169.3151 | 44.1762 |
| 40.5191 | 7.1071 | 320.1782 | 80.041 | 48.3489 | 15.4871 | 25.7002 | 4.5956 | 126.057 | 30.7258 | 173.5559 | 44.3704 |
| 39.7134 | 7.2459 | 279.2085 | 85.0172 | 42.3679 | 17.6157 | 22.3541 | 4.5471 | 97.731 | 28.9419 | 168.0849 | 43.7054 |
ArmCL v22.05 (Arm NEON)

| AlexNet | AlexNet QINT8 | VGG16 | VGG16 QINT8 | GoogLeNet | GoogLeNet QINT8 | MobileNetV2 | MobileNetV2 QINT8 | ResNet 50 | ResNet 50 QINT8 | ResNet 101 | ResNet 101 QINT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 58.06925 | 37.2176 | 289.629 | 200.957 | 240.905 | 193.058 | 117.482 | 108.732 | 268.032 | 240.661 | 576.126 | 362.058 |
| 63.046375 | 35.6677 | 300.466 | 159.684 | 242.869 | 202.032 | 115.116 | 121.453 | 293.095 | 194.97 | 543.541 | 373.276 |
| 56.26025 | 30.3409 | 319.452 | 158.902 | 243.346 | 200.536 | 121.989 | 122.098 | 325.789 | 214.423 | 590.161 | 361.561 |
| 58.05075 | 31.3844 | 306.13 | 194.298 | 254.771 | 199.482 | 116.153 | 115.621 | 302.584 | 203.908 | 654.428 | 365.632 |
| 55.13575 | 29.6613 | 287.006 | 202.352 | 244.735 | 199.572 | 117.204 | 116.405 | 305.219 | 200.815 | 566.976 | 372.79 |
| 56.55675 | 36.9412 | 253.977 | 214.631 | 244.002 | 195.876 | 118.052 | 117.617 | 302.376 | 199.354 | 571.369 | 368.249 |
| 60.6535 | 32.8701 | 305.209 | 181.447 | 259.037 | 205.209 | 123.382 | 117.075 | 306.66 | 200.787 | 535.525 | 363.008 |
| 62.070125 | 93.5256 | 321.244 | 216.368 | 203.619 | 207.432 | 112.584 | 117.425 | 314.503 | 199.152 | 658.316 | 386.758 |
| 60.016625 | 94.8739 | 278.612 | 213.177 | 229.707 | 202.618 | 121.173 | 118.833 | 272.14 | 196.816 | 572.022 | 356.283 |
| 60.450625 | 28.6663 | 309.026 | 192.158 | 235.997 | 204.26 | 113.729 | 115.064 | 299.979 | 218.205 | 551.699 | 376.76 |
TFLite (XNNPACK), 4 threads

| AlexNet | AlexNet QINT8 | VGG16 | VGG16 QINT8 | GoogLeNet | GoogLeNet QINT8 | MobileNetV2 | MobileNetV2 QINT8 | ResNet 50 | ResNet 50 QINT8 | ResNet 101 | ResNet 101 QINT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 67.95396 | 35.01802 | 450.18432 | 157.06177 | 57.67469 | 17.05917 | 45.47724 | 13.88859 | 131.51734 | 41.43734 | 229.63667 | 66.05859 |
| 32.17708 | 13.35260 | 420.85891 | 103.10005 | 42.92714 | 13.94734 | 29.06016 | 11.91286 | 106.25135 | 31.13818 | 204.75693 | 79.66724 |
| 35.58182 | 10.06089 | 431.62682 | 98.99927 | 48.75568 | 13.50031 | 28.68505 | 12.12484 | 107.80568 | 31.15880 | 224.97391 | 86.48990 |
| 32.26401 | 8.06141 | 444.58401 | 100.20031 | 42.48453 | 13.26958 | 32.29094 | 11.91156 | 105.93490 | 31.10813 | 248.10292 | 59.50656 |
| 33.55641 | 7.80188 | 437.85505 | 99.94177 | 43.13401 | 13.24510 | 32.44620 | 12.11938 | 110.83026 | 31.16141 | 220.18510 | 52.49328 |
| 32.53490 | 7.60229 | 481.61323 | 100.35781 | 44.86297 | 13.40844 | 31.96354 | 13.15448 | 106.71531 | 34.20901 | 215.51766 | 52.45620 |
| 40.11854 | 7.56542 | 447.29802 | 98.84641 | 43.21141 | 13.60635 | 32.11599 | 11.85708 | 107.05620 | 33.80010 | 217.48245 | 52.31333 |
| 32.30703 | 7.61193 | 456.78573 | 100.30104 | 43.09865 | 13.34370 | 33.06484 | 11.80552 | 108.13130 | 35.27177 | 210.23427 | 52.51896 |
| 33.05042 | 7.66818 | 442.34990 | 128.54818 | 48.62234 | 13.28115 | 48.47500 | 11.77193 | 108.37094 | 31.63438 | 212.91656 | 52.50328 |
| 31.73922 | 7.95047 | 464.20146 | 120.10677 | 44.24594 | 13.22453 | 25.66807 | 11.84547 | 108.36964 | 31.55859 | 210.44651 | 52.47479 |

morgolock commented 1 year ago

Hi @Piorosen

Could you please share the commands used to run the models?

Piorosen commented 1 year ago

All experiments (ONNX Runtime, TFLite, ArmCL) were run in an adb shell environment on Android (via Termux). Termux is a terminal emulator for Android, so it is effectively the same environment as adb.

ArmCL was built with NDK r21b, following the guide at https://arm-software.github.io/ComputeLibrary/v22.05/how_to_build.xhtml

Of the architecture build options below, the Qualcomm Snapdragon 865 corresponds to armv8.2-a, but without SVE; when I build with SVE enabled, I get an Illegal Instruction error at runtime.

armv7a|x86_32|x86_64|armv8a|armv8.2-a|armv8.2-a-sve|armv8.2-a-sve2|armv8.6-a|armv8.6-a-sve|armv8.6-a-sve2|armv8r64|x86
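
For reference, a cross-compile command in the spirit of that guide looks roughly like the following. This is an illustrative sketch, not the exact command from the guide: the job count, compiler setup, and `toolchain_prefix` value depend on the local NDK installation.

```
# Build ACL for Android, Armv8.2-A without SVE (NDK clang toolchain on PATH).
CC=clang CXX=clang++ scons -j8 os=android arch=armv8.2-a build=cross_compile \
    neon=1 opencl=0 debug=0 asserts=0 toolchain_prefix="" Werror=0
```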

ONNX

```python
import time

import numpy as np
import onnxruntime as ort


def inference(model_name):
    model_file = model_name + ".onnx"

    sess_options = ort.SessionOptions()
    ort_sess = ort.InferenceSession(model_file, sess_options,
                                    providers=["CPUExecutionProvider"])

    # Random NCHW input matching the model's declared input shape.
    input_meta = ort_sess.get_inputs()[0]
    x = np.random.rand(input_meta.shape[0],
                       input_meta.shape[1],
                       input_meta.shape[2],
                       input_meta.shape[3]).astype(np.float32)

    ort_inputs = {input_meta.name: x}

    # Time each of 10 runs individually; only the run() call is measured.
    repeat = 10
    for idx in range(repeat):
        start = time.perf_counter_ns()
        ort_sess.run(None, ort_inputs)
        end = time.perf_counter_ns()
        print("[ %d / %d ]\t%.4f ms" % (idx + 1, repeat, (end - start) / 1e6))


if __name__ == "__main__":
    inference("./resnet101")
```

TFLite (4 threads, chosen because ArmCL's internal CPP scheduler creates as many threads as the device has cores, including the minimum-frequency ones; see the note after this listing)

```python
import time

import numpy as np
import tflite_runtime.interpreter as tflite  # or: from tensorflow import lite as tflite


def inference(model_name):
    # num_threads=4 to match the 4-thread setup described above.
    interpreter = tflite.Interpreter(model_path=model_name + ".tflite",
                                     num_threads=4)
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Random input matching the model's input shape; batch dim added last.
    np_features = np.random.rand(input_details[0]['shape'][1],
                                 input_details[0]['shape'][2],
                                 input_details[0]['shape'][3]).astype(input_details[0]['dtype'])
    np_features = np.expand_dims(np_features, axis=0)

    # Time each of 10 runs individually; only invoke() is measured.
    for _ in range(10):
        interpreter.allocate_tensors()
        interpreter.set_tensor(input_details[0]['index'], np_features)

        start = time.perf_counter_ns()
        interpreter.invoke()
        end = time.perf_counter_ns()
        print((end - start) / 1e6, "ms")


if __name__ == "__main__":
    inference("./resnet101")
```
Piorosen commented 1 year ago

@morgolock The ArmCL-specific code was implemented in the same (or a very similar) way as the examples at https://github.com/ARM-software/ComputeLibrary/tree/main/examples, as sketched below.
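
For concreteness, those examples follow a configure/allocate/run pattern, and a timing harness in that style looks roughly like the sketch below. This is a hypothetical minimal benchmark of a single NEON convolution, not the code actually used in the experiments; the layer shapes and the 10-run timing loop are illustrative.

```cpp
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

#include <chrono>
#include <iostream>

using namespace arm_compute;

int main()
{
    // Tensors for a single 224x224x3 input and a 7x7/stride-2 convolution
    // with 64 output channels (illustrative shapes, NCHW layout).
    Tensor src{}, weights{}, biases{}, dst{};
    src.allocator()->init(TensorInfo(TensorShape(224U, 224U, 3U, 1U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(7U, 7U, 3U, 64U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(64U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(112U, 112U, 64U, 1U), 1, DataType::F32));

    // Configure the NEON function first, then allocate backing memory.
    NEConvolutionLayer conv{};
    conv.configure(&src, &weights, &biases, &dst, PadStrideInfo(2, 2, 3, 3));
    src.allocator()->allocate();
    weights.allocator()->allocate();
    biases.allocator()->allocate();
    dst.allocator()->allocate();

    // Time each run of the configured function, as in the Python harnesses.
    for(int i = 0; i < 10; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        conv.run();
        const auto end = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration<double, std::milli>(end - start).count() << " ms\n";
    }
    return 0;
}
```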

morgolock commented 11 months ago

Hi @Piorosen

I think there is a problem with the way the measurements are being taken, especially since the test harnesses are different. The figures for each framework are produced by different tools, which can cause problems when assessing performance.

There is no need to reimplement the models using ACL. If you wish to run a tflite model through ACL, you can just use the tflite benchmark tool along with the ArmNN delegate for tflite. Using this tool you can run any tflite model and get performance measurements for both XNNPACK and ArmNN+ACL.

You can get ArmNN prebuilt binaries with the delegate from https://github.com/ARM-software/armnn/releases/tag/v23.05

See below

XNNPACK

./android_aarch64_benchmark_model --graph=../mobilenet_v2_1.0_224_1_default_1.tflite --num_threads=4 --num_runs=120 --warmup_runs=1
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [120]
INFO: Num threads: [4]
INFO: Min warmup runs: [1]
INFO: Graph: [../mobilenet_v2_1.0_224_1_default_1.tflite]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model ../mobilenet_v2_1.0_224_1_default_1.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 66 out of 66 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for the whole graph.
INFO: The input model file size (MB): 13.9786
INFO: Initialized session in 18.388ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=25 first=24235 curr=180548 min=16508 max=180548 avg=24709.8 std=31892

INFO: Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=120 first=48530 curr=17102 min=16575 max=73521 avg=23906.4 std=10379

INFO: Inference timings in us: Init: 18388, First inference: 24235, Warmup (avg): 24709.8, Inference (avg): 23906.4
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=29.9531 overall=38.0703

ArmNN Delegate

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./android_aarch64_benchmark_model  --graph=../mobilenet_v2_1.0_224_1_default_1.tflite --num_threads=4 --num_runs=120 --warmup_runs=1 --external_delegate_path="./libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc"
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [120]
INFO: Num threads: [4]
INFO: Min warmup runs: [1]
INFO: Graph: [../mobilenet_v2_1.0_224_1_default_1.tflite]
INFO: #threads used for CPU inference: [4]
INFO: External delegate path: [./libarmnnDelegate.so]
INFO: External delegate options: [backends:CpuAcc]
INFO: Loaded model ../mobilenet_v2_1.0_224_1_default_1.tflite
INFO: Initialized TensorFlow Lite runtime.
Couldn't find any of the following OpenCL library: libOpenCL.so libGLES_mali.so libmali.so libOpenCL-pixel.so libOpenCL-car.so 
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
INFO: EXTERNAL delegate created.
VERBOSE: Replacing 66 out of 66 node(s) with delegate (TfLiteArmNnDelegate) node, yielding 1 partitions for the whole graph.
INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 13.9786
INFO: Initialized session in 90.935ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=20 first=114287 curr=22206 min=19009 max=114287 avg=25709.2 std=20497

INFO: Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=120 first=22213 curr=23318 min=18924 max=71473 avg=25735.2 std=9032

INFO: Inference timings in us: Init: 90935, First inference: 114287, Warmup (avg): 25709.2, Inference (avg): 25735.2
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=115.688 overall=134.383
Piorosen commented 10 months ago

Thank you, @morgolock.

I was able to try this out, compare the performance, and see what the problem was. Everything is resolved, so I am closing this issue.