I found ArmCL (Arm NEON backend) to be slower than ONNX Runtime and TFLite on certain embedded boards.
For reference, ONNX Runtime was tested with the CPUExecutionProvider, and TFLite with the XNNPACK delegate.
ONNX Runtime, TFLite, and ArmCL performed similarly on the following boards:
RK3399 on Asus Tinker Edge R
S922X on Odroid N2+
However, on the Qualcomm Snapdragon 865, ArmCL was much slower than the other two runtimes.
I used ArmCL v22.05 and tried every build option ArmCL provides, but the performance gap remained severe.
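To compare runtimes fairly, each one needs to be timed the same way, with warm-up runs excluded. The following is a minimal, framework-agnostic sketch of such a harness. It assumes each runtime's inference call is wrapped in a zero-argument Python callable; the helper name `benchmark` and its `warmup`/`runs` parameters are hypothetical and not taken from the measurements in this post.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], warmup: int = 10, runs: int = 50) -> Dict[str, float]:
    """Time a zero-argument inference callable; return latency stats in milliseconds.

    Hypothetical helper: wrap one inference call of any runtime in `run_once`
    so that every backend is measured with identical logic.
    """
    if warmup < 0 or runs <= 0:
        raise ValueError("warmup must be >= 0 and runs must be > 0")

    # Warm-up iterations are discarded: the first runs often include lazy
    # initialization, weight repacking, and cold-cache effects.
    for _ in range(warmup):
        run_once()

    # Timed iterations, measured with a monotonic high-resolution clock.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples),
        "mean_ms": statistics.fmean(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without any ML runtime installed.
    def dummy_inference():
        return sum(i * i for i in range(10_000))

    stats = benchmark(dummy_inference, warmup=3, runs=20)
    print({name: round(value, 3) for name, value in stats.items()})
```

With ONNX Runtime, `run_once` could be `lambda: session.run(None, {input_name: x})` on an `InferenceSession` created with `providers=["CPUExecutionProvider"]`; with TFLite, it could be `interpreter.invoke` after the input tensor has been set. Reporting the median rather than the mean keeps occasional scheduler or thermal outliers from dominating the comparison. It also helps to pin every runtime to the same thread count (for example `intra_op_num_threads` in ONNX Runtime's `SessionOptions`, or `num_threads` on the TFLite `Interpreter`).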
Benchmark of SD865 ONNX Runtime (CPUExecutionProvider)