ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

GPU Convolution for INT8 is slower than FP16 #1034

Closed: srikris-sridhar closed this issue 8 months ago

srikris-sridhar commented 1 year ago

Based on an ARM-NN issue I filed, @TeresaARM requested that I file a separate issue against the Arm Compute Library.

Output of 'strings libarm_compute.so | grep arm_compute_version':

arm_compute_version=v22.11 Build options: {'arch': 'arm64-v8a', 'build_dir': 'release/arm64-v8a', 'debug': '0', 'os': 'android', 'build': 'cross_compile', 'extra_cxx_flags': '-fPIE -fPIC -g0', 'toolchain_prefix': '/opt/homebrew/share/android-commandlinetools/ndk/25.1.8937393/toolchains/llvm/prebuilt/darwin-x86_64/bin/llvm-', 'compiler_prefix': '/opt/homebrew/share/android-commandlinetools/ndk/25.1.8937393/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-', 'neon': '1', 'opencl': '1', 'embed_kernels': '1', 'examples': '0', 'gemm_tuner': '0', 'benchmark_tests': '0', 'validation_tests': '0'} Git hash=b'1b3192e8a23513031163dc14d248f47671986121'

Platform: Samsung A33 (Mali-G68 GPU)
Operating System: Android 12

Problem description:

I've tried running a really simple convolution (two models attached, one in FP32 and one in INT8). Here is what I see on a Samsung A33 with a Mali-G68 GPU: INT8 gives a good speed-up over FP32 on the CPU, but not on the GPU. Is this expected?

morgolock commented 1 year ago

Hi @srikris-sridhar

GPU Convolution for INT8 is slower than FP16

Could you please clarify whether you meant FP32 instead of FP16?

Do you mean that the INT8 model is slower than the FP32 model when running with GpuAcc on the Mali-G68 GPU?

srikris-sridhar commented 1 year ago

I meant FP16, not FP32. Sorry for the typo in the numbers.

srikris-sridhar commented 1 year ago

To be precise, the comparison is the following (since TFLite downcasts FP32 to FP16 for inference):

- CPU (INT8): 2.5 ms
- CPU (FP16): 6.3 ms
- GPU (INT8): 7.2 ms
- GPU (FP16): 8.4 ms
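
For context, a minimal sketch of how an FP32 TFLite model is typically converted with post-training float16 quantization, so that the runtime can execute it in half precision. The model path and output file name here are hypothetical (the attached models are not reproduced from this thread):

```python
import tensorflow as tf

# Hypothetical SavedModel path; stands in for the attached simple-conv model.
converter = tf.lite.TFLiteConverter.from_saved_model("simple_conv_saved_model")

# Post-training FP16 quantization: weights are stored as float16 and
# delegates that support it (e.g. the GPU delegate) can run in half precision.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

with open("simple_conv_fp16.tflite", "wb") as f:
    f.write(converter.convert())
```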

morgolock commented 1 year ago

Hi @srikris-sridhar

I can confirm that I reproduced the problem on Mali-G57, see below:

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -c GpuAcc -m ./simple_conv_int8.tflite   --iterations 12 | grep Inference
Info: Inference time: 41.74 ms
Info: Inference time: 30.60 ms
Info: Inference time: 31.26 ms
Info: Inference time: 30.85 ms
Info: Inference time: 29.64 ms
Info: Inference time: 30.25 ms
Info: Inference time: 30.24 ms
Info: Inference time: 30.35 ms
Info: Inference time: 30.75 ms
Info: Inference time: 31.25 ms
Info: Inference time: 30.72 ms
Info: Inference time: 30.00 ms
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -c GpuAcc -m ./simple_conv_fp32.tflite  --iterations 12 | grep Inference
Info: Inference time: 36.13 ms
Info: Inference time: 23.24 ms
Info: Inference time: 23.34 ms
Info: Inference time: 22.88 ms
Info: Inference time: 22.97 ms
Info: Inference time: 22.80 ms
Info: Inference time: 23.23 ms
Info: Inference time: 23.43 ms
Info: Inference time: 20.62 ms
Info: Inference time: 23.17 ms
Info: Inference time: 19.25 ms
Info: Inference time: 23.19 ms

Having discussed this with the team, it looks like there is some work to be done on the GEMM heuristics to choose the right block size.
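
For readers unfamiliar with what such a heuristic does, here is a toy sketch of block-size selection for a tiled GEMM. This is purely illustrative and is not the Compute Library's actual heuristic; the cache size and element size are assumptions, and a real GPU heuristic would also account for workgroup size, vector width, and data type (e.g. INT8 vs FP16):

```python
import numpy as np

def pick_block_size(m, n, k, cache_bytes=32 * 1024, elem_size=4):
    """Toy heuristic: choose the largest power-of-two square block whose
    working set (one A tile, one B tile, one C tile) fits in a target cache.
    Real heuristics also consider vector width, threads, and data type."""
    b = 1
    while 3 * (2 * b) ** 2 * elem_size <= cache_bytes:
        b *= 2
    return min(b, m, n, k)

def blocked_gemm(a, b):
    """Reference tiled matrix multiply; NumPy slicing handles ragged edges."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    bs = pick_block_size(m, n, k)
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, k, bs):
                # Accumulate one output tile from a pair of input tiles.
                c[i0:i0 + bs, j0:j0 + bs] += (
                    a[i0:i0 + bs, k0:k0 + bs] @ b[k0:k0 + bs, j0:j0 + bs]
                )
    return c
```

The point of the issue is that a block size tuned for one data type can be a poor fit for another: a heuristic that picks good tiles for FP16 may leave the INT8 path underutilised, which matches the benchmark numbers above.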

Hope this helps.

srikris-sridhar commented 1 year ago

Awesome, thanks! Looking forward to seeing what can be done here. The slowdown seems consistent across many models.

morgolock commented 8 months ago

Thanks for reporting this; we have added the request to our backlog.

Closing the issue as we are already tracking it internally.