ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

how to set tensor_info/fc_info for FullyConnectedLayer #1050

Closed shiyuanren closed 1 year ago

shiyuanren commented 1 year ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v23.02.1 Build options: {'standalone': '1', 'embed_kernels': '0', 'gemm_tuner': '0', 'build': 'cross_compile', 'build_dir': './0417-full', 'Werror': '0', 'logging': '1', 'debug': '0', 'asserts': '1', 'neon': '1', 'opencl': '0', 'os': 'bare_metal', 'cppthreads': '0', 'openmp': '0', 'arch': 'armv8a', 'estate': '32', 'toolchain_prefix': 'xxx//arm-none-eabi-'} Git hash=b'd8bf9b53752a4f573120cf51b31055de8b3c7d29'

Platform: Cortex-A32
Operating System: FreeRTOS

Problem description: when using ACL's NEFullyConnectedLayer (QASYMM8_SIGNED) to optimise tflite-micro, I get a wrong result in the GEMMLowp mm32 output. The input dimension is [40], the weight dimension is [512, 40], and the output dimension is [512]. The TensorInfo setup is shown below:

```cpp
auto src_info = arm_compute::TensorInfo(
    arm_compute::TensorShape(40), 1,
    arm_compute::DataType::QASYMM8_SIGNED,
    arm_compute::QuantizationInfo(1, 0));

auto weight_info = arm_compute::TensorInfo(
    arm_compute::TensorShape(512, 40), 1,
    arm_compute::DataType::QASYMM8_SIGNED,
    arm_compute::QuantizationInfo(1, 0));

auto dst_info = arm_compute::TensorInfo(
    arm_compute::TensorShape(512), 1,
    arm_compute::DataType::QASYMM8_SIGNED,
    arm_compute::QuantizationInfo(scale, zero_point));
```


**input:**
`18, 1, 9, 11, 8, 1, 11, -3, -11, -7, -26, -10, 4, -5, -2, -15, -9, -22, -18, -3, -12, -3, 0, 3, -6, -19, -20, -13, -11, -4, -6, -2, -7, -1, -1, -1, -2, -7, -10, -17`

**weight (1st row):**
`-16, -30, 51, 59, -7, -3, -16, -4, 1, 2, -1, -4, 0, 2, -1, 0, -2, -1, -1, 3, -2, -2, -1, 1, -1, -4, 4, -1, -2, 4, -3, 1, -2, -1, 0, -3, 1, -2, -1, 0`

The dot product of these should be 759, but `vector_matrix_multiply_s8` produced 159.
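For reference, here is a minimal standalone check of the expected accumulator value (plain C++, not ACL code; int32 accumulation as GEMMLowp would use before requantisation):

```cpp
// Multiplies the input vector by the first weight row with int32 accumulation.
#include <cstdint>
#include <cstdio>

int main()
{
    const int8_t input[40]  = { 18, 1, 9, 11, 8, 1, 11, -3, -11, -7, -26, -10, 4, -5, -2, -15, -9, -22, -18, -3,
                                -12, -3, 0, 3, -6, -19, -20, -13, -11, -4, -6, -2, -7, -1, -1, -1, -2, -7, -10, -17 };
    const int8_t weight[40] = { -16, -30, 51, 59, -7, -3, -16, -4, 1, 2, -1, -4, 0, 2, -1, 0, -2, -1, -1, 3,
                                -2, -2, -1, 1, -1, -4, 4, -1, -2, 4, -3, 1, -2, -1, 0, -3, 1, -2, -1, 0 };
    int32_t acc = 0;
    for(int i = 0; i < 40; ++i)
    {
        acc += static_cast<int32_t>(input[i]) * static_cast<int32_t>(weight[i]);
    }
    std::printf("dot product = %d\n", acc); // prints 759
    return 0;
}
```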

So I get a wrong result. Do I need to set transpose_weights=false? I don't know which setting is wrong; I need your help.

shiyuanren commented 1 year ago

When I change the weight shape to TensorShape(40, 512) and transpose it, I get the correct FC result. But it's slower than the tflite-micro reference implementation.
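For reference, a configuration along these lines is what produced the correct result. This is a sketch only, with placeholder quantization parameters and the bias omitted; note that ACL's TensorShape lists dimensions innermost-first, so TensorShape(40, 512) describes 512 rows of 40 values, matching the tflite [512, 40] weight layout:

```cpp
#include "arm_compute/runtime/NEON/functions/NEFullyConnectedLayer.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

void configure_fc(Tensor &src, Tensor &weights, Tensor &dst)
{
    // Placeholder quantization parameters (scale = 1, offset = 0); substitute the real values.
    src.allocator()->init(TensorInfo(TensorShape(40U), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(1.f, 0)));
    // 40 inputs along dimension 0, 512 outputs along dimension 1.
    weights.allocator()->init(TensorInfo(TensorShape(40U, 512U), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(1.f, 0)));
    dst.allocator()->init(TensorInfo(TensorShape(512U), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(1.f, 0)));

    FullyConnectedLayerInfo fc_info{};
    fc_info.transpose_weights = true; // the default: the layer transposes the weights internally

    NEFullyConnectedLayer fc;
    fc.configure(&src, &weights, /* biases */ nullptr, &dst, fc_info);
}
```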

Please help me find where the problem is. Thanks.

morgolock commented 1 year ago

Hi @shiyuanren

Could you please share the output of the command `file build/libarm_compute.so`? You seem to have built a 32-bit binary. If so, could you please try a 64-bit build? Performance is better on aarch64.
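Based on the build options quoted above, a 64-bit build would change the execution-state option roughly like this (a sketch; the toolchain prefix is a placeholder and must point at an aarch64 toolchain):

```
scons os=bare_metal build=cross_compile arch=armv8a estate=64 neon=1 opencl=0 toolchain_prefix=<aarch64-toolchain-prefix>
```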

Hope this helps.