ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

NEReduceMean accuracy issue for NHWC #1044

Closed: alvoron closed this issue 6 months ago

alvoron commented 1 year ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v23.02 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'examples': '0', 'Werror': '0', 'gemm_tuner': '0', 'reference_openmp': '0', 'validation_tests': '0', 'benchmark_tests': '0', 'data_layout_support': 'all', 'build_dir': '/thirdparty/ComputeLibrary', 'install_dir': '/thirdparty/ComputeLibrary/install', 'arch': 'armv8.2-a', 'debug': '1', 'asserts': '1', 'logging': '1', 'os': 'macos', 'build': 'native', 'compiler_prefix': '/usr/bin/', 'extra_cxx_flags': '-fPIC -fsigned-char -ffunction-sections -fdata-sections -fdiagnostics-show-option -Wundef -Wreturn-type -Wunused-variable -Wswitch -Wno-macro-redefined -Wno-undef -Wno-missing-declarations -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -mcpu=native -Wno-undef -Wno-error=return-stack-address'} Git hash=b'f8f7ede7a01eb5cd9d06060b4d2f2d1404d93f29'

Platform: Apple M1

Operating System: macOS 12.6

Problem description: NEReduceMean passes 100% of tests for the NCHW layout but only about 20% of tests for the NHWC layout, due to accuracy issues. Does NEReduceMean expect the same tensor shape for NHWC as other operations, i.e. TensorShape(C, W, H, N)? Are there any other differences between the NCHW and NHWC layouts with regard to NEReduceMean?
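For reference, ACL's TensorShape lists dimensions innermost (fastest-changing) first, so an NHWC tensor is described as TensorShape(C, W, H, N) and an NCHW tensor as TensorShape(W, H, C, N); reduction axes passed to NEReduceMean index into this order, as the test later in this thread shows for axis 0. The Python sketch below spells the mapping out; the helper names are hypothetical and only illustrate the convention.

```python
# Hypothetical helpers, for illustration only (not part of ACL).
def acl_tensor_shape(n, c, h, w, layout):
    """Return the ACL-style TensorShape (innermost dimension first)."""
    if layout == "NCHW":
        # Memory order N, C, H, W (W innermost) -> TensorShape(W, H, C, N)
        return (w, h, c, n)
    if layout == "NHWC":
        # Memory order N, H, W, C (C innermost) -> TensorShape(C, W, H, N)
        return (c, w, h, n)
    raise ValueError(f"unsupported layout: {layout}")


def acl_shape_from_numpy(np_shape):
    """A row-major numpy shape read back to front gives the ACL TensorShape."""
    return tuple(reversed(np_shape))


print(acl_tensor_shape(1, 3, 4, 5, "NCHW"))  # (5, 4, 3, 1)
print(acl_tensor_shape(1, 3, 4, 5, "NHWC"))  # (3, 5, 4, 1)
```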

alvoron commented 1 year ago

Here are the parameters:

Precision: FP32
Source tensor shape: [2,2,2,2]
Source tensor: 5 4 1 4 9 2 5 6 7 8 1 6 1 8 1 6
Indices: 0 1
Output tensor shape: [1,1,2,2]
Output tensor provided by ACL: 3.5 4 5.5 5.5

Expected output tensor according to numpy: 5.5 5.5 2 5.5, computed with:

import numpy as np
data = np.array([[[[5,4],[1,4]],[[9,2],[5,6]]],[[[7,8],[1,6]],[[1,8],[1,6]]]],dtype=np.float32)
axes = np.array([0,1], dtype=np.int64)
print(np.mean(data, axis=tuple(axes), keepdims=True))
alvoron commented 1 year ago

NEReduceMean gives correct results for indices 1 2 3 (result: 4.5 4.75); however, it gives incorrect results for indices 0 1 3: the ACL result is 3.75 5.5, while the numpy result is 5.5 3.75.
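Both numpy results can be checked by hand. Writing $x_{nchw}$ for the element of the [2,2,2,2] source tensor at numpy index (n, c, h, w), the worked arithmetic below also shows which reduction produces the ACL values 3.75 5.5:

```latex
\begin{aligned}
\text{axes } (1,2,3):\quad & \bar{x}_{n} = \tfrac{1}{8}\textstyle\sum_{c,h,w} x_{nchw}, &
\bar{x}_{n=0} &= \tfrac{5+4+1+4+9+2+5+6}{8} = \tfrac{36}{8} = 4.5, &
\bar{x}_{n=1} &= \tfrac{7+8+1+6+1+8+1+6}{8} = \tfrac{38}{8} = 4.75 \\
\text{axes } (0,1,3):\quad & \bar{x}_{h} = \tfrac{1}{8}\textstyle\sum_{n,c,w} x_{nchw}, &
\bar{x}_{h=0} &= \tfrac{5+4+9+2+7+8+1+8}{8} = \tfrac{44}{8} = 5.5, &
\bar{x}_{h=1} &= \tfrac{1+4+5+6+1+6+1+6}{8} = \tfrac{30}{8} = 3.75 \\
\text{axes } (0,1,2):\quad & \bar{x}_{w} = \tfrac{1}{8}\textstyle\sum_{n,c,h} x_{nchw}, &
\bar{x}_{w=0} &= \tfrac{5+1+9+5+7+1+1+1}{8} = \tfrac{30}{8} = 3.75, &
\bar{x}_{w=1} &= \tfrac{4+4+2+6+8+6+8+6}{8} = \tfrac{44}{8} = 5.5
\end{aligned}
```

So the ACL output 3.75 5.5 equals the mean over N, C and H with W kept, which suggests that a different set of axes was reduced rather than a loss of precision.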

alvoron commented 10 months ago

The issue is reproducible on Raspberry Pi as well.

morgolock commented 9 months ago

Hi @alvoron

I've just tried the shapes and data you shared with TensorFlow (viewing the data as an 8x2 matrix), and I get the same result as ACL, 3.75 5.5, for axis 0:

>>> import tensorflow as tf
>>> x = tf.constant([[5., 4.], [1., 4.], [9., 2.], [5., 6.], [7., 8.], [1., 6.], [1., 8.], [1., 6.]])
>>> tf.reduce_mean(x, 0)
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([3.75, 5.5 ], dtype=float32)>
>>> tf.reduce_mean(x, 1)
<tf.Tensor: shape=(8,), dtype=float32, numpy=array([4.5, 2.5, 5.5, 5.5, 7.5, 3.5, 4.5, 3.5], dtype=float32)>

It would help if you could share a standalone test that reproduces the problem, similar to the one below:

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"
#include "utils/Utils.h"
#include "utils/TypePrinter.h"
#include <iostream>

using namespace arm_compute;

int main()
{
    NEReduceMean reduce_f;
    Tensor       inputt;
    Tensor       outputt;

    // 4D F32 input; ACL lists dimensions innermost first, so dimension 0 changes fastest
    inputt.allocator()->init(TensorInfo(TensorShape(2, 2, 2, 2), 1, DataType::F32));
    float data[] = { 5., 4., 1., 4., 9., 2., 5., 6., 7., 8., 1., 6., 1., 8., 1., 6. };
    // Let the input tensor use the existing buffer instead of allocating its own
    inputt.allocator()->import_memory(data);

    // Reduce along dimension 0 (the innermost one) and keep the reduced dimension
    Coordinates axis(0);
    std::cout << "axis " << axis << std::endl;
    reduce_f.configure(&inputt, axis, true, &outputt);
    // configure() initialises the output's TensorInfo; allocate it before running
    outputt.allocator()->allocate();

    std::cout << "input " << std::endl;
    inputt.print(std::cout);
    reduce_f.run();
    std::cout << "\noutput " << std::endl;
    outputt.print(std::cout);
    return 0;
}

When executed:

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./mean
axis 0
input 
5 4 
1 4 

9 2 
5 6 

7 8 
1 6 

1 8 
1 6 

output 
4.5 
2.5 

5.5 
5.5 

7.5 
3.5 

4.5 
3.5 

Hope this helps
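To make the axis numbering concrete, the pure-Python sketch below emulates a keep-dims mean over a flat buffer, numbering axes the way ACL does (dimension 0 innermost). The function `reduce_mean_acl` is a hypothetical illustration, not an ACL API. On the 16-value input from this thread, ACL axis 0 reproduces the output printed by the test above, while ACL axes 3 and 2 reproduce the numpy result for numpy axes 0 and 1.

```python
from itertools import product


def reduce_mean_acl(data, shape, axes):
    """Keep-dims mean over a flat buffer, with axes numbered the ACL way.

    `shape` lists dimensions innermost first (ACL TensorShape order) and
    `axes` are indices into that order, so axis 0 is the fastest-changing
    dimension. Returns the output as a flat list in memory order.
    """
    rank = len(shape)
    # Element strides: dimension 0 is contiguous in memory.
    strides = [1] * rank
    for d in range(1, rank):
        strides[d] = strides[d - 1] * shape[d - 1]

    out_shape = [1 if d in axes else shape[d] for d in range(rank)]
    count = 1
    for d in axes:
        count *= shape[d]

    out = []
    # product() varies its last argument fastest, so feed it the output
    # dimensions outermost first; the results then come out in memory order.
    for rev_coord in product(*(range(s) for s in reversed(out_shape))):
        coord = list(reversed(rev_coord))  # back to ACL order, dim 0 first
        total = 0.0
        for idx in product(*(range(shape[d]) for d in axes)):
            for d, i in zip(axes, idx):
                coord[d] = i
            total += data[sum(c * s for c, s in zip(coord, strides))]
        out.append(total / count)
    return out


# Input from the thread; every dimension has size 2.
data = [5, 4, 1, 4, 9, 2, 5, 6, 7, 8, 1, 6, 1, 8, 1, 6]
shape = (2, 2, 2, 2)

print(reduce_mean_acl(data, shape, (0,)))    # [4.5, 2.5, 5.5, 5.5, 7.5, 3.5, 4.5, 3.5]
print(reduce_mean_acl(data, shape, (3, 2)))  # [5.5, 5.5, 2.0, 5.5], numpy axes (0, 1)
```

Because numpy arrays are row-major with axis 0 outermost, numpy axis k of a rank-r array corresponds to ACL dimension r - 1 - k when both describe the same buffer, which is why numpy axes (0, 1) become ACL axes (3, 2) here.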

alvoron commented 6 months ago

Indeed, there is no issue on the ACL side. The problem I observed was caused by an incorrect NCHW->NHWC axis transformation.
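A minimal sketch of such a transformation, assuming the framework expresses reduction axes in the logical NCHW convention (0=N, 1=C, 2=H, 3=W) and the ACL tensor is 4D. The helper name and mapping table are hypothetical and are not taken from the reporter's code; the point is that the NHWC permutation must be applied first, and only then the reversal into ACL's innermost-first numbering.

```python
# Hypothetical helper (not an ACL API). Position of each logical NCHW axis
# when the data is physically stored as NHWC: N->0, C->3, H->1, W->2.
NCHW_TO_NHWC = {0: 0, 1: 3, 2: 1, 3: 2}


def nchw_axes_to_acl(axes, layout, rank=4):
    """Map logical NCHW reduction axes to ACL dimension indices."""
    if layout == "NCHW":
        physical = list(axes)
    elif layout == "NHWC":
        physical = [NCHW_TO_NHWC[a] for a in axes]
    else:
        raise ValueError(f"unsupported layout: {layout}")
    # ACL numbers dimensions innermost first, so reverse each physical index.
    return sorted(rank - 1 - a for a in physical)


print(nchw_axes_to_acl([0, 1], "NCHW"))  # [2, 3] -> C and N in TensorShape(W, H, C, N)
print(nchw_axes_to_acl([0, 1], "NHWC"))  # [0, 3] -> C and N in TensorShape(C, W, H, N)
```

The resulting indices are what would go into the Coordinates passed to NEReduceMean::configure for that layout.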