ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

NEReduceMean accuracy issue for NHWC #1044

Closed alvoron closed 1 month ago

alvoron commented 1 year ago

Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v23.02
Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'examples': '0', 'Werror': '0', 'gemm_tuner': '0', 'reference_openmp': '0', 'validation_tests': '0', 'benchmark_tests': '0', 'data_layout_support': 'all', 'build_dir': '/thirdparty/ComputeLibrary', 'install_dir': '/thirdparty/ComputeLibrary/install', 'arch': 'armv8.2-a', 'debug': '1', 'asserts': '1', 'logging': '1', 'os': 'macos', 'build': 'native', 'compiler_prefix': '/usr/bin/', 'extra_cxx_flags': '-fPIC -fsigned-char -ffunction-sections -fdata-sections -fdiagnostics-show-option -Wundef -Wreturn-type -Wunused-variable -Wswitch -Wno-macro-redefined -Wno-undef -Wno-missing-declarations -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -mcpu=native -Wno-undef -Wno-error=return-stack-address'}
Git hash=b'f8f7ede7a01eb5cd9d06060b4d2f2d1404d93f29'

Platform: Apple M1

Operating System: macOS 12.6

Problem description: NEReduceMean passes 100% of tests for the NCHW layout but only about 20% of tests for the NHWC layout, due to accuracy issues. Does NEReduceMean expect the same data layout for NHWC as other operations, i.e. TensorShape(C, W, H, N)? Are there any other differences between the NCHW and NHWC layouts with regard to NEReduceMean?

alvoron commented 1 year ago

Let me provide the parameters:
Precision: FP32
Source tensor shape: [2,2,2,2]
Source tensor: 5 4 1 4 9 2 5 6 7 8 1 6 1 8 1 6
Indices: 0 1
Output tensor shape: [1,1,2,2]
Output tensor provided by ACL: 3.5 4 5.5 5.5

Expected output tensor: 5.5 5.5 2 5.5 according to numpy:

import numpy as np

data = np.array([[[[5,4],[1,4]],[[9,2],[5,6]]],[[[7,8],[1,6]],[[1,8],[1,6]]]],dtype=np.float32)
axes = np.array([0,1], dtype=np.int64)
print(np.mean(data, axis=tuple(axes), keepdims=True))

alvoron commented 1 year ago

NEReduceMean gives correct results if the indices are 1 2 3 (the result is 4.5 4.75). However, it gives incorrect results for indices 0 1 3: the ACL result is 3.75 5.5, while the numpy result is 5.5 3.75.
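
For reference, the NumPy values quoted in this thread can be reproduced with a single script. This is only a sketch of the reference side: it assumes the 16 source values are laid out in row-major order with shape [2,2,2,2], as in the snippet above, and the helper name `reference_mean` is made up for illustration. It says nothing about how ACL interprets the axes.

```python
import numpy as np

# Source tensor from the report, row-major, shape [2,2,2,2].
data = np.array([5, 4, 1, 4, 9, 2, 5, 6, 7, 8, 1, 6, 1, 8, 1, 6],
                dtype=np.float32).reshape(2, 2, 2, 2)

def reference_mean(axes):
    """Mean over the given NumPy axes, flattened for easy comparison."""
    return np.mean(data, axis=tuple(axes), keepdims=True).ravel().tolist()

ref_01 = reference_mean([0, 1])      # indices 0 1
ref_123 = reference_mean([1, 2, 3])  # indices 1 2 3
ref_013 = reference_mean([0, 1, 3])  # indices 0 1 3
print(ref_01, ref_123, ref_013)
```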

alvoron commented 5 months ago

The issue is reproducible on Raspberry Pi as well.

morgolock commented 4 months ago

Hi @alvoron

I've just tried the shapes and data you shared with TensorFlow, and I get the same results as ACL (3.75 5.5) for axis 0:

>>> x = tf.constant([ [5. ,4.] , [1., 4.] , [ 9.,  2.],  [5., 6.] ,  [7. , 8.] , [ 1.,  6.] ,  [1., 8.],  [1., 6.]])
>>> tf.reduce_mean(x,0)                                                                                             
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([3.75, 5.5 ], dtype=float32)>
>>> tf.reduce_mean(x,1)                                                                                            
<tf.Tensor: shape=(8,), dtype=float32, numpy=array([4.5, 2.5, 5.5, 5.5, 7.5, 3.5, 4.5, 3.5], dtype=float32)>

It would help if you could share a standalone test reproducing the problem. See the test below:

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "tests/SimpleTensor.h"
#include "arm_compute/runtime/Tensor.h"
#include "utils/TypePrinter.h"
#include <iostream>
using namespace std;
using namespace arm_compute;
using namespace arm_compute::test;

int main()
{
    NEReduceMean reduce_f;
    Tensor inputt;
    Tensor outputt;
    // 4-D FP32 input tensor with shape 2x2x2x2.
    inputt.allocator()->init(TensorInfo(TensorShape(2, 2, 2, 2), 1, DataType::F32));
    float data[] = { 5., 4., 1., 4., 9., 2., 5., 6., 7., 8., 1., 6., 1., 8., 1., 6. };
    // Use the array above as the input tensor's backing memory.
    inputt.allocator()->import_memory(data);
    // Reduce along ACL dimension 0 only.
    Coordinates axis(0);
    std::cout << "axis " << axis << std::endl;
    // Third argument is keep_dims.
    reduce_f.configure(&inputt, axis, true, &outputt);
    outputt.allocator()->allocate();
    cout << "input " << std::endl;
    inputt.print(std::cout);
    reduce_f.run();
    cout << "\noutput " << std::endl;
    outputt.print(std::cout);
}

When executed:


LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./mean 
axis 0
input 
5 4 
1 4 

9 2 
5 6 

7 8 
1 6 

1 8 
1 6 

output 
4.5 
2.5 

5.5 
5.5 

7.5 
3.5 

4.5 
3.5 
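
The output above matches a mean taken over the innermost dimension. A small NumPy check gives the same eight values. It is a sketch, assuming the imported buffer is read in row-major order so that ACL dimension 0 corresponds to the last NumPy axis:

```python
import numpy as np

# Same 16 values as the float data[] array in the test, row-major.
data = np.array([5, 4, 1, 4, 9, 2, 5, 6, 7, 8, 1, 6, 1, 8, 1, 6],
                dtype=np.float32).reshape(2, 2, 2, 2)

# ACL's TensorShape lists dimensions from innermost to outermost, so
# ACL axis 0 is assumed to correspond to the last NumPy axis here.
innermost_mean = np.mean(data, axis=-1).ravel().tolist()
print(innermost_mean)  # [4.5, 2.5, 5.5, 5.5, 7.5, 3.5, 4.5, 3.5]
```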

Hope this helps.

alvoron commented 1 month ago

Indeed, there is no issue on the ACL side. I observed the issue because of an incorrect axis transformation from NCHW to NHWC on my side.
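
A minimal sketch of the kind of axis conversion involved follows. It is not the code used in the original integration. It assumes the caller's reduction axes are given in NCHW order (N=0, C=1, H=2, W=3) and that ACL dimensions are indexed from innermost to outermost, so an NHWC tensor is TensorShape(C, W, H, N) and an NCHW tensor is TensorShape(W, H, C, N). The names `ACL_DIM_INDEX` and `to_acl_axes` are hypothetical.

```python
# ACL dimension index of each logical dimension, per data layout.
# ACL indexes dimensions from innermost (0) to outermost (3).
ACL_DIM_INDEX = {
    "NCHW": {"W": 0, "H": 1, "C": 2, "N": 3},  # TensorShape(W, H, C, N)
    "NHWC": {"C": 0, "W": 1, "H": 2, "N": 3},  # TensorShape(C, W, H, N)
}

def to_acl_axes(nchw_axes, layout):
    """Map reduction axes given in NCHW order to ACL dimension indices."""
    names = "NCHW"  # logical name of each caller-side axis
    return sorted(ACL_DIM_INDEX[layout][names[a]] for a in nchw_axes)

print(to_acl_axes([0, 1], "NCHW"))  # [2, 3]
print(to_acl_axes([0, 1], "NHWC"))  # [0, 3]
```

Under these assumptions the same logical axes (N and C) land on different ACL dimension indices depending on the layout, so reusing the NCHW mapping for an NHWC tensor reduces over the wrong dimensions.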