ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

NEGEMMLowpMatrixMultiplyCore: QASYMM8 src1 & QASYMM8_SIGNED src2 support #1124

Closed eshoguli closed 2 months ago

eshoguli commented 4 months ago

According to the documentation, NEGEMMLowpMatrixMultiplyCore supports only a limited set of QSYMM8 / QASYMM8_SIGNED input combinations:

| src0 | src1 | src2 | dst |
|------|------|------|-----|
| QASYMM8_SIGNED | QSYMM8 | S32 | QASYMM8_SIGNED |
| QASYMM8_SIGNED | QSYMM8 | S32 | S32 |

But we need QSYMM8 on src1 and QASYMM8_SIGNED on src2. Why is this combination not supported? Can I use a shift / zero-point in the second NEQuantizationLayer to work around the issue?

Are you going to support QSYMM8 on src1 and QASYMM8_SIGNED on src2 in the future? Thanks!
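One standard trick (not specific to Compute Library, and only applicable to *asymmetric* types, whose zero-point is adjustable): a QASYMM8 tensor can be reinterpreted as QASYMM8_SIGNED by subtracting 128 from both the stored values and the zero-point, which leaves the represented real values unchanged. A minimal numpy sketch of the identity, with hypothetical scale/zero-point values:

```python
import numpy as np

scale, zp_u = 0.5, 100                  # hypothetical QASYMM8 parameters
q_u = np.array([0, 100, 200, 255], dtype=np.uint8)

# Real values represented by the unsigned tensor: scale * (q - zero_point).
real_u = scale * (q_u.astype(np.int32) - zp_u)

# Shift data and zero-point by 128 to get an equivalent signed tensor.
q_s = (q_u.astype(np.int32) - 128).astype(np.int8)
zp_s = zp_u - 128
real_s = scale * (q_s.astype(np.int32) - zp_s)

assert np.array_equal(real_u, real_s)   # same real values, different storage type
```

Note that this identity does not help with QSYMM8, whose zero-point is fixed at zero.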

[UPD] Please note that I modified the examples to check support for QSYMM8 and QASYMM8_SIGNED inputs. You can explore the source code here: https://github.com/eshoguli/ComputeLibrary/commit/28c57d4f8de6df37d8edd031362160d76fda079e. No validation exceptions are raised for QSYMM8 and QASYMM8_SIGNED inputs, but the output results are incorrect.

Tensor log for QASYMM8 and QASYMM8_SIGNED inputs with incorrect results (examples/neon_gemm_u8s8_s32.cpp):

./build/examples/neon_gemm_u8s8_s32
Usage: ./build/neon_gemm_qasymm8 M N K
Too few or no inputs provided. Using default M=4, N=4, K=4

q_src1 QASYMM8:
25  0  0  0 
 0 25  0  0 
 0  0 25  0 
 0  0  0 25 

q_src2 QASYMM8_SIGNED:
  0   2  -3   5 
 -7   9 -10  12 
-14  15 -17  19 
-20  22 -24  26 

Lowp GEMM output S32:
   0   50 6325  125 
6225  225 6150  300 
6050  375 5975  475 
5900  550 5800  650 

Tensor log for QASYMM8_SIGNED and QASYMM8_SIGNED inputs with correct results, as a reference (examples/neon_gemm_s8s8_s32.cpp):

./build/examples/neon_gemm_s8s8_s32
Usage: ./build/neon_gemm_qasymm8 M N K
Too few or no inputs provided. Using default M=4, N=4, K=4

find_implementation: a64_hybrid_s8s32_dot_6x16
find_implementation: a64_hybrid_s8s32_dot_6x16
find_implementation: a64_hybrid_s8s32_dot_6x16
q_src1 QASYMM8_SIGNED:
25  0  0  0 
 0 25  0  0 
 0  0 25  0 
 0  0  0 25 

q_src2 QASYMM8_SIGNED:
  0   2  -3   5 
 -7   9 -10  12 
-14  15 -17  19 
-20  22 -24  26 

Lowp GEMM output S32:
   0   50  -75  125 
-175  225 -250  300 
-350  375 -425  475 
-500  550 -600  650 
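As a side note, the incorrect u8/s8 output is exactly what you get if the negative src2 bytes are read as unsigned: each negative entry gains 256, so with src1 = 25·I each affected product gains 25 * 256 = 6400 (e.g. -75 becomes 6325). A numpy sketch reproducing both logs above:

```python
import numpy as np

src1 = 25 * np.eye(4, dtype=np.int32)                    # q_src1 from the logs
src2 = np.array([[  0,   2,  -3,   5],
                 [ -7,   9, -10,  12],
                 [-14,  15, -17,  19],
                 [-20,  22, -24,  26]], dtype=np.int8)   # q_src2 from the logs

correct = src1 @ src2.astype(np.int32)                   # signed interpretation
wrong = src1 @ src2.astype(np.uint8).astype(np.int32)    # bytes misread as unsigned

assert np.array_equal(correct[0], [0, 50, -75, 125])     # s8/s8 reference log
assert np.array_equal(wrong[0], [0, 50, 6325, 125])      # incorrect u8/s8 log
assert np.array_equal(wrong - correct, 6400 * (src2 < 0))
```

This suggests the selected kernel was treating one operand with the wrong signedness rather than failing validation.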
ramelg01 commented 3 months ago

Hi @eshoguli, by next Monday, 05 August, I will have a clear answer on when this new feature will be provided. Thanks

eshoguli commented 3 months ago

Tested on commit:

commit c5dd7753d0475ffec0f192f3181fe67a1d761680 (tag: v24.07, origin/main, origin/HEAD, main)
Author: Jenkins <bsgcomp@arm.com>
Date:   Fri Jul 26 12:07:30 2024 +0000

    Compute Library v24.07

How to easily reproduce (branch es/aarch64/neon_gemm_u8i8_support/, example files):

build: scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all build=native asserts=1 fixed_format_kernels=True validation_tests=1 examples=1 debug=0 --jobs=8 --silent

run: ./build/examples/neon_gemm_u8s8_s32_comparision

Expected result: 120 for each item of the result matrix. The actual value is 7560. Please note that if we change the signed value -2 to 2 here: https://github.com/eshoguli/ComputeLibrary/blob/es/aarch64/neon_gemm_u8i8_support/examples/neon_gemm_u8s8_f32_comparision.cpp#L174, then the results are OK.

morgolock commented 3 months ago

Hi @eshoguli

The following patch adds mixed-sign support in GEMM and has already been merged into main. I made some changes to your test neon_gemm_u8s8_f32_comparision.cpp to also compute an SGEMM reference and compare its output with the GEMMLOWP output. As you can see below, the output is -12 in both cases.

root@hikey:~/tmp/user/github# LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./neon_gemm_u8s8_f32_comparision  3 3 3
src1 F32 [6, 16, 1, 1]:
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 

src2 F32 [16, 6, 1, 1]:
-2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 
-2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 
-2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 
-2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 
-2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 
-2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 

q_src1 QASYMM8_SIGNED [6, 16, 1, 1]:
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 

q_src2 QASYMM8_SIGNED [16, 6, 1, 1]:
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 

Lowp GEMM output F32 [16, 16, 1, 1]:
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 

SGEMM F32 [16, 16, 1, 1]:
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
-12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 -12 
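This comparison can also be reproduced offline. The quantization parameters below (scale 0.2 for src1, scale 0.5 for src2, zero-points 0) are my assumptions, chosen because they map 1.0 to 5 and -2.0 to -4 exactly as in the logs; a numpy sketch of the quantize / integer-GEMM / dequantize pipeline against the float reference:

```python
import numpy as np

# Assumed quantization parameters, matching the logged tensors:
# src1: scale 0.2, zero-point 0 ->  1.0 quantizes to  5
# src2: scale 0.5, zero-point 0 -> -2.0 quantizes to -4
M, N, K = 16, 16, 6
src1 = np.full((M, K), 1.0)
src2 = np.full((K, N), -2.0)

q1 = np.round(src1 / 0.2).astype(np.int8)                # all 5
q2 = np.round(src2 / 0.5).astype(np.int8)                # all -4
acc = q1.astype(np.int32) @ q2.astype(np.int32)          # all -120 (K = 6)
dequant = acc * (0.2 * 0.5)                              # ~ -12, like GEMMLOWP + dequantize
ref = src1 @ src2                                        # -12, like SGEMM

assert np.allclose(dequant, ref)
```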

In your test I just added the following code at the end to print the output of SGEMM:

```cpp
    NEGEMM fgemm{};

    Tensor dst;
    dst.allocator()->init(TensorInfo(TensorShape(16, 16, 1, 1), 1, DataType::F32));
    fgemm.configure(&src1, &src2, nullptr, &dst, 1, 0);
    dst.allocator()->allocate();
    fgemm.run();

    // Print the SGEMM output
    std::cout << "SGEMM " << dst.info() << ":" << std::endl;
    dst.print(std::cout);

    return 0;
}
```
eshoguli commented 3 months ago

Validated: the case with QASYMM8 + QASYMM8_SIGNED inputs and F32 output is supported on https://review.mlplatform.org/ml/ComputeLibrary, thanks! Please note that the fix has not yet been applied to https://github.com/ARM-software/ComputeLibrary