NEGEMMLowpMatrixMultiplyCore produces wrong result when zero_point > 0

Hi,

I ran into an issue where NEGEMMLowpMatrixMultiplyCore produces very large INT32 output values when the zero_point values of its input matrices are > 0. As a result, when NEGEMMLowpOutputStage takes the INT32 values and performs the fixed-point multiplication, the results overflow the QASYMM8 range and saturate at 255.

The zero_point values of input matrices can be > 0 when we quantize negative FLOAT32 numbers to QASYMM8.

I think the root cause of the issue is how the sign of zero_point is interpreted. The code comment (quoted below) states that the offsets are to be added. Based on a basic derivation, they should instead be subtracted:

C_ij = offsetC + (scaleA * scaleB / scaleC) * sum_k((A_ik - offsetA) * (B_kj - offsetB))
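
For reference, the derivation can be written out step by step. It assumes the usual affine mapping real = scale * (q - offset), which is my reading of QASYMM8 and is consistent with the QuantisationInfo values printed further down; s denotes a scale and o an offset.

```latex
\begin{align*}
r &= s\,(q - o) \\
\sum_k r^A_{ik}\, r^B_{kj} &= s_A s_B \sum_k (A_{ik} - o_A)(B_{kj} - o_B) \\
s_C\,(C_{ij} - o_C) &= s_A s_B \sum_k (A_{ik} - o_A)(B_{kj} - o_B) \\
C_{ij} &= o_C + \frac{s_A s_B}{s_C} \sum_k (A_{ik} - o_A)(B_{kj} - o_B)
\end{align*}
```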

May I know why the offset values are being added, or is there something I missed? Here is the comment in the code:

     *  -# Convert a values from QASYMM8 to int32 and add a_offset to each of them.
     *  -# Convert b values from QASYMM8 to int32 add b_offset to each of them.
     *  -# Compute the matrix product of the resulting a * b in int32.

To reproduce the issue, take the neon_gemm_qasymm8 example and change the range of the random fill from:

    if(!default_input)
    {
        fill_random_tensor(src1, 0.f, 1.f);
        fill_random_tensor(src2, 0.f, 1.f);
    }

to:

    if(!default_input)
    {
        fill_random_tensor(src1, -1.f, 1.f);
        fill_random_tensor(src2, -1.f, 1.f);
    }

Run the test:

LD_LIBRARY_PATH=build/arm64/ ./build/arm64/examples/neon_gemm_qasymm8 5 5 5
Result matrix:
-0.399418 -0.564115  0.936454 -0.827034 -0.258425 
 -0.68435 -0.847087  0.404828 -0.687768   0.73089 
 0.748448  0.482663   -0.7427 -0.377711 -0.204772 
-0.690029  0.236893  0.301625 -0.152931   0.48284 
-0.687417  0.790109 -0.930816 -0.419816  0.026661 

 -0.895911  -0.813287  -0.850851   -0.70215   0.175594 
 -0.114651  -0.574359   0.342805   0.921813   0.136532 
 -0.307088  -0.310053   0.241041   0.966209  -0.645821 
  0.492036   -0.36847   -0.13248  -0.797518    -0.7209 
 -0.678681   0.463645 0.00110865  -0.746313    0.65107 

-0.0965967   0.543415   0.481467    1.51769  -0.323982 
  -0.24853    1.50988   0.481399  0.0938432   0.474404 
  -0.54468  -0.611415  -0.600567  -0.344145   0.815945 
 0.0954776   0.611827   0.761819   0.755923   0.140993 
  0.586462   0.560915   0.687025   0.626553   0.908313 

Matrix 1: min=-0.930816, max=0.936454, QuantisationInfo(0.00732263, 127)
Matrix 2: min=-0.895911, max=0.966209, QuantisationInfo(0.00730243, 123)
Result  : min=-0.611415, max=1.51769, QuantisationInfo(0.00834945, 73)
(q_multiplier, q_shift) = (1760421504, 7)

Test Passed
 72  50 255  14  92 
 34  11 182  33 227 
229 193  26  75  99 
 33 159 168 106 193 
 33 235   0  70 131 

  0  12   6  27 147 
107  44 170 249 142 
 81  81 156 255  35 
190  73 105  14  24 
 30 186 123  21 212 

Lowp GEMM output (int32):
220755 229659 270132 290943 255083 
218821 248563 270978 265184 270972 
246404 242138 284023 290492 310498 
267529 274090 318587 319903 307051 
229983 226296 270523 270811 274585 <== TOO BIG

Output pipeline result matrix:
255 255 255 255 255 
255 255 255 255 255 
255 255 255 255 255 
255 255 255 255 255 
255 255 255 255 255 <== OVERFLOW

Expected result:
 61 138 131 255  34 
 43 254 131  84 130 
  8   0   1  32 171 
 84 146 164 164  90 
143 140 155 148 182

Note: if I change the addition to subtraction in the equation, I get the correct result for the example above.

Thanks!
