NEGEMMLowpMatrixMultiplyCore: GEMMLowpOutputStageInfo fusing to speed up inference

Hi guys, I'm extremely interested in speeding up int8 MatMul inference with the ARM Compute Library kernels. My model is:

graph TD;
    Input1["Input
    out: fp32"]
    Quantise1["NEQuantizationLayer
    out: signed int8"]
    Input2["Input
    out: fp32"]
    Quantise2["NEQuantizationLayer
    out: signed int8"]
    MatMul["NEGEMMLowpMatrixMultiplyCore
    out: signed int8"]

    Input1-->Quantise1;
    Input2-->Quantise2;
    Quantise1-->MatMul;
    Quantise2-->MatMul;
    MatMul-->Result;

To do this, I would like to use NEGEMMLowpMatrixMultiplyCore.

I have explored the examples and found that the most suitable one is https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_gemm_qasymm8.cpp. As I understand it, GEMMLowpOutputStageInfo is used to requantise the output tensor. Unfortunately, there it is a standalone operation. I didn't find any example showing how to requantise the output tensor inside a single NEGEMMLowpMatrixMultiplyCore call, which would avoid additional memory read/write operations.
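For reference, the arithmetic that requantisation performs on each accumulator can be sketched in plain C++ as follows. This is only a floating-point reference under stated assumptions, not the library's kernel: the function name and all parameter names are illustrative, the output is assumed to be unsigned 8-bit (as in the QASYMM8 example), and the input zero points are assumed to have been accounted for already.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Reference requantisation of one int32 accumulator to an unsigned 8-bit value.
// All names here are illustrative; none of them are ACL API.
inline std::uint8_t requantize_reference(std::int32_t acc,
                                         float scale_a, float scale_b,
                                         float scale_out, std::int32_t offset_out)
{
    assert(scale_out > 0.0f);
    // The accumulator represents the real value acc * scale_a * scale_b
    // (assumption: input zero points were already subtracted out).
    const float multiplier = (scale_a * scale_b) / scale_out;
    // Rescale, round to nearest, then add the output zero point.
    const long rounded = std::lround(static_cast<float>(acc) * multiplier) + offset_out;
    // Saturate to the representable range of the 8-bit output.
    return static_cast<std::uint8_t>(std::max(0L, std::min(255L, rounded)));
}
```

With scales 0.5, 0.5 and 2.0 the multiplier is 0.125, so an accumulator of 1000 with an output offset of 10 maps to 135. Running this step as a separate pass means the int32 result is written to memory and read back, which is the traffic that fusing would avoid.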

While reading the NEGEMMLowpMatrixMultiplyCore implementation, I found that fusing is possible:

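// `info` is the GEMMLowpOutputStageInfo describing the requantisation
GEMMInfo gemm_info;
gemm_info.set_gemmlowp_output_stage(info);
// The result tensor is created as QASYMM8, so no separate output stage is run
q_res.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::QASYMM8));
qgemm.configure(&q_src1, &q_src2, nullptr, &q_res, gemm_info);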
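As far as I understand, a fixed-point output stage does not use the floating-point multiplier directly. It uses an integer multiplier plus a shift. The sketch below shows one way such a pair can be derived and applied in plain C++. It is an assumption-laden illustration rather than the library's code: `FixedPointMultiplier`, `decompose_multiplier` and `apply_multiplier` are hypothetical names, the multiplier is assumed to lie strictly between 0 and 1, and the rounding of ties here may differ from what the actual kernels do.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper type (not part of ACL): m ~= mantissa * 2^-31 * 2^-right_shift.
struct FixedPointMultiplier
{
    std::int32_t mantissa;    // Q0.31 value in [2^30, 2^31)
    int          right_shift; // extra right shift, >= 0
};

// Split a real multiplier 0 < m < 1 into a Q0.31 mantissa and a right shift.
inline FixedPointMultiplier decompose_multiplier(double m)
{
    assert(m > 0.0 && m < 1.0);
    int exponent = 0;
    const double q = std::frexp(m, &exponent); // m = q * 2^exponent, q in [0.5, 1)
    std::int64_t q_fixed = std::llround(q * static_cast<double>(1LL << 31));
    int right_shift = -exponent;
    if (q_fixed == (1LL << 31)) // rounding pushed the mantissa up to 1.0
    {
        q_fixed /= 2;
        --right_shift;
    }
    assert(right_shift >= 0 && right_shift <= 31); // rejects m too close to 1 or too small
    return {static_cast<std::int32_t>(q_fixed), right_shift};
}

// Multiply an accumulator by the fixed-point multiplier using integers only.
// Rounds to nearest, ties away from zero (an assumption of this sketch).
inline std::int32_t apply_multiplier(std::int32_t acc, FixedPointMultiplier fp)
{
    const int total_shift = 31 + fp.right_shift;
    const std::int64_t product = static_cast<std::int64_t>(acc) * fp.mantissa;
    const std::int64_t magnitude = product < 0 ? -product : product;
    const std::int64_t rounding = std::int64_t{1} << (total_shift - 1);
    const std::int64_t scaled = (magnitude + rounding) >> total_shift;
    return static_cast<std::int32_t>(product < 0 ? -scaled : scaled);
}
```

For a multiplier of 0.125 this yields a mantissa of 2^30 with a right shift of 2, and an accumulator of 1000 is scaled to 125. The output offset and the clamping to the 8-bit range would still be applied afterwards.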

I changed a few lines of the neon_gemm_qasymm8.cpp example to get a working version. The commit: https://github.com/eshoguli/ComputeLibrary/commit/e4e38c53dd3a7b8ea75f2d30c500c80168f13ae2. But I didn't find any details about set_gemmlowp_output_stage in the documentation or examples. So could you guys please quickly review the changes, so I can be absolutely sure that fusing GEMMLowpOutputStageInfo into NEGEMMLowpMatrixMultiplyCore is done correctly?
