Hi guys, I'm extremely interested in speeding up int8 MatMul inference with an ARM Compute Library kernel. My model is:
graph TD;
Input1["Input
out: fp32"]
Quantise1["NEQuantizationLayer
out: signed int8"]
Input2["Input
out: fp32"]
Quantise2["NEQuantizationLayer
out: signed int8"]
MatMul["NEGEMMLowpMatrixMultiplyCore
out: signed int8"]
Input1-->Quantise1;
Input2-->Quantise2;
Quantise1-->MatMul;
Quantise2-->MatMul;
MatMul-->Result;
To make this possible, I would like to use NEGEMMLowpMatrixMultiplyCore.
I have explored the examples and found that the most suitable one is https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_gemm_qasymm8.cpp. As I understand it, GEMMLowpOutputStageInfo is used to requantise the output tensor. Unfortunately, it's a standalone operation there. I didn't find any example showing how to requantise the output tensor inside a single NEGEMMLowpMatrixMultiplyCore kernel to avoid additional memory read/write operations.
While looking at the NEGEMMLowpMatrixMultiplyCore kernel implementation, I found that the fusion is possible:
I changed a few lines of the neon_gemm_qasymm8.cpp example to get a working version. The commit: https://github.com/eshoguli/ComputeLibrary/commit/e4e38c53dd3a7b8ea75f2d30c500c80168f13ae2. But I didn't find any details about set_gemmlowp_output_stage in the documentation or examples. So, could I ask you guys to quickly review the changes, to be absolutely sure that fusing GEMMLowpOutputStageInfo into NEGEMMLowpMatrixMultiplyCore is correct?
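To check the fused output independently of ACL, the expected result can be sketched in plain C++. This is only a minimal reference under stated assumptions, not ACL code: `QParams`, `quantize`, `requantize` and `reference_qgemm` are hypothetical names, matrices are assumed row-major, and the requantisation uses a single float multiplier `scale_a * scale_b / scale_out`. As far as I understand, the ACL output stage uses a fixed-point multiplier and shift instead, so results may differ by one LSB on rounding ties.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not an ACL type): real = scale * (q - offset).
struct QParams {
    float   scale;
    int32_t offset;
};

// Saturate a wide integer to the signed int8 range.
static int8_t saturate_s8(long v) {
    return static_cast<int8_t>(std::min<long>(std::max<long>(v, -128), 127));
}

// fp32 -> signed int8, the step performed by the quantisation layers in the graph.
int8_t quantize(float x, QParams qp) {
    return saturate_s8(std::lround(x / qp.scale) + qp.offset);
}

// int32 accumulator -> signed int8, the step the output stage is expected to do.
// Assumption: a float multiplier stands in for the fixed-point multiplier/shift.
int8_t requantize(int32_t acc, float multiplier, int32_t out_offset) {
    return saturate_s8(std::lround(static_cast<float>(acc) * multiplier) + out_offset);
}

// C[M x N] = A[M x K] * B[K x N], all row-major, with int32 accumulation
// followed by requantisation of every output element.
std::vector<int8_t> reference_qgemm(const std::vector<int8_t>& a, QParams qa,
                                    const std::vector<int8_t>& b, QParams qb,
                                    QParams qc,
                                    std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K && b.size() == K * N);
    const float multiplier = qa.scale * qb.scale / qc.scale;
    std::vector<int8_t> c(M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (std::size_t k = 0; k < K; ++k) {
                // Subtract the zero points before multiplying.
                acc += (static_cast<int32_t>(a[i * K + k]) - qa.offset) *
                       (static_cast<int32_t>(b[k * N + j]) - qb.offset);
            }
            c[i * N + j] = requantize(acc, multiplier, qc.offset);
        }
    }
    return c;
}
```

Running the fused NEGEMMLowpMatrixMultiplyCore and this reference on the same quantised inputs and comparing element by element (allowing a difference of 1) would show whether the fused output stage behaves as expected.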