ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

NEGEMMLowpMatrixMultiplyCore support type #1027

Closed zhen-jia closed 1 year ago

zhen-jia commented 1 year ago

Problem description: I am confused about the data types supported by NEGEMMLowpMatrixMultiplyCore. The example (https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_gemm_qasymm8.cpp#L220) uses input data type QASYMM8 and output data type S32. But when I read the code, I would expect this check to trigger an error message: https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp#L792 However, I can run the example without seeing the error message, even though the condition (DataType::QASYMM8 && d->data_type() != DataType::QASYMM8) is true. I am confused. Could you help explain which data types NEGEMMLowpMatrixMultiplyCore supports? Thanks!

morgolock commented 1 year ago

Hi @zhen-jia

NEGEMMLowpMatrixMultiplyCore is implemented using CpuGemmLowpMatrixMultiplyCore, see details in https://github.com/ARM-software/ComputeLibrary/blob/main/src/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.cpp#L65

When you call NEGEMMLowpMatrixMultiplyCore::validate() you end up calling CpuGemmLowpMatrixMultiplyCore::validate() which supports S32 as can be seen in https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuGemmLowpMatrixMultiplyCore.cpp#L313

You can see the data types accepted by CpuGemmLowpMatrixMultiplyCore in https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuGemmLowpMatrixMultiplyCore.h#L78

 /** Initialise the kernel's inputs, output
     *
     * Valid data layouts:
     * - NHWC
     * - NCHW
     *
     * Valid data type configurations:
     * |src0           |src1               |src2     |dst            |
     * |:--------------|:------------------|:--------|:--------------|
     * |QASYMM8        |QASYMM8            |S32      |QASYMM8        |
     * |QASYMM8        |QSYMM8_PER_CHANNEL |S32      |QASYMM8        |
     * |QASYMM8        |QSYMM8             |S32      |QASYMM8        |
     * |QASYMM8        |QASYMM8            |S32      |S32            |
     * |QASYMM8        |QSYMM8_PER_CHANNEL |S32      |S32            |
     * |QASYMM8        |QSYMM8             |S32      |S32            |
     * |QASYMM8_SIGNED |QASYMM8_SIGNED     |S32      |QASYMM8_SIGNED |
     * |QASYMM8_SIGNED |QSYMM8_PER_CHANNEL |S32      |QASYMM8_SIGNED |
     * |QASYMM8_SIGNED |QSYMM8             |S32      |QASYMM8_SIGNED |
     * |QASYMM8_SIGNED |QASYMM8_SIGNED     |S32      |S32            |
     * |QASYMM8_SIGNED |QSYMM8_PER_CHANNEL |S32      |S32            |
     * |QASYMM8_SIGNED |QSYMM8             |S32      |S32            |
     */

CpuGemmAssemblyDispatch is a different class used internally in ACL to run assembly kernels.
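To see why the check in CpuGemmAssemblyDispatch need not surface as a user-visible error, here is a rough sketch in Python (for illustration only; ACL is C++, and the function names below are hypothetical, not ACL's API): a dispatcher can call the assembly path's validate(), receive an error *status* rather than aborting, and simply fall back to the generic kernels.

```python
# Hypothetical sketch of validate-then-fall-back dispatch. See
# CpuGemmLowpMatrixMultiplyCore.cpp for the real control flow in ACL.

def assembly_validate(src_type, dst_type):
    # Mirrors the flavour of check discussed above: the assembly path
    # may reject a QASYMM8 input paired with a non-QASYMM8 destination.
    # Returning False here is a status, not a hard failure.
    if src_type == "QASYMM8" and dst_type != "QASYMM8":
        return False
    return True

def gemmlowp_validate(src_type, dst_type):
    # Operator-level validation accepts S32 destinations (per the
    # doc-comment table above) ...
    if dst_type not in ("S32", "QASYMM8", "QASYMM8_SIGNED"):
        return False
    # ... and only selects the assembly path when its validate passes;
    # otherwise a generic (non-assembly) kernel path is used instead.
    use_assembly = assembly_validate(src_type, dst_type)
    return True  # valid either way; use_assembly only picks the path

print(gemmlowp_validate("QASYMM8", "S32"))  # True: falls back, no error
```

This is one way to picture why the example runs cleanly: the condition the reporter found guards only the assembly path, not the operator as a whole.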

Hope this helps.

zhen-jia commented 1 year ago

Thanks @morgolock for the help. One more question: PyTorch uses a fused kernel (GEMM and de-quantization fused into one assembly kernel); they actually use a dynamically quantized QNNPACK kernel. I am wondering whether ACL has a kernel like that. If I understand correctly, this folder (https://github.com/ARM-software/ComputeLibrary/tree/main/src/core/NEON/kernels/arm_gemm/kernels) only contains general GEMMs. Correct me if I am wrong. Thanks a lot.

morgolock commented 1 year ago

Hi @zhen-jia

You can find the highly optimized GEMM kernels in the folder https://github.com/ARM-software/ComputeLibrary/tree/main/src/core/NEON/kernels/arm_gemm/kernels

The quantization is handled in https://github.com/ARM-software/ComputeLibrary/blob/main/src/core/NEON/kernels/arm_gemm/quantized.cpp#L59
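For intuition, the arithmetic a fused quantized GEMM performs can be sketched as: accumulate int8/uint8 products into int32 with zero-point correction, then requantize the int32 accumulators back to 8-bit. The Python below is an illustration only (not ACL's API; all scales and zero points are made-up example values):

```python
# Toy sketch of quantized GEMM followed by requantization.

def qgemm(a, b, a_zp, b_zp):
    """int32 C = (A - a_zp) @ (B - b_zp); A is MxK, B is KxN."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum((a[i][p] - a_zp) * (b[p][j] - b_zp) for p in range(k))
             for j in range(n)] for i in range(m)]

def requantize(c, scale, out_zp):
    """Map int32 accumulators to uint8 with a real-valued scale.
    (ACL's quantized.cpp uses fixed-point multipliers instead of a
    float scale, but the effect is the same.)"""
    return [[max(0, min(255, round(v * scale) + out_zp)) for v in row]
            for row in c]

a = [[130, 120], [128, 140]]   # uint8 data, zero point 128
b = [[10, 0], [0, 10]]         # uint8 data, zero point 0
acc = qgemm(a, b, 128, 0)      # int32 accumulators
out = requantize(acc, 0.05, 128)
print(acc)  # [[20, -80], [0, 120]]
print(out)  # [[129, 124], [128, 134]]
```

Fusing means the requantize step runs on the accumulator tile while it is still in registers, instead of writing the int32 result to memory and reading it back in a second pass.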

Hope this helps.