ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Low performance with NETranspose on aarch64 #1045

Closed · yd2102 closed this issue 1 year ago

yd2102 commented 1 year ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v23.02.1 Build options: {'Werror': '1', 'debug': '0', 'neon': '1', 'opencl': '0', 'os': 'linux', 'openmp': '1', 'cppthreads': '0', 'arch': 'armv8.2-a', 'multi_isa': '1', 'build': 'native'} Git hash=b'd8bf9b53752a4f573120cf51b31055de8b3c7d29'

Platform: AWS Graviton3 aarch64 (ARMv8.4-a)

Operating System: 23~22.04.1-Ubuntu

Problem description:

Hi,

I am experiencing low performance when trying to compute ABᵀ, where A and B are matrices of shapes [M, K] and [N, K] respectively. This pattern of computation is very common in modern transformer-based ML models, so it is important that it is computed efficiently.

What I've found so far in ACL's repo is that, in order to compute ABᵀ, we need to compute Bᵀ first and then compute the dot product of A and Bᵀ.

Based on the results from the Linux profiler, more than 60% of the time is spent on the matrix transpose, which is unexpected because a transpose is a much more lightweight operation than the GEMM itself.

So the questions are: 1) Is there a more optimized NETranspose kernel in ACL, other than "transpose_32bit_elements", that I can configure (my processor supports Arm SVE)? 2) An even more optimized approach would be to handle ABᵀ inside GEMM's tiled kernel, without computing the transpose and the GEMM separately. Does ACL support this kind of fused computation?

Thanks!

The Linux profiler shows that more than 60% of the time is spent on the matrix transpose:

# Overhead  Command     Shared Object        Symbol                                                                                                                                                    
# ........  ..........  ...................  ..........................................................................................................................................................
#
    68.80%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::kernels::(anonymous namespace)::transpose_32bit_elements
    30.54%  neon_sgemm  libarm_compute.so    [.] arm_gemm::sve_hybrid_fp32_mla_6x4VL
     0.21%  neon_sgemm  libarm_compute.so    [.] arm_gemm::GemmHybridIndirect<arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL, float, float, arm_gemm::Nothing, false, false>::execute
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::kernels::CpuTransposeKernel::run_op
     0.04%  neon_sgemm  libstdc++.so.6.0.30  [.] std::__detail::_Prime_rehash_policy::_M_need_rehash
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::CpuGemm::run
     0.04%  neon_sgemm  libc.so.6            [.] _int_free
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::Scheduler::get
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_gemm::(anonymous namespace)::run_hybrid_kernel<arm_gemm::Nothing, false, false>::run<arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL, float, float, float>
     0.02%  neon_sgemm  libarm_compute.so    [.] arm_compute::ITensorPack::get_const_tensor@plt

Here's a short version of my code that reproduces the problem (single-threaded):

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "utils/Utils.h"

#include <cstdlib>
#include <chrono>

using namespace arm_compute;

static const size_t M = 10;
static const size_t N = 768;
static const size_t K = 768;
static const size_t iterations = 100000;
static const float alpha = 1.f;
static const float beta = 0.f;

void benchmark(IFunction *trans, IFunction *gemm, const int threads)
{
    printf("Using %d threads...\n", threads);

    // Use specified number of threads
    NEScheduler::get().set_num_threads(threads);

    // Warm up kernel
    for (size_t i = 0; i < 100; i++)
    {
        if (trans)
        {
            trans->run();
        }

        gemm->run();
    }

    size_t total = threads * iterations;
    auto start = std::chrono::steady_clock::now();

    // Execute kernel
    for (size_t i = 0; i < total; i++)
    {
        if (trans)
        {
            trans->run();
        }

        gemm->run();
    }

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = stop - start;
    double time = diff.count();

    printf("%f ms/iter\n", 1e3 * time / total);
}

void test_gemm()
{
    printf("M=%ld, N=%ld, K=%ld\n", M, N, K);

    Tensor      src0;
    Tensor      src1;
    Tensor      dst;
    NEGEMM      fgemm;

    // Populate tensor information
    src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure kernel
    fgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

    // Allocate all tensors
    src0.allocator()->allocate();
    src1.allocator()->allocate();
    dst.allocator()->allocate();

    // Initialize random inputs
    utils::fill_random_tensor(src0, -1.f, 1.f);
    utils::fill_random_tensor(src1, -1.f, 1.f);

    // Run benchmarking
    benchmark(nullptr, &fgemm, 1);
}

void test_gemm_transpose()
{
    printf("M=%ld, N=%ld, K=%ld\n", M, N, K);

    Tensor      src0;
    Tensor      src1;
    Tensor      src1t;
    Tensor      dst;
    NETranspose trans;
    NEGEMM      fgemm;

    // Populate tensor information
    src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src1.allocator()->init(TensorInfo(TensorShape(K, N), 1, DataType::F32));
    src1t.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure kernel
    trans.configure(&src1, &src1t);
    fgemm.configure(&src0, &src1t, nullptr, &dst, alpha, beta);

    // Allocate all tensors
    src0.allocator()->allocate();
    src1.allocator()->allocate();
    src1t.allocator()->allocate();
    dst.allocator()->allocate();

    // Initialize random inputs
    utils::fill_random_tensor(src0, -1.f, 1.f);
    utils::fill_random_tensor(src1, -1.f, 1.f);

    // Run benchmarking
    benchmark(&trans, &fgemm, 1);
}

int main(int argc, char **argv)
{
    (void)argc;
    (void)argv;

    // test_gemm();
    test_gemm_transpose();

    return 0;
}
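
For reference, here is a variant of the benchmark above (a sketch, not part of the original reproducer) that runs the transpose a single time before the timed loop. Since B does not change between iterations, this separates the one-off NETranspose cost from the per-iteration GEMM cost and can be compared directly against the numbers printed by benchmark():

void benchmark_transpose_once(IFunction *trans, IFunction *gemm, const int threads)
{
    NEScheduler::get().set_num_threads(threads);

    // B is constant across iterations, so one transpose is enough
    if (trans)
    {
        trans->run();
    }

    // Warm up kernel
    for (size_t i = 0; i < 100; i++)
    {
        gemm->run();
    }

    size_t total = threads * iterations;
    auto start = std::chrono::steady_clock::now();

    // Time the GEMM alone
    for (size_t i = 0; i < total; i++)
    {
        gemm->run();
    }

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = stop - start;

    printf("%f ms/iter (GEMM only)\n", 1e3 * diff.count() / total);
}

Calling benchmark_transpose_once(&trans, &fgemm, 1) at the end of test_gemm_transpose() gives a GEMM-only figure to set against the transpose+GEMM numbers above.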
GGGGxxxxxxxxr commented 1 year ago

NETranspose and NEReshaping are both relatively slow on Armv8 according to my own tests. I'm seeing the same issue.

yd2102 commented 1 year ago

There appear to be fixed-format GEMM kernels suitable for computing ABᵀ, but I'm not sure whether such kernels are usable in this case. Is there example code that shows how I can compute ABᵀ using the fixed-format Neon kernels?

nSircombe commented 1 year ago

@yd2102

I think the only practical example at present is within this oneDNN PR: https://github.com/oneapi-src/oneDNN/pull/1590 (the changes to matmul, and the supporting acl_matmul_utils.cpp, show how an NETranspose of the "B" matrix is absorbed into a re-order to the memory format expected by the fixed-format kernels).
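
For anyone trying this directly against ACL (outside oneDNN), here is a minimal sketch of the query side of that approach. The names used (GEMMInfo::set_fixed_format, GEMMInfo::set_weight_format, WeightFormat::ANY, NEGEMM::has_opt_impl) are assumptions based on the fixed-format support the oneDNN PR builds on, so check them against the headers of your ACL release; the re-order of B into the returned weight format, which is the step that absorbs the NETranspose, is omitted here and is shown in acl_matmul_utils.cpp in the PR.

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"

using namespace arm_compute;

// Sketch only: ask whether a fixed-format f32 GEMM kernel exists for the given
// shapes and, if so, which blocked layout it expects the B matrix to be in.
// API names are assumptions (see the note above); verify against your ACL version.
bool fixed_format_available(const ITensorInfo *a, const ITensorInfo *b,
                            const ITensorInfo *d, WeightFormat &expected)
{
    GEMMInfo info;
    info.set_fixed_format(true);
    info.set_weight_format(WeightFormat::ANY);

    // On success, 'expected' describes the layout B must be re-ordered into;
    // that re-order replaces the separate NETranspose + default GEMM path.
    expected = WeightFormat::ANY;
    const Status status = NEGEMM::has_opt_impl(expected, a, b, nullptr, d,
                                               1.f /* alpha */, 0.f /* beta */, info);
    return status.error_code() == ErrorCode::OK;
}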