ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

NEDeconvolutionLayer f16 performance issue #1129

Open alvoron opened 1 month ago

alvoron commented 1 month ago

NEDeconvolutionLayer::run() with f16 tensors takes more time than NEDeconvolutionLayer::run() with f32 tensors. On Ampere, the f32 version takes ~66 milliseconds and the f16 version ~80 milliseconds.

ACL build command:

scons arch=armv8.6-a neon=1 os=linux opencl=0 build=native -j 64 Werror=false validation_tests=1 fixed_format_kernels=1 multi_isa=1 openmp=0 cppthreads=1

Reproducer build command:

g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include ~/avoron/acl_deconv.cpp -o bug -L./ComputeLibrary/build/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17

Reproducer run commands:

LD_LIBRARY_PATH=ComputeLibrary/build ./bug
LD_LIBRARY_PATH=ComputeLibrary/build ./bug 1

The first command uses f32 tensors; the second uses f16 tensors.

Reproducer:

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/NEON/Accessor.h"
#include "tests/AssetsLibrary.h"

#include <chrono>
#include <cstdlib>  // exit()
#include <iostream>
#include <random>   // std::random_device, std::uniform_real_distribution
#include <vector>

using namespace arm_compute;
using namespace arm_compute::test;

int main(int argc, char *argv[]) {

    // Deconvolution with stride 3x3, zero padding, FLOOR rounding.
    PadStrideInfo deconv_info = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);

    //f32 if no argument passed; f16 if any argument passed
    DataType dt = (argc == 1) ? DataType::F32 : DataType::F16;

    TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 640, 360, 1), 1, dt, DataLayout::NHWC);
    TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, dt, DataLayout::NHWC);
    TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 1920, 1080, 1), 1, dt, DataLayout::NHWC);
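    // Shape check (TensorShape lists NHWC dimensions as C, W, H, N): with stride 3
    // and no padding, out = (in - 1) * stride + kernel, so 640 -> (640 - 1) * 3 + 3 = 1920
    // and 360 -> (360 - 1) * 3 + 3 = 1080, matching the 4-channel 1920x1080 output.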

    auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconv_info);
    if(status.error_code() != ErrorCode::OK) {
      std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
      exit(1);
    }
    std::cout << "PASSED VALIDATION" << std::endl;

    Tensor srcTensor;
    Tensor weiTensor;
    Tensor dstTensor;

    srcTensor.allocator()->init(srcTensorInfo);
    weiTensor.allocator()->init(weiTensorInfo);
    dstTensor.allocator()->init(dstTensorInfo);

    NEDeconvolutionLayer deconv;
    deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info);
    std::cout << "PASSED CONFIGURATION" << std::endl;

    srcTensor.allocator()->allocate();
    weiTensor.allocator()->allocate();
    dstTensor.allocator()->allocate();

    // Fill input and weights with random values in [0, 100).
    AssetsLibrary library(".", std::random_device()());
    std::uniform_real_distribution<> distribution{ 0.0f, 100.0f };
    library.fill(Accessor(srcTensor), distribution, 0);
    library.fill(Accessor(weiTensor), distribution, 0);

    // Warm-up run, excluded from the timing below.
    deconv.run();

    // Time 100 runs and report the average per-run latency in microseconds.
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) deconv.run();
    auto finish = std::chrono::high_resolution_clock::now();
    std::cout << "PASSED RUN: " << std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count() / 100 << std::endl;

    srcTensor.allocator()->free();
    weiTensor.allocator()->free();
    dstTensor.allocator()->free();

    return 0;
}
morgolock commented 1 month ago

Hi @alvoron

Thanks. I can reproduce the problem. FP32 performance for this specific configuration is better than FP16. It will require further investigation.

morgolock commented 3 weeks ago

Hi @alvoron

The following patch solves the problem.

Make sure that in your test you enable fast_math when calling NEDeconvolutionLayer::configure()

See the following change to your test:

NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info, /* enable fast math */ true);
std::cout << "PASSED CONFIGURATION" << std::endl;
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 1
F16
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 151639
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 
F32
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 221537
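
For consistency you may also want to pass the flag to the validation call. In recent Compute Library versions, NEDeconvolutionLayer::validate() accepts the same enable_fast_math parameter (worth double-checking against the headers of the release you build against):

auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconv_info, /* enable fast math */ true);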

Hope this helps.

alvoron commented 1 week ago

@morgolock thank you for the patch, it works for me as well. However, the gap I see between f32 and f16 is smaller than yours: 65-67 ms on f32 and 60-62 ms on f16. What machine did you use to get the results you shared above?

morgolock commented 1 week ago

Hi @alvoron

I ran this on Neoverse N1.

I built the library with scons -j32 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 validation_tests=1 os=linux arch=armv8a build=native multi_isa=1 fixed_format_kernels=1 openmp=1 cppthreads=0 asserts=0 logging=0 -j8

Make sure you use openmp=1 cppthreads=0
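
If you want the timings to be stable and comparable across machines, you can also pin the scheduler's thread count at runtime. This is a minimal sketch using the generic IScheduler interface; the thread count of 8 is an arbitrary example value:

#include "arm_compute/runtime/Scheduler.h"

// Fix the number of worker threads used by run(), so timings do not
// depend on the scheduler's default (typically one thread per core).
arm_compute::Scheduler::get().set_num_threads(8); // example value, tune per machine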

Hope this helps.