Open alvoron opened 1 month ago
Hi @alvoron
Thanks. I can reproduce the problem. FP32 performance for this specific configuration is better than FP16. It will require further investigation.
Hi @alvoron
The following patch solves the problem.
Make sure that in your test you enable fast_math when calling NEDeconvolutionLayer::configure().
See the following change in your test:
NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info, /* enable fast_math */ true);
std::cout << "PASSED CONFIGURATION" << std::endl;
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 1
F16
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 151639
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test
F32
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 221537
Hope this helps.
@morgolock thank you for the patch, it works for me as well. However, my gap between f32 and f16 is not as large as yours: I get 65-67 ms on f32 and 60-62 ms on f16. What machine was used to get the results you shared above?
Hi @alvoron
I ran this on Neoverse N1.
I built the library with scons -j32 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 validation_tests=1 os=linux arch=armv8a build=native multi_isa=1 fixed_format_kernels=1 openmp=1 cppthreads=0 asserts=0 logging=0
Make sure you use openmp=1 cppthreads=0
Hope this helps
NEDeconvolutionLayer::run() with f16 tensors takes more time than NEDeconvolutionLayer::run() with f32 tensors. On Ampere, the f32 version takes ~66 milliseconds, the f16 version ~80 milliseconds.
ACL build command:
Reproducer build command:
Reproducer run commands:
The first command uses f32 tensors, the second one f16 tensors.
Reproducer: