OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Slower performance of OpenBLAS for inverse operation on ARMv7 #1045

Closed sohailshuaib closed 7 years ago

sohailshuaib commented 7 years ago

Hi all,

I have done some benchmarking of multiplication and inverse operations on ARMv7. I am using the Armadillo C++ library, which itself uses BLAS and LAPACK. For multiplication, the performance of OpenBLAS is far better than that of the normal BLAS. However, for the inverse, Armadillo uses LAPACK routines, which in turn call BLAS to compute the inverse via LU factorization, and here OpenBLAS is slower than the normal unoptimized BLAS. At first sight this does not make sense, because the LAPACK routines delegate the bulk of the LU factorization work to BLAS. Is some extra overhead (e.g. from vectorization) being incurred due to the particular nature of LU factorization, or is something else at play here?

I am using Buildroot to cross-compile OpenBLAS for an ARM A53.
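
For reference, a minimal sketch of the kind of benchmark described above (an illustration with assumed sizes and types, not my actual code):

```cpp
#include <armadillo>
#include <chrono>
#include <iostream>

// Time a complex single-precision multiply (BLAS cgemm underneath) against
// an inverse (LAPACK cgetrf/cgetri underneath).
int main() {
    const arma::uword n = 256;
    arma::cx_fmat A(n, n, arma::fill::randu);
    arma::cx_fmat B(n, n, arma::fill::randu);

    auto t0 = std::chrono::steady_clock::now();
    arma::cx_fmat C = A * B;
    auto t1 = std::chrono::steady_clock::now();
    arma::cx_fmat Ai = arma::inv(A);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "multiply: " << std::chrono::duration<double>(t1 - t0).count()
              << " s, inverse: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    std::cout << C(0, 0) + Ai(0, 0) << "\n";   // keep the results alive
    return 0;
}
```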

martin-frbg commented 7 years ago

Slow performance of (banded) LU factorization was reported in #901, but by all accounts it appeared to be specific to Haswell systems running OSX. Could you run your test code on some other platform to see if it is generic or limited to the ARMv7 target, or could you provide a (small, self-contained) test case?

sohailshuaib commented 7 years ago

Thanks for the comment @martin-frbg. I did run my benchmarking code on an Intel Haswell machine running CentOS, and saw similar behaviour. However, for larger matrix sizes, e.g. 1000, OpenBLAS was faster. This led me to conclude, in my first post, that some vectorization overhead could be at play. However, I could not replicate the same pattern on ARM; increasing the matrix size only made OpenBLAS take longer.

brada4 commented 7 years ago

Does it run faster with a single thread than with many?

brada4 commented 7 years ago

Can you extract the BLAS or LAPACK call that is 'slow' (or provide a code sample so we can do it)? Can you show a situation where it is 'fast', for comparison?

martin-frbg commented 7 years ago

Did you build OpenBLAS with the included LAPACK or without it (i.e. make NO_LAPACK=1)? I found out the hard way today that the LAPACK compile step within OpenBLAS does not inherit the default "-O2" optimization level of its BLAS counterpart. So if one does not set the FFLAGS environment variable or edit Makefile.system accordingly, one may end up with a set of LAPACK modules that perform noticeably worse than a regular "Release mode" build of the exact same sources from netlib...

sohailshuaib commented 7 years ago

@brada4 I only used a single thread. It is the cgetrf/cgetri routines of LAPACK, which compute the inverse via LU factorization, that are slower.
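
For reference, this is the call sequence in question, sketched through the LAPACKE C interface (illustrative only, not the code I actually benchmarked):

```cpp
#include <complex>
#include <vector>
#define LAPACK_COMPLEX_CPP        // make lapack_complex_float == std::complex<float>
#include <lapacke.h>

// Invert a dense n x n single-precision complex matrix in place:
// LU factorization (cgetrf), then inversion from the factors (cgetri).
lapack_int invert(std::vector<std::complex<float>>& A, lapack_int n) {
    std::vector<lapack_int> ipiv(n);
    lapack_int info = LAPACKE_cgetrf(LAPACK_COL_MAJOR, n, n, A.data(), n, ipiv.data());
    if (info != 0) return info;   // factorization failed (e.g. singular matrix)
    return LAPACKE_cgetri(LAPACK_COL_MAJOR, n, A.data(), n, ipiv.data());
}
```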

sohailshuaib commented 7 years ago

@martin-frbg I do indeed use NO_LAPACK=1, because my toolchain has no Fortran cross-compiler; I build a CLAPACK package separately and link it against OpenBLAS manually.

sohailshuaib commented 7 years ago

Unfortunately I do not have a self-contained test case. One has to have Armadillo installed (which is fairly straightforward) to run it. Would that be possible for you, @martin-frbg?

xianyi commented 7 years ago

OpenBLAS doesn't use ARM NEON (SIMD) instructions for the ARMv7 target. Therefore, the performance is not good.

sohailshuaib commented 7 years ago

@xianyi the performance is only poor when computing the inverse; for multiplication it is definitely good. Are there any plans to support ARM NEON instructions in the future?

sohailshuaib commented 7 years ago

@xianyi Does OpenBLAS with the ARMv8 target use ARM NEON instructions?

martin-frbg commented 7 years ago

According to #854 the ARMv8 target optimization uses Advanced SIMD (NEON) instructions.
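
For context, an illustrative fragment of what such Advanced SIMD code looks like via compiler intrinsics (not actual OpenBLAS code):

```cpp
#include <arm_neon.h>

// Illustrative only: a 4-wide multiply-accumulate, y[0..3] += a * x[0..3],
// using the NEON intrinsics that map to Advanced SIMD instructions.
void axpy4(float a, const float* x, float* y) {
    float32x4_t vx = vld1q_f32(x);        // load four floats from x
    float32x4_t vy = vld1q_f32(y);        // load four floats from y
    vy = vmlaq_n_f32(vy, vx, a);          // vy += vx * a
    vst1q_f32(y, vy);                     // store the result back to y
}
```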

sohailshuaib commented 7 years ago

@martin-frbg I used the LAPACK that comes with OpenBLAS and, as you mentioned, it is indeed worse than the standalone LAPACK. I tried passing the -O2 optimization flag in Makefile.system, but it does not seem to take effect. I also see that the OPT variable in make.inc correctly holds the -O2 optimization flag. How were you able to do it?

brada4 commented 7 years ago

Just clone a fresh copy of the develop branch.

martin-frbg commented 7 years ago

That problem should be solved in the "develop" tree now; basically you need to add the "-O2" option on the line that starts with "override FFLAGS" (around line 1000 in Makefile.system) so that it is seen by the Fortran compiler (and not only by the C compiler used when building the BLAS part).

sirop commented 7 years ago

@martin-frbg https://github.com/xianyi/OpenBLAS/issues/1045#issuecomment-271972504 gave me at least a 30% speed-up with LAPACKE_dgels on an Atom CPU with a 3x3 matrix. A big leap for me. :)

Thanks.

brada4 commented 7 years ago

Inversion/decomposition has complexity above O(dimension^3), while multiplication is below that. All the locality optimisations will only give a linear speedup; unless you make an algorithmic breakthrough, there is no chance it gets simpler.

sirop commented 7 years ago

@brada4 I believe you, but then does it mean that the multiplication within LAPACKE_dgels was so much slower before, and got so much faster now?

martin-frbg commented 7 years ago

That complexity comment was probably addressed to sohailshuaib, who started this topic, but I guess it is largely irrelevant: the issue was not that the computation is slow in general, but that it is slower when done with OpenBLAS.

sohailshuaib commented 7 years ago

@martin-frbg @brada4 @xianyi there is a further update on this topic. I have profiled LAPACK with OpenBLAS vs. netlib BLAS on an Intel Haswell machine. At first sight, an apparent difference is that one build spends its time in the level-2 routine (cgeru) where the other spends it in the level-1 routine (caxpy), which could well be the cause of the bad performance; or could there be something else at play here? I have tried perusing the code for a way to avoid this switch between level 1 and level 2, but it seems impossible. I am attaching below the profiles with the call graphs for both OpenBLAS and netlib BLAS. I did not use the LAPACK that comes with OpenBLAS. Any thoughts or ideas are welcome.

libOpenblasLapackWithLibraryDebugEnabled.txt
libBlasProfilingWithLibraryDebugEnabled.txt

martin-frbg commented 7 years ago

I am not sure if I can provide much insight here, but how exactly are you testing - what argument sizes, how many threads? (Does performance improve when you limit the number of threads created to 2 or even just 1 by setting the environment variable OPENBLAS_NUM_THREADS accordingly?)

sohailshuaib commented 7 years ago

Well, I am using a single thread only, to keep the comparison with netlib BLAS valid. The matrix I am trying to invert is of size 48, and I typically perform this operation 1024 times. Is there any possibility of reverting a particular routine to its reference implementation?

martin-frbg commented 7 years ago

I guess it would be possible to link the netlib cgeru (or equivalent C code) into your application so that the OpenBLAS implementation is not called, or alternatively to add netlib's straightforward algorithm to the function(s) under the CNAME alias in ger_thread.c so that it gets used in place of the more elaborate code whenever lda is small. (I seem to remember something similar being discussed in earlier issues, where the OpenBLAS implementation is overkill for tiny matrix sizes.)
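
For illustration, netlib's straightforward algorithm amounts to something like the following (a sketch assuming column-major storage and unit increments, not the exact netlib source):

```cpp
#include <complex>

// Reference-style rank-1 update, A := alpha * x * y^T + A (y not conjugated),
// for a column-major m x n matrix A with leading dimension lda.
void cgeru_ref(int m, int n, std::complex<float> alpha,
               const std::complex<float>* x, const std::complex<float>* y,
               std::complex<float>* A, int lda) {
    for (int j = 0; j < n; ++j) {
        const std::complex<float> t = alpha * y[j];   // scale once per column
        for (int i = 0; i < m; ++i)
            A[i + j * lda] += x[i] * t;               // update column j of A
    }
}
```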

sohailshuaib commented 7 years ago

Thanks @martin-frbg for the suggestions. However, being a newbie to OpenBLAS, I am finding it difficult to wade through its rather complex implementation. I have located the aliases you mentioned and put a few printfs around the code to get a feel for the code flow. It compiles successfully, but prints nothing. I have tried, but could not figure out whether the printfs are being rerouted or suppressed altogether by OpenBLAS. Is there any other documentation I could consult to understand the code structure/design and code flow of OpenBLAS?

ashwinyes commented 7 years ago

@sohailshuaib Are you building for ARMv8? The title says ARMv7, but in a comment you mentioned the A53. That is contradictory.

If you are indeed building for ARMv8, can you please try building with the make option "TARGET=CORTEXA57"?

sohailshuaib commented 7 years ago

@ashwinyes it is possible to build for ARMv7 on an A53 device. The discussion in this thread has evolved to be rather independent of the ARM version and now concerns the lower performance of OpenBLAS for certain operations (e.g. matrix inverse).

sohailshuaib commented 7 years ago

@martin-frbg I can corroborate your thought that OpenBLAS is overkill for tiny matrices. I was able to confirm this by profiling with larger matrices (on the order of 1000).

ashwinyes commented 7 years ago

@sohailshuaib Thanks for clarifying.

The reason I asked is that I saw the following question in one of the comments:

Does OpenBLAS with the ARMv8 target use ARM NEON instructions?

To clarify further, "TARGET=ARMV8" uses the default C implementations for the majority of the APIs, whereas "TARGET=CORTEXA57" uses the assembly implementations, which contain ARMv8 SIMD instructions.

Anyway, my suggestion doesn't look relevant to this discussion, so please ignore it.

sohailshuaib commented 7 years ago

@ashwinyes thanks for the input. Just for the record, I did try your suggestion, but to no effect.

martin-frbg commented 7 years ago

@sohailshuaib the only documentation I am aware of is https://github.com/xianyi/OpenBLAS/wiki/Developer-manual which at least provides a quick overview of the code layout. And as far as I know, fprintf(stderr,...) should work everywhere (though with all the CNAME substitution it is sometimes helpful to consult the local Makefile to find which file gets compiled into which function under the influence of which defines or un-defines).

sohailshuaib commented 7 years ago

Thanks @martin-frbg , the developer manual had good hints.

ashwinyes commented 7 years ago

@sohailshuaib One more suggestion, if you have not already tried it.

The profiling data that you shared in the file libOpenblasLapackWithLibraryDebugEnabled.txt shows that blas_thread_server is taking up the majority of the time.

You can try completely disabling threading in OpenBLAS with the make option "USE_THREAD=0".

sohailshuaib commented 7 years ago

Trying to replace a level-2 implementation in OpenBLAS with its netlib BLAS (via CLAPACK) counterpart has led me to some confusion about how complex data is handled in OpenBLAS routines. For example, CGERU (zger.c) expects FLOAT, which is mapped to the primitive float rather than to a complex datatype as in netlib BLAS. Is that the case, or am I missing something?

martin-frbg commented 7 years ago

Without looking at the code, I would suspect that the array is treated internally as paired real and imaginary parts?
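
A quick sketch of that layout assumption (a std::complex<float> array viewed as interleaved float pairs):

```cpp
#include <complex>
#include <cstdio>

// Sketch: a complex array is layout-compatible with interleaved float pairs,
// so a[k] occupies f[2*k] (real part) and f[2*k+1] (imaginary part).
int main() {
    std::complex<float> a[2] = { {1.0f, 2.0f}, {3.0f, 4.0f} };
    const float* f = reinterpret_cast<const float*>(a);
    std::printf("%g %g %g %g\n", f[0], f[1], f[2], f[3]);   // prints: 1 2 3 4
    return 0;
}
```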

sohailshuaib commented 7 years ago

I could observe some improvement after using the netlib BLAS implementations of some routines, e.g. cgeru. In the meantime I also tried the suggestion of @ashwinyes, which turned out to be the real game changer: with that option, I can report that OpenBLAS performed better than netlib BLAS for the matrix inverse. I still need to look in detail at why threading was such a bottleneck earlier, and also to test this on ARM; so far I have tested it on Intel Haswell.

ashwinyes commented 7 years ago

Good to know that this worked. Using threads for small problem sizes is obviously an overhead.

However, setting OPENBLAS_NUM_THREADS=1 with the multithreaded OpenBLAS library should also have given a similar performance improvement.

martin-frbg commented 7 years ago

Shouldn't CGERU (zger.c) use a single thread for small problem sizes automatically, thanks to jeromerobert's work on #731? Perhaps there is another similar bottleneck elsewhere.

brada4 commented 7 years ago

Without a code sample I cannot guess which of the dozen LU factorisation functions to look at... @ashwinyes can you run perf record yourprogram ; perf report -U > profile.txt and attach the file?

sohailshuaib commented 7 years ago

So the fix was either not to build OpenBLAS in a multithreaded configuration, or to enforce a single thread at runtime by setting OPENBLAS_NUM_THREADS=1. Essentially, creating and managing threads for small problems introduces overhead that far outweighs the gains from OpenBLAS. With this lesson, the issue can be closed now. Thanks to everyone who shared their suggestions and helpful remarks :).
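
For later readers, a minimal sketch of the runtime variant (assuming OpenBLAS' cblas.h, which declares openblas_set_num_threads, on a POSIX system):

```cpp
#include <cstdlib>
#include <cblas.h>   // OpenBLAS' cblas.h declares openblas_set_num_threads

int main() {
    // Either set the environment variable before the first BLAS call ...
    setenv("OPENBLAS_NUM_THREADS", "1", 1);
    // ... or ask the library directly:
    openblas_set_num_threads(1);
    // small-matrix BLAS/LAPACK work now runs single-threaded
    return 0;
}
```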