Open robertjharrison opened 4 years ago
This was implemented for SkylakeX SGEMM fairly recently (see interface/gemm.c and the x86_64 sgemm_kernel_direct it calls) but not ported to other architectures and functions yet.
You can get locking by setting OPENBLAS_NUM_THREADS=1 and using the pthreads version.
Thanks Martin ... I will look at gemm.c and x86_64 sgemm_kernel_direct and see what gap needs filling for ARM64.
My app performs many small dgemms, each invoked by a separate thread (via a task pool). As recommended, I compiled OpenBLAS 0.3.10 with USE_THREAD=0 and USE_LOCKING=1. This is on Cavium ThunderX2 with gcc 9.2.0.
The scaling with threads is really bad, presumably because of the locking. Replacing the OpenBLAS dgemm call with a hand-written kernel using NEON intrinsics gives comparable single-thread performance, but with 30 threads pinned to the cores of a single socket (which has 32 physical cores) the hand-written code is about 6x faster, entirely due to superior thread scaling.
The dominant dgemm sizes are (12,144)^T (12,12) and (16,256)^T (16,16).
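To make those shapes concrete, here is a minimal plain-C sketch of the operation C = A^T * B. It is not the NEON kernel mentioned above and not an OpenBLAS routine. It assumes (12,144)^T (12,12) means A is 12 x 144, B is 12 x 12, and the result is 144 x 12, and it assumes row-major storage. The name small_dgemm_tn is hypothetical. The kernel touches only the buffers passed in by the caller, so concurrent calls on separate buffers need no lock.

```c
#include <stddef.h>

/* Hypothetical reference kernel: C (m x n) = A^T * B.
   A is k x m, B is k x n, C is m x n, all row-major (an assumption).
   For the first shape in the post: k = 12, m = 144, n = 12.
   No static or global scratch memory is used, so each thread can call
   this on its own buffers without any locking. */
static void small_dgemm_tn(size_t m, size_t n, size_t k,
                           const double *a, const double *b, double *c)
{
    /* Start from a zeroed result. */
    for (size_t i = 0; i < m * n; ++i)
        c[i] = 0.0;

    /* Accumulate one rank-1 update per row p of A and B. */
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < m; ++i) {
            const double a_pi = a[p * m + i];   /* A[p][i] == (A^T)[i][p] */
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += a_pi * b[p * n + j];
        }
    }
}
```

For the second shape, the same call would use m = 256, n = 16, k = 16.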
I wonder if there is a lower-level interface to your optimized small-matrix kernels that I can invoke that bypasses the use of static memory blocks that need the lock protection?
Also, please note that the default RedHat EL8.3 openblas_serial package is not compiled with USE_LOCKING and so produces incorrect results in this use case.
Finally, many thanks for OpenBLAS ... it is a tremendously valuable tool and I appreciate the effort it takes to make it happen.
Thanks
Robert