OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Thread scaling of concurrent small dgemm operations #2712

Open robertjharrison opened 4 years ago

robertjharrison commented 4 years ago

My app performs many small dgemms, each invoked from a separate thread (via a task pool). As recommended, I compiled OpenBLAS 0.3.10 with USE_THREAD=0 and USE_LOCKING=1. This is on a Cavium ThunderX2 with gcc 9.2.0.
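For reference, the build described above corresponds to passing those flags to make (a sketch of the usual OpenBLAS build invocation; adjust the compiler and target to your environment):

```shell
# Single-threaded OpenBLAS with internal locking, so it is safe to call
# concurrently from the application's own threads.
make CC=gcc USE_THREAD=0 USE_LOCKING=1
```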

The scaling with threads is really bad, presumably because of the locking. Replacing the OpenBLAS dgemm call with a hand-written kernel using NEON intrinsics gives comparable single-thread performance, but with 30 threads pinned to the cores of a single socket (which has 32 physical cores) the hand-written code is about 6x faster, entirely due to superior thread scaling.

The dominant dgemm sizes are (12,144)^T (12,12) and (16,256)^T (16,16).
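To make the shapes concrete: (12,144)^T (12,12) means C = A^T * B with A 12x144 and B 12x12, giving a 144x12 result. A minimal portable reference for that operation (a plain-C sketch of the shape only, not the user's NEON kernel; row-major storage assumed):

```c
#include <stddef.h>

/* Computes C = A^T * B for small operands, e.g. A 12x144, B 12x12,
 * C 144x12. All arrays row-major and non-overlapping. */
static void small_dgemm_tn(size_t k, size_t m, size_t n,
                           const double *a, /* k x m */
                           const double *b, /* k x n */
                           double *c)       /* m x n */
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += a[p * m + i] * b[p * n + j]; /* (A^T)[i][p] * B[p][j] */
            c[i * n + j] = sum;
        }
}
```

For the dominant case above this would be called as `small_dgemm_tn(12, 144, 12, a, b, c)`.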

I wonder if there is a lower-level interface to your optimized small-matrix kernels that I could invoke, one that bypasses the static memory buffers that need lock protection?

Also, please note that the default RedHat EL8.3 openblas_serial package is not compiled with USE_LOCKING and so produces incorrect results in this use case.

Finally, many thanks for OpenBLAS ... it is a tremendously valuable tool and I appreciate the effort it takes to make it happen.

Thanks

Robert

martin-frbg commented 4 years ago

This was implemented for SkylakeX SGEMM fairly recently (see interface/gemm.c and the x86_64 sgemm_kernel_direct it calls) but not ported to other architectures and functions yet.
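The idea behind that "direct" path can be sketched as a size check in the gemm interface that routes small problems to a kernel reading A and B in place, skipping the shared packing buffers and the lock that guards them. The names and cutoff below are illustrative, not OpenBLAS's actual code:

```c
#include <stddef.h>

/* Illustrative flop cutoff below which packing overhead dominates and
 * the direct (unpacked, lock-free) kernel wins. Hypothetical value. */
enum { DIRECT_GEMM_LIMIT = 64 * 64 * 64 };

/* Dispatch decision: small problems take the direct path. */
static int use_direct_gemm(size_t m, size_t n, size_t k)
{
    return m * n * k <= (size_t)DIRECT_GEMM_LIMIT;
}
```

Both dominant sizes from this issue (144x12x12 and 256x16x16) fall well under such a cutoff, so they would take the direct path.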

brada4 commented 4 years ago

You can get locking by using the pthreads version and setting OPENBLAS_NUM_THREADS=1.
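That is, build (or install) the pthread-enabled OpenBLAS and cap each BLAS call at one thread, so the library does not spawn workers underneath the application's own task pool (`my_app` below is a hypothetical binary standing in for the application):

```shell
# Force each OpenBLAS call to run single-threaded; the pthread build's
# internal locking then makes concurrent calls from app threads safe.
export OPENBLAS_NUM_THREADS=1
# ./my_app           # hypothetical binary; launch the app as usual
```

The same cap can be set programmatically with `openblas_set_num_threads(1)` before the first BLAS call.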

robertjharrison commented 4 years ago

Thanks Martin ... I will look at interface/gemm.c and the x86_64 sgemm_kernel_direct and see what gap needs filling for ARM64.