Multi-threaded DGEMM becomes less efficient on many-core CPUs

OpenBLAS DGEMM achieves high single-thread efficiency, for example over 90% of peak performance with 1 thread on Graviton3E, but efficiency drops to about 73% when DGEMM runs with 64 threads. It is well known that sustaining high efficiency in multi-threaded execution is becoming difficult on recent many-core CPUs, even when high-performance kernels are implemented for single-thread execution.
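For reference, the efficiency figures above are the usual ratio of achieved to peak floating-point throughput. A minimal sketch of that arithmetic follows; the function name `dgemm_efficiency` and the `peak_gflops_per_core` parameter are hypothetical, and the per-core peak must be supplied for the actual machine (no Graviton3E value is assumed here).

```python
def dgemm_efficiency(m, n, k, seconds, threads, peak_gflops_per_core):
    """Return achieved / peak throughput for one DGEMM call.

    m, n, k: matrix dimensions of C(m x n) += A(m x k) * B(k x n)
    seconds: measured wall-clock time of the call
    threads: number of threads (assumed one per core)
    peak_gflops_per_core: theoretical double-precision peak of one core
                          (an input, not a measured or documented value)
    """
    flops = 2.0 * m * n * k              # one multiply and one add per inner step
    achieved_gflops = flops / seconds / 1e9
    peak_gflops = threads * peak_gflops_per_core
    return achieved_gflops / peak_gflops
```

Comparing this ratio at 1 thread and at 64 threads for the same problem size gives the drop described above.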

I am considering adjusting the shape of the submatrix handled by each thread by modifying the 2D thread distribution. I would appreciate any suggestions you may have.
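To make the idea concrete, here is a rough sketch of one way a 2D thread grid could be chosen. It is not how OpenBLAS currently partitions work; the function names and the cost model are assumptions for illustration. Each thread is assumed to own a `ceil(M/pm) x ceil(N/pn)` tile of C, and the cost is the amount of A and B that thread has to read, `K * (tile_m + tile_n)`.

```python
import math

def divisor_pairs(nthreads):
    """All (pm, pn) grids with pm * pn == nthreads."""
    return [(pm, nthreads // pm)
            for pm in range(1, nthreads + 1)
            if nthreads % pm == 0]

def tile_traffic(m, n, k, pm, pn):
    """Elements of A and B read by one thread under the assumed cost model."""
    tile_m = math.ceil(m / pm)   # rows of C (and of A) owned by one thread
    tile_n = math.ceil(n / pn)   # columns of C (and of B) owned by one thread
    return k * (tile_m + tile_n)

def choose_grid(m, n, k, nthreads):
    """Pick the thread grid that minimizes per-thread input traffic."""
    return min(divisor_pairs(nthreads),
               key=lambda grid: tile_traffic(m, n, k, *grid))

if __name__ == "__main__":
    print(choose_grid(8192, 8192, 8192, 64))    # square problem
    print(choose_grid(16384, 1024, 1024, 64))   # tall-and-skinny C
```

Under this model, a square problem favors a square grid, while a tall C favors more threads along M. A real implementation would also need to account for cache blocking, kernel tile sizes, and NUMA placement, none of which are modeled here.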

graph

OpenMathLib / OpenBLAS, issue #4644