OpenBLAS DGEMM is highly efficient with a single thread, for example reaching over 90% of peak performance on Graviton3E, but the efficiency drops to about 73% when DGEMM runs with 64 threads.
It is well known that sustaining high efficiency in multi-threaded execution is becoming difficult on recent many-core CPUs, even when the single-thread kernels are highly optimized.
I am considering adjusting the shape of the submatrix handled by each thread by modifying the 2D thread distribution.
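To make the idea concrete, here is a minimal sketch of one way such a 2D distribution could be chosen. It is not OpenBLAS's actual partitioning code: `thread_grid` and `choose_thread_grid` are hypothetical names, and the cost model is an assumption. It only counts that a thread computing an (m/tm) x (n/tn) block of C reads an (m/tm) x k panel of A and a k x (n/tn) panel of B, so for a fixed k the input traffic per thread scales with m/tm + n/tn. Cache blocking, shared packing buffers, and NUMA effects are ignored.

```c
#include <stddef.h>

/* Hypothetical type: a tm x tn grid of threads over the M x N output matrix. */
typedef struct {
    int tm; /* number of thread rows (splits of M) */
    int tn; /* number of thread columns (splits of N) */
} thread_grid;

/*
 * Hypothetical helper: among all exact factorizations tm * tn == nthreads,
 * pick the one that minimizes m/tm + n/tn, i.e. the per-thread A and B
 * panel traffic for a fixed k under the simplified model described above.
 * For a fixed block area this favors blocks that are close to square.
 */
thread_grid choose_thread_grid(size_t m, size_t n, int nthreads)
{
    thread_grid best = { 1, 1 };
    double best_cost = -1.0;

    if (nthreads < 1)
        nthreads = 1;

    for (int tm = 1; tm <= nthreads; tm++) {
        if (nthreads % tm != 0)
            continue; /* only consider grids that use every thread */

        int tn = nthreads / tm;
        double cost = (double)m / tm + (double)n / tn;

        /* Strict comparison: on ties the smaller tm is kept. */
        if (best_cost < 0.0 || cost < best_cost) {
            best_cost = cost;
            best.tm = tm;
            best.tn = tn;
        }
    }
    return best;
}
```

Under this model a square problem with 64 threads gets an 8 x 8 grid, while a problem with M four times larger than N gets a 16 x 4 grid, so each thread's block of C stays roughly square. Whether this actually helps on Graviton3E would have to be measured.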
I would appreciate any suggestions you may have.