OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Multi-threaded DGEMM becomes less efficient on many-core CPUs #4644

Open yamazakimitsufumi opened 4 months ago

yamazakimitsufumi commented 4 months ago

OpenBLAS DGEMM achieves high efficiency with a single thread, for example over 90% of peak performance on Graviton3E, but the efficiency drops to about 73% when running DGEMM with 64 threads. It is well known that maintaining high efficiency in multi-threaded execution is becoming difficult on recent many-core CPUs, even when high-performance kernels are implemented for single-thread execution.

I am considering adjusting the shape of the submatrix handled by each thread by modifying the 2D thread distribution. I would appreciate any suggestions you may have.
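To make the idea concrete, here is a minimal sketch of how a 2D thread grid could be chosen. It is not OpenBLAS's actual partitioning code. The names `grid2d` and `choose_grid` are hypothetical, and the cost model is an assumption: in a `threads_m x threads_n` grid each thread computes an `(m / threads_m) x (n / threads_n)` block of C, so it reads a panel of A and a panel of B whose combined size is proportional to `m / threads_m + n / threads_n`.

```c
/* Hypothetical illustration, not part of OpenBLAS. */
typedef struct {
    int threads_m; /* number of thread rows (splits of M) */
    int threads_n; /* number of thread columns (splits of N) */
} grid2d;

/*
 * Pick the factorization threads_m * threads_n == nthreads that minimizes
 * the assumed per-thread input footprint m/threads_m + n/threads_n.
 * K is common to both panels, so it does not affect the choice.
 */
grid2d choose_grid(long m, long n, int nthreads)
{
    grid2d best = {1, nthreads};
    double best_cost = -1.0;

    for (int tm = 1; tm <= nthreads; tm++) {
        if (nthreads % tm != 0)
            continue; /* only exact factorizations use every thread */

        int tn = nthreads / tm;
        double cost = (double)m / tm + (double)n / tn;

        if (best_cost < 0.0 || cost < best_cost) {
            best_cost = cost;
            best.threads_m = tm;
            best.threads_n = tn;
        }
    }
    return best;
}
```

Under this model, `choose_grid(8192, 8192, 64)` selects an 8 x 8 grid, while a tall problem such as `choose_grid(16384, 4096, 64)` selects 16 x 4, so each thread's block stays close to square. A real implementation would also need to account for cache sizes, kernel blocking factors, and NUMA layout, which this sketch ignores.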

graph

brada4 commented 4 months ago

8 MB matrices are probably an order of magnitude smaller than the caches.