OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Parameter adjustment of [CZ]GEMM_DEFAULT_[PQ] for Neoverse V1 #4742

Open tetsuzo-usui opened 4 months ago

tetsuzo-usui commented 4 months ago

Hello. In the previous pull request #4381, the P and Q parameters of [SD]GEMM were increased to make better use of the L2 cache of Neoverse V1, but the complex [CZ]GEMM parameters remained unchanged. I tried adjusting them in the hope of a similar performance improvement.

[CZ]GEMM_DEFAULT_P is set to 120, since the [SD]GEMM_DEFAULT_P value of 240 real elements corresponds to 120 complex elements. [CZ]GEMM_DEFAULT_Q is set to 320 for double precision and 640 for single precision, which gives blocking similar to that of the real routines in terms of cache usage. The graph below shows the improvement with 64 threads:

[Figure: OpenBLAS_ZGEMM_PQparam]
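The parameter choice described above can be checked with quick arithmetic. The sketch below is only an illustration: it assumes that the cache-relevant quantity is the size in bytes of a P x Q block of elements, and it assumes that the real routines use Q = 640 (single) and Q = 320 (double), which the post implies but does not state. The helper `block_bytes` and the dictionaries are hypothetical names, not part of OpenBLAS.

```python
# Bytes per element for each GEMM flavour (float, double, and their complex pairs).
ELEM_BYTES = {"S": 4, "D": 8, "C": 8, "Z": 16}


def block_bytes(p, q, precision):
    """Size in bytes of a P x Q block of elements of the given precision."""
    return p * q * ELEM_BYTES[precision]


# Proposed complex values, taken from the post.
proposed_complex = {"C": (120, 640), "Z": (120, 320)}

# Real values: P = 240 is from the post; the Q values are an ASSUMPTION
# made here so that the comparison can be carried out.
assumed_real = {"S": (240, 640), "D": (240, 320)}

footprints = {
    prec: block_bytes(p, q, prec)
    for prec, (p, q) in {**proposed_complex, **assumed_real}.items()
}

for prec, size in footprints.items():
    print(f"{prec}GEMM: P x Q block = {size} bytes ({size / 1024:.0f} KiB)")
```

Under these assumptions all four variants end up with the same block size, which is one way to read "similar blocking in terms of cache usage".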

However, performance degradation of xTRMM has been observed as a side effect of this parameter change.

[Figure: OpenBLAS_ZTRMM_PQparam]

My analysis is as follows. The TRMM computation is carried out internally by two kernels, GEMM_KERNEL and TRMM_KERNEL. The change in blocking reduces the amount of work done by GEMM_KERNEL and increases the amount done by the less efficient TRMM_KERNEL. As a result, the overall computation becomes less efficient.
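A simplified model can illustrate why larger blocks shift work toward the triangular kernel. The sketch below is not OpenBLAS's actual TRMM driver: it assumes an n x n lower-triangular matrix tiled into b x b blocks, with the diagonal blocks standing in for the TRMM_KERNEL part and the blocks below the diagonal for the GEMM_KERNEL part, and it uses element counts as a proxy for work. The function name and the block sizes 128 and 320 are hypothetical.

```python
def trmm_kernel_share(n, b):
    """Fraction of lower-triangle elements that fall inside diagonal blocks
    when an n x n triangular matrix is tiled into b x b blocks."""
    total = n * (n + 1) // 2          # elements in the whole lower triangle
    diag = 0
    for start in range(0, n, b):
        w = min(b, n - start)         # last diagonal block may be smaller
        diag += w * (w + 1) // 2      # triangular part of this diagonal block
    return diag / total


# Compare a smaller and a larger (hypothetical) block size for one matrix size.
for b in (128, 320):
    share = trmm_kernel_share(2048, b)
    print(f"block size {b}: diagonal-block share = {share:.3f}")
```

In this model the diagonal-block share grows roughly in proportion to b / n, so increasing the block size moves a larger part of the work to the triangular kernel, most noticeably for small and medium matrix sizes.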

If you agree that improving the performance of TRMM_KERNEL can be treated as a separate issue to be addressed in the future, the [CZ]GEMM_DEFAULT_[PQ] parameters can be changed in advance. In that case, please let me know and I will fix the parameters.

martin-frbg commented 4 months ago

I'm a bit worried that the performance loss in TRMM appears to be proportionally greater than the gain achieved in GEMM, as far as I can make out from your graphs. Or am I mistaken?