I implemented and ran the `gemm_1CU` example from here on a U50 card. The output reports `PerfKernelTops`, from which the kernel GOPS can simply be computed as `PerfKernelTops * 1000`. I ran this example for various square matrix sizes and plotted the GOPS on the vertical axis against the matrix size on the horizontal axis. The resulting performance profile exhibits weird behavior, as seen below: the performance drops significantly for matrix sizes that are multiples of 4096.
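For reference, this is roughly the arithmetic behind that conversion (my own sketch, not code from the example): for square GeMM the op count is 2*N^3, dividing by the kernel time gives TOPS, and GOPS is then just TOPS * 1000. The matrix size and kernel time below are placeholder values, not measurements.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Square GeMM: C = A * B with m = k = n (placeholder size).
    const std::uint64_t n = 4096;

    // Hypothetical measured kernel time; in practice this comes from the example's output.
    const double kernelTimeSeconds = 0.05;
    const double totalOps = 2.0 * n * n * n;            // one multiply + one add per MAC

    const double perfKernelTops = totalOps / kernelTimeSeconds / 1e12;
    const double gops = perfKernelTops * 1000.0;         // TOPS -> GOPS

    std::cout << "PerfKernelTops = " << perfKernelTops
              << ", GOPS = " << gops << "\n";
    return 0;
}
```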
I tried various data types (float, int32, int16), two FPGA cards (U50 and U280), and both HBM and DDR memory interfaces, and the results are fairly consistent for each memory interface. That is, for U50 HBM and U280 HBM the performance drops occur at matrix sizes that are multiples of 4096, while for U280 DDR they occur at other matrix sizes.
Are there any guesses as to why this phenomenon occurs? The kernel loads submatrices out of the large input matrices and performs GeMM on them, so I would not expect the per-tile work to depend on the overall matrix size.
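To make the access pattern I mean concrete, here is a hand-written illustration of tiled GeMM (not the actual `gemm_1CU` kernel; the tile size is an assumption for illustration only). Each tile involves the same amount of compute no matter how large N is, which is why the drops surprised me.

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

constexpr std::size_t BLK = 64;  // hypothetical tile edge length

// C (n x n) += A (n x n) * B (n x n), all row-major, processed tile by tile.
void blockedGemm(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, std::size_t n) {
    for (std::size_t bi = 0; bi < n; bi += BLK)
        for (std::size_t bj = 0; bj < n; bj += BLK)
            for (std::size_t bk = 0; bk < n; bk += BLK)
                // Multiply one BLK x BLK tile of A by one tile of B.
                // Note: consecutive rows of a tile are still n elements apart
                // in memory, so the stride between row reads depends on n even
                // though the per-tile compute does not.
                for (std::size_t i = bi; i < std::min(bi + BLK, n); ++i)
                    for (std::size_t k = bk; k < std::min(bk + BLK, n); ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = bj; j < std::min(bj + BLK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```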