I implemented and ran the `gemm_1CU` example from here on a U50 card. The output reports `PerfKernelTops`, from which the kernel GOPS can simply be computed as `PerfKernelTops * 1000`. I ran this example for various square matrix sizes and plotted the GOPS on the vertical axis against the matrix size on the horizontal axis. The resulting performance profile exhibits weird behavior, as seen below: the performance drops significantly for matrix sizes that are multiples of 4096.
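For reference, this is roughly the arithmetic behind that conversion (my own sketch, not code from the example): for square GeMM the op count is 2*N^3, dividing by the kernel time gives TOPS, and GOPS is then just TOPS * 1000. The matrix size and kernel time below are placeholder values, not measurements.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Square GeMM: C = A * B with m = k = n (placeholder size).
    const std::uint64_t n = 4096;

    // Hypothetical measured kernel time; in practice this comes from the example's output.
    const double kernelTimeSeconds = 0.05;
    const double totalOps = 2.0 * n * n * n;            // one multiply + one add per MAC

    const double perfKernelTops = totalOps / kernelTimeSeconds / 1e12;
    const double gops = perfKernelTops * 1000.0;         // TOPS -> GOPS

    std::cout << "PerfKernelTops = " << perfKernelTops
              << ", GOPS = " << gops << "\n";
    return 0;
}
```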
I tried various data types (float, int32, int16), two FPGA cards (U50 and U280), and both HBM and DDR memory interfaces, and the results are fairly consistent for each memory interface. That is, for U50 HBM and U280 HBM the performance drops occur at matrix sizes that are multiples of 4096, while for U280 DDR they occur at other matrix sizes.
Are there any guesses as to why this phenomenon occurs? The kernel loads submatrices out of the large input matrices and performs GeMM on them, so I would not expect the per-tile work to depend on the overall matrix size.
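To make the access pattern I mean concrete, here is a hand-written illustration of tiled GeMM (not the actual `gemm_1CU` kernel; the tile size is an assumption for illustration only). Each tile involves the same amount of compute no matter how large N is, which is why the drops surprised me.

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

constexpr std::size_t BLK = 64;  // hypothetical tile edge length

// C (n x n) += A (n x n) * B (n x n), all row-major, processed tile by tile.
void blockedGemm(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, std::size_t n) {
    for (std::size_t bi = 0; bi < n; bi += BLK)
        for (std::size_t bj = 0; bj < n; bj += BLK)
            for (std::size_t bk = 0; bk < n; bk += BLK)
                // Multiply one BLK x BLK tile of A by one tile of B.
                // Note: consecutive rows of a tile are still n elements apart
                // in memory, so the stride between row reads depends on n even
                // though the per-tile compute does not.
                for (std::size_t i = bi; i < std::min(bi + BLK, n); ++i)
                    for (std::size_t k = bk; k < std::min(bk + BLK, n); ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = bj; j < std::min(bj + BLK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```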