```
cd benchmarks/DeepLearning/Ops/GEMM
make gemm-affine  # builds an IJK loop order case
# cd into your build path and run
cmake --build .. --target clean && cmake --build .. && ./gemm-benchmark
```
Then you'll get some output like this:
```
2022-06-29T14:57:30+08:00
Running ./gemm-benchmark
Run on (6 X 3901.13 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
Load Average: 1.58, 0.86, 0.79
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------
Benchmark           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------
BM_GEMM/50     242182 ns       242091 ns         2874 GFLOPS=1.03681
BM_GEMM/100   2113227 ns      2112343 ns          331 GFLOPS=0.949723
BM_GEMM/150   7136396 ns      7133468 ns           98 GFLOPS=0.949057
BM_GEMM/200  16976603 ns     16969488 ns           41 GFLOPS=0.932451
... skipped ...
```
If you want to add tiling:

```
make gemm-affine && buddy-opt opt_gemm.mlir --affine-loop-tile=tile-sizes=96,96,96 -o opt_gemm.mlir
```
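The transformation that `--affine-loop-tile` applies to the loop nest can be pictured in plain C. This is only a hand-written sketch of 96×96×96 tiling of a row-major GEMM, not the code the pass actually generates:

```c
#include <stddef.h>

#define T 96 /* tile size, matching tile-sizes=96,96,96 */

/* Tiled GEMM sketch: C += A * B for n x n row-major matrices.
 * The three outer loops walk tiles; the inner loops stay within one
 * tile so the working set is more likely to fit in cache. */
void gemm_tiled(size_t n, const float *A, const float *B, float *C) {
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t jj = 0; jj < n; jj += T)
            for (size_t kk = 0; kk < n; kk += T)
                for (size_t i = ii; i < ii + T && i < n; ++i)
                    for (size_t j = jj; j < jj + T && j < n; ++j) {
                        float acc = C[i * n + j];
                        for (size_t k = kk; k < kk + T && k < n; ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}
```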
How-To

In order to run a benchmark, follow the build and run steps above. The tiling command writes its result back to opt_gemm.mlir.

To change the loop order into, e.g., JPI or PIJ, use the --loop-order-change option of buddy-opt. For more information about --loop-order-change, see here.

Experiments Result
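To make the loop-order names concrete: the three GEMM loops can be nested in any order without changing the result, only the memory access pattern. The following C sketch (my own illustration, assuming p names the reduction dimension that IJK calls k) contrasts the IJK baseline with the JPI order discussed below:

```c
#include <stddef.h>

/* IJK order: the reduction loop (k) is innermost; each inner-loop
 * iteration reads A row-wise and B column-wise. */
void gemm_ijk(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* JPI order (p = the same reduction dimension): only the nesting
 * changes, so the result is identical, but each operand is now walked
 * with a different stride pattern, which is what moves performance. */
void gemm_jpi(size_t n, const float *A, const float *B, float *C) {
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < n; ++p)
            for (size_t i = 0; i < n; ++i)
                C[i * n + j] += A[i * n + p] * B[p * n + j];
}
```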
After some experiments (i5-8400 @ 3.8 GHz, L1 32 KiB, L2 256 KiB, L3 9216 KiB), this chart shows how loop order and tiling strategy affect performance:

More clearly, here is the performance data for 6 different loop orders:

And if we add tiling to JPI, we get around a 9x improvement at larger data sizes:

For now, I think it's time to add explicit copy and packing for higher performance.
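"Explicit copy and packing" usually means copying each tile of an operand into a contiguous scratch buffer before the inner loops run, so those loops stream through memory with unit stride regardless of the original layout. A hypothetical sketch (the tile size and function name are my own, not from the benchmark):

```c
#include <stddef.h>
#include <string.h>

#define TB 4 /* illustrative tile size; a real kernel would use e.g. 96 */

/* Copy a TB x TB tile of row-major B (leading dimension n, tile origin
 * at row p0 / column j0) into a contiguous packed buffer. The inner
 * GEMM loops can then read `packed` sequentially. */
void pack_tile(size_t n, const float *B, size_t p0, size_t j0,
               float *packed) {
    for (size_t p = 0; p < TB; ++p)
        memcpy(&packed[p * TB], &B[(p0 + p) * n + j0],
               TB * sizeof(float));
}
```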
Experiments Data