buddy-compiler / buddy-benchmark

Benchmark Framework for Buddy Projects
Apache License 2.0

[DeepLearning/Ops] Add Batch Matmul Benchmark with CMake Integration to DeepLearning Ops #133

SamanthaWangdl closed this pull request 1 month ago

SamanthaWangdl commented 1 month ago


Changes
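For reference, the computation the benchmark measures is a per-batch matrix product, C[b] += A[b] * B[b]. A minimal scalar reference implementation (a hypothetical helper for checking results, not the benchmark's actual code) might look like:

```cpp
#include <cassert>
#include <vector>

// Scalar batch matmul: A is [batch, M, K], B is [batch, K, N],
// C is [batch, M, N]; all stored row-major in flat vectors.
void batchMatmul(const std::vector<float> &A, const std::vector<float> &B,
                 std::vector<float> &C, int batch, int M, int N, int K) {
  for (int b = 0; b < batch; ++b)
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j) {
        float acc = C[(b * M + i) * N + j]; // accumulate into C
        for (int k = 0; k < K; ++k)
          acc += A[(b * M + i) * K + k] * B[(b * K + k) * N + j];
        C[(b * M + i) * N + j] = acc;
      }
}
```

For example, with batch=1, A=[[1,2],[3,4]], B=[[5,6],[7,8]] and C initialized to zero, the result is [[19,22],[43,50]].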

xlinsist commented 1 month ago

Hi, I have a question regarding the optimization strategies: is the difference between scalar and auto_vectorization due to llc being run with -O0 vs. -O3, or due to their different lowering passes?

If the intention is to compare the effect of llc with O0 and O3, you could name them explicitly (e.g. scalar-llc-O0 and scalar-llc). If it is about comparing lowering passes, it seems to me that the passes for auto-vectorization do not actually include vectorization, since the pipeline uses -convert-linalg-to-loops, which lowers the op directly into scalar loops.
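To illustrate why -convert-linalg-to-loops yields a scalar version: it rewrites the named op into a plain loop nest of scalar loads and stores. A sketch (shapes and constant names chosen for illustration, not taken from this PR):

```mlir
// Input: a named linalg op on memrefs.
linalg.batch_matmul
    ins(%A, %B : memref<2x3x4xf32>, memref<2x4x5xf32>)
    outs(%C : memref<2x3x5xf32>)

// After -convert-linalg-to-loops: a scalar scf loop nest, roughly:
scf.for %b = %c0 to %c2 step %c1 {
  scf.for %i = %c0 to %c3 step %c1 {
    scf.for %j = %c0 to %c5 step %c1 {
      scf.for %k = %c0 to %c4 step %c1 {
        %a  = memref.load %A[%b, %i, %k] : memref<2x3x4xf32>
        %bv = memref.load %B[%b, %k, %j] : memref<2x4x5xf32>
        %cv = memref.load %C[%b, %i, %j] : memref<2x3x5xf32>
        %m  = arith.mulf %a, %bv : f32
        %s  = arith.addf %cv, %m : f32
        memref.store %s, %C[%b, %i, %j] : memref<2x3x5xf32>
      }
    }
  }
}
```

No vector dialect ops appear in this output, so llc's auto-vectorizer is the only thing that could vectorize it afterwards.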

To generate an auto-vectorized version, you can try integrating the existing buddy-mlir pass batchmatmul-optimize. Maybe you can try this lowering path:

      --linalg-bufferize
      --batchmatmul-optimize
      --convert-linalg-to-loops
      --func-bufferize
      --arith-bufferize
      --tensor-bufferize
      --finalizing-bufferize
      --lower-affine
      --convert-scf-to-cf
      --expand-strided-metadata
      --convert-vector-to-llvm
      --memref-expand
      --arith-expand
      --convert-arith-to-llvm
      --finalize-memref-to-llvm
      --convert-math-to-llvm
      --llvm-request-c-wrappers
      --convert-func-to-llvm
      --reconcile-unrealized-casts
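The passes above can be driven end-to-end roughly like this (a sketch, assuming buddy-opt, mlir-translate, and llc are built and on PATH; file names are placeholders):

```shell
# Illustrative command line only; adjust file names and tool paths.
buddy-opt batch_matmul.mlir \
    --linalg-bufferize \
    --batchmatmul-optimize \
    --convert-linalg-to-loops \
    --func-bufferize \
    --arith-bufferize \
    --tensor-bufferize \
    --finalizing-bufferize \
    --lower-affine \
    --convert-scf-to-cf \
    --expand-strided-metadata \
    --convert-vector-to-llvm \
    --memref-expand \
    --arith-expand \
    --convert-arith-to-llvm \
    --finalize-memref-to-llvm \
    --convert-math-to-llvm \
    --llvm-request-c-wrappers \
    --convert-func-to-llvm \
    --reconcile-unrealized-casts |
  mlir-translate --mlir-to-llvmir |
  llc -O3 -filetype=obj -o batch_matmul_vec.o
```

Note that --convert-vector-to-llvm is placed after batchmatmul-optimize so that the vector ops the optimization pass introduces are lowered rather than left behind.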

The other code parts look good to me, thanks!