Open aidansander opened 4 months ago
Hi @aidansander,
could you also provide the lut_based_ops.h header to be able to reproduce your results?
Sure thing. lut_based_ops.h and lut_based_ops.cpp (which holds the LUT values) are both included from here. I'm running the kernel using llvm-lit on some tests. The manually pipelined and loop pragma tests have the steps I used to run and measure the cycle count.
Thanks!
One thing to consider: LoopUnroll runs before MachinePipeliner, i.e. if you completely unroll the loop (through #pragma unroll), there is nothing left for the pipeliner to do.
There are a few things we are still lacking though to get this software pipelined:
With these tweaks, I could get a SWP loop with 23 cycles. It should be possible to further improve on that. FYI @gbossu @martien-de-jong @andcarminati
Update: 1. & 2. have been resolved. Last point - MachineMemOperands for VLDB.4x instructions - still needs to be done.
I'm compiling a simple kernel using peano. Manually software pipelining the attached kernel (dut_pipelined.cc) yields a considerable speedup compared to using pipelining pragmas (dut_pragma.cc). Without manual pipelining, the produced assembly is not pipelined and the kernel runs in ~1800 cycles. With manual pipelining, the kernel runs in ~1000 cycles. The clang loop min_iteration_count and max_iteration_count pragmas have no effect on the produced assembly.
dut_pragma.cc dut_pipelined.cc
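For readers without the attachments, here is a minimal sketch of what "manually software pipelining" a LUT loop can look like. It is not the code from dut_pipelined.cc: the function names, the 256-entry table, and the scalar types are all hypothetical stand-ins, and the real kernel uses AIE vector operations. The idea is only that the index load for iteration i+1 is issued while the lookup and store for iteration i are still in flight, with a prologue and epilogue around the steady-state loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in table; the real values live in lut_based_ops.cpp.
static std::array<int32_t, 256> make_lut() {
    std::array<int32_t, 256> lut{};
    for (std::size_t i = 0; i < lut.size(); ++i) {
        lut[i] = static_cast<int32_t>(i * i);
    }
    return lut;
}

// Plain loop: index load, lookup, and store all belong to the same iteration.
void lookup_plain(const uint8_t* in, int32_t* out, std::size_t n,
                  const std::array<int32_t, 256>& lut) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = lut[in[i]];
    }
}

// Manually pipelined loop (two stages): the index load for iteration i+1
// is placed next to the lookup/store for iteration i.
void lookup_pipelined(const uint8_t* in, int32_t* out, std::size_t n,
                      const std::array<int32_t, 256>& lut) {
    if (n == 0) return;
    uint8_t idx = in[0];                  // prologue: stage 1 of iteration 0
    for (std::size_t i = 0; i + 1 < n; ++i) {
        uint8_t next = in[i + 1];         // stage 1 of iteration i+1
        out[i] = lut[idx];                // stage 2 of iteration i
        idx = next;
    }
    out[n - 1] = lut[idx];                // epilogue: stage 2 of last iteration
}

// Sanity check that both variants produce identical results.
void self_check() {
    const std::array<int32_t, 256> lut = make_lut();
    const uint8_t in[4] = {1, 2, 200, 2};
    int32_t a[4] = {0, 0, 0, 0};
    int32_t b[4] = {0, 0, 0, 0};
    lookup_plain(in, a, 4, lut);
    lookup_pipelined(in, b, 4, lut);
    for (std::size_t i = 0; i < 4; ++i) {
        assert(a[i] == b[i]);
    }
}
```

Whether this restructuring helps depends on the target and the scheduler. On a compiler that pipelines the plain loop itself, the two versions should end up with similar schedules.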