lukstafi closed this 5 months ago
But also benchmark whether inlining per-device pointers gives gccjit any optimization opportunity.
I don't plan to work on this further for now. It could be fixed by making merges parametric with respect to sources as well as destinations.
Closing this tentatively; it was probably affected by the now-fixed issue with unlimited inlining.
First check the contributions of shape inference (which should add nothing extra), jitting, and external gccjit calls. Most likely it's gccjit. That would be solvable: compile once and pass in per-device pointers. Note that cudajit already takes pointers as input.
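
To illustrate the "compile once, pass in per-device pointers" idea, here is a minimal sketch in plain C. It is not the project's code and does not use the gccjit API. The names `scale_in_place` and `run_on_all_devices`, the device count, and the buffer length are all hypothetical. The point is only that the routine takes the buffer address as a parameter, so one compiled copy can serve every device, rather than baking a different address into a separately compiled copy per device.

```c
#include <stddef.h>

/* Hypothetical stand-in for a routine that a JIT would compile a single time.
   The buffer address is a parameter, not a constant inlined into the code. */
static void scale_in_place(float *buf, size_t n, float factor) {
    for (size_t i = 0; i < n; i++) {
        buf[i] *= factor;
    }
}

/* Assumed sizes, chosen only for illustration. */
enum { NUM_DEVICES = 3, LEN = 4 };

/* Fills one buffer per (hypothetical) device, runs the same routine on each
   by passing that device's pointer, and returns the sum of all elements. */
static float run_on_all_devices(void) {
    float buffers[NUM_DEVICES][LEN];
    float total = 0.0f;
    for (int d = 0; d < NUM_DEVICES; d++) {
        for (int i = 0; i < LEN; i++) {
            buffers[d][i] = (float)(d + 1);
        }
        /* Same code for every device; only the pointer argument differs. */
        scale_in_place(buffers[d], LEN, 2.0f);
        for (int i = 0; i < LEN; i++) {
            total += buffers[d][i];
        }
    }
    return total;
}
```

The trade-off is the one raised above: a pointer passed as a parameter may hide the address from the optimizer, which is why it is worth benchmarking against the inlined-pointer variant.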