ahrefs / ocannl

OCANNL: OCaml Compiles Algorithms for Neural Networks Learning
BSD 2-Clause "Simplified" License

`moons_benchmark` gccjit parallel startup time is very long (linear with num of devices?) #239

Closed: lukstafi closed this 5 months ago

lukstafi commented 6 months ago

First, check how much of the startup time comes from shape inference (which should contribute nothing extra), from jitting, and from the external gccjit calls. Most likely it is gccjit. That would be solvable: compile once and pass in per-device pointers. Note that cudajit already takes pointers as input.
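To make the "compile once, pass in per-device pointers" idea concrete, here is a rough, language-neutral sketch in Python. It is not OCANNL code and does not call gccjit: `CountingCompiler`, `run_specialized`, and `run_parametric` are hypothetical names, a "compilation" is just a counter increment, and Python lists stand in for per-device buffers.

```python
from typing import Callable, List

Buffer = List[float]


class CountingCompiler:
    """Hypothetical stand-in for a JIT backend: it only counts compilations."""

    def __init__(self) -> None:
        self.compilations = 0

    def compile_specialized(self, buf: Buffer) -> Callable[[], None]:
        # The buffer is baked into the routine, so every device needs its own compile.
        self.compilations += 1

        def routine() -> None:
            for i in range(len(buf)):
                buf[i] *= 2.0

        return routine

    def compile_parametric(self) -> Callable[[Buffer], None]:
        # The buffer is a call-time argument, so a single compile serves all devices.
        self.compilations += 1

        def routine(buf: Buffer) -> None:
            for i in range(len(buf)):
                buf[i] *= 2.0

        return routine


def run_specialized(compiler: CountingCompiler, device_buffers: List[Buffer]) -> None:
    # One compilation per device: startup cost grows linearly with the device count.
    routines = [compiler.compile_specialized(b) for b in device_buffers]
    for routine in routines:
        routine()


def run_parametric(compiler: CountingCompiler, device_buffers: List[Buffer]) -> None:
    # One compilation in total, then one call per device with that device's buffer.
    routine = compiler.compile_parametric()
    for b in device_buffers:
        routine(b)
```

Under these assumptions, `run_specialized` performs as many compilations as there are devices, while `run_parametric` performs one regardless of the device count. Whether the real startup time follows this pattern is exactly what the measurement above is meant to establish.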

lukstafi commented 6 months ago

But also benchmark whether inlining the per-device pointers gives gccjit any optimization opportunities.

lukstafi commented 5 months ago

I don't plan to work on this further for now. It could be fixed by making merges parametric with respect to sources as well as destinations.
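A minimal sketch of what "parametric with respect to sources as well as destinations" could mean, reading a merge as "accumulate one device's buffer into another's". That reading is an assumption, and `merge_add`, `routines_needed`, and `reduce_into_first` are hypothetical names rather than OCANNL functions.

```python
from typing import List

Buffer = List[float]


def merge_add(dst: Buffer, src: Buffer) -> None:
    """Accumulate src into dst element-wise; both buffers are call-time arguments."""
    for i, value in enumerate(src):
        dst[i] += value


def routines_needed(num_devices: int, parametric: bool) -> int:
    """Count the merge routines to build under each (hypothetical) scheme."""
    if parametric:
        # One routine, reused for every (source, destination) pair.
        return 1
    # If both endpoints were baked in: one routine per ordered pair of distinct devices.
    return num_devices * (num_devices - 1)


def reduce_into_first(buffers: List[Buffer]) -> None:
    """Usage example: merge every other device's buffer into device 0's buffer."""
    for src in buffers[1:]:
        merge_add(buffers[0], src)
```

With four devices, the fully specialized scheme in this sketch would need 12 routines and the parametric one a single routine. The actual count in OCANNL depends on which device pairs it really merges, which this thread does not state.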

lukstafi commented 5 months ago

Closing this tentatively: it was probably affected by the now-fixed issue with unlimited inlining.