Closed dsharlet closed 10 months ago
Just superficial comments. I need more cycles to remember how z order traversal works in ND.
I think the easiest way to understand it is that it's a depth first traversal of an N-dimensional generalization of a quadtree/octree. A lot of things (e.g. the wikipedia article) talk about it as a bit manipulation of the indices, I think that approach is a lot harder to implement, especially if you want to efficiently prune the parts of the tree that are out of bounds.
This PR adds a helper to do Z order traversals of multi-dimensional iterator ranges, including the results of
split
/split<>
. This PR uses the new helpers to get a 20-25% speedup of matrix multiply. The inner loop of the matrix multiply using the z order traversal looks great in the profiler (memory stalls are much smaller).This PR adds a BLAS benchmark (use
BLAS=1 make ...
to enable it). At first, I thought this z ordered multiply was beating OpenBLAS by ~10%. However, it seems that OpenBLAS was parallelizing by default, and that made it slower. If I force OpenBLAS to use one thread, it gets faster, and it's now beating my multiply by ~5% :(