At the moment the L1L2 and select indexes are interleaved (stored next to each other for each L0 block). As each is >4MiB, this doesn't actually reduce the number of cache misses at all (I think), and it makes the indexing etc. much more complicated.
This should be replaced with a layout that stores the L0 index first, then all of the L1L2 parts, then all of the select parts. This would allow casting to the right slice types once at the top level and would reduce the number of repeated divisions etc. The two kinds of index should also be built separately, without the select index being built in the middle of building the rank indexes.
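
A rough sketch of what splitting the proposed layout once at the top level could look like. Everything here is an assumption for illustration: the names `Layout`, `IndexParts` and `split_index` are hypothetical, and all three parts are shown as `u64` slices even though the real parts would likely have different element types.

```rust
/// Hypothetical description of the part lengths (in elements), in storage order.
struct Layout {
    l0_len: usize,
    l1l2_len: usize,
    select_len: usize,
}

/// The three parts of the index, borrowed from one contiguous buffer.
struct IndexParts<'a> {
    l0: &'a [u64],
    l1l2: &'a [u64],
    select: &'a [u64],
}

/// Split a buffer laid out as [L0 | all L1L2 parts | all select parts].
/// Returns `None` if the buffer length does not match the layout.
fn split_index<'a>(data: &'a [u64], layout: &Layout) -> Option<IndexParts<'a>> {
    let total = layout
        .l0_len
        .checked_add(layout.l1l2_len)?
        .checked_add(layout.select_len)?;
    if data.len() != total {
        return None;
    }
    // Two splits at the top level; no per-L0-block offset arithmetic afterwards.
    let (l0, rest) = data.split_at(layout.l0_len);
    let (l1l2, select) = rest.split_at(layout.l1l2_len);
    Some(IndexParts { l0, l1l2, select })
}
```

With this shape, the rank and select code would each receive just their own slice, which is also what would make it natural to build the two kinds of index in separate passes.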