JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.68k stars 5.48k forks source link

Implementing new IndexStyle / type of index? #38284

Closed Luapulu closed 3 years ago

Luapulu commented 3 years ago

I've been working on adding a TiledIndex with a IndexTiled <: IndexStyle, for all sorts of arrays that are best indexed into via an index representing the tile and an index within that tile. See issue in TiledIteration.jl. Along the way I had a few thoughts:

  1. Should reduce operations and possibly a number of others dispatch on IndexStyle, thus allowing users to implement new index types along with efficient algorithms for those index types.
  2. Same thought goes for Base.to_index and Base.to_indices. Should these dispatch on the IndexStyle, so that indices are converted to the correct type, based on the IndexStyle. This would allow any user defined index to work with any array type as long as the relevant to_index and to_indices functions are implemented!
  3. Should we add a section in the manual on implementing new index types. It could extend the section on Arrays with custom indices

May be I've misunderstood the point of IndexStyle. If that's the case, can anyone point me to some resources on implementing custom indexing? The docs only have a section for adding offset indexing, but not for more complicated indexing.

timholy commented 3 years ago

This seems like an appropriate usage to me. I've got some deadlines so may not get to you quickly, but I don't think you're off-base to be thinking this way.

Luapulu commented 3 years ago

I've started working on some of these ideas in TiledArrays.jl.

johnnychen94 commented 3 years ago

I may have misunderstood the issue here. After https://github.com/JuliaArrays/TiledIteration.jl/pull/23 (v0.3.0), TileIteration has changed to array interface, e.g.,:

julia> A = rand(200, 50);

julia> TileIterator(axes(A), (128,8))
2×7 TileIterator{2,Tuple{TiledIteration.CoveredRange{StepRange{Int64,Int64},TiledIteration.LengthAtMost},TiledIteration.CoveredRange{StepRange{Int64,Int64},TiledIteration.LengthAtMost}}}:
 (1:128, 1:8)    (1:128, 9:16)    (1:128, 17:24)    (1:128, 25:32)    (1:128, 33:40)    (1:128, 41:48)    (1:128, 49:50)
 (129:200, 1:8)  (129:200, 9:16)  (129:200, 17:24)  (129:200, 25:32)  (129:200, 33:40)  (129:200, 41:48)  (129:200, 49:50)
Luapulu commented 3 years ago

The TiledIterator is an array to help index into another. With a ‘TiledArray’ I mean an Array that contains the actual data, that can, or even must be, indexed into tile by tile — An easy example being chunked data on disk.

The problem is, adding a new kind of index currently requires rewriting a whole lot of code. Reduce operations, broadcasting, views, iteration, sorting, reshaping, etc. have to be rewritten algorithmically, but you also have to implement a whole system of conversions between different indices, i.e. cartesian to tile, tile to cartesian, tile of of one tiling style to tile of another tiling style and this has to work for any kind of regular tiling.

This is why I thought it might be possible to hijack Base.to_indices. You’d still have to rewrite the algorithmic aspects for the new index type, but at least you wouldn’t also have to reimplement conversion between indices everywhere. Just define to_indices once and you’d be done. This also allows indexing into already existing array types with a new index.

The idea then is to also make to_indices dispatch on IndexStyle, so as to allow different arrays to have different index styles, even if they are subtypes of each other, or otherwise related in the type tree.

Luapulu commented 3 years ago

More importantly, sort, reduce, broadcast, etc. also have to dispatch on IndexStyle, for all of this to work.

johnnychen94 commented 3 years ago

So if I understand it correctly, if At is a TiledArray, then At[1] returns an array here? Does it bring any actual performance improvement compared to A[TileItertor(...)[1]] where A is the normal array?

Or you just want to store the tile metadata to normal array A, and use it to create R = TileIterator(...) so that indexing A[R[1]] is efficient? If this is the case, then a very thin array wrapper can be made for this purpose and there's no need to create a new IndexStyle.

I'm a little skeptical to mix the normal array world and the fancy block array world.

Luapulu commented 3 years ago

No, At[1] would return the first element of At, not an array. At[TileIndex(1)] would return the first tile (an array in most cases) and something like At[TileIndex(1), 1, 2, 3] would return the element (1, 2, 3) within the first tile. The user would rarely use TileIndex(1) explicitly. Instead, operations like sum or At .*= 3 would use TileIndex implicitly.

johnnychen94 commented 3 years ago

At[TileIndex(1)] would return the first tile (an array in most cases) and something like At[TileIndex(1), 1, 2, 3] would return the element (1, 2, 3) within the first tile.

It's not very clear to me why TileIterator doesn't fit the usage. Say R = TileIterator(...), then A[R[1]...] return the first tile. It's not implemented yet, but it's doable to make TileIterator with eltype CartesianIndices so that A[R[1]] works.

I still don't get why we need a new IndexStyle for this usage.

Luapulu commented 3 years ago

One example where IndexStyle come in useful: eachindex dispatches on IndexStyle to deliver an efficient iterator over all the indices in an array. For tiled arrays, that iterator should return indices tile by tile. So, something like:

eachindex(::IndexTiled, A) = ((ti, inds) for inds in eachindex(A[ti]) for ti in eachtileindex(A))

Just to clarify, I very much intend on using TileIterator. The goal is not to replace, but to build on top of.

mbauman commented 3 years ago

This is indeed different from how IndexStyle has thus far been used — it really only says whether linear or cartesian indexing is fastest, and doesn't change the order of iteration. In fact, I wouldn't be surprised if there were generic code that assumes order of iteration of an eachindex is columnar (or at least in sync with another array's eachindex).

As a concrete counter point, we don't use a TransposedStyle to change the iteration order for adjoints. Maaaaybe we could, but this gets into https://github.com/JuliaLang/julia/issues/15648 territory.

johnnychen94 commented 3 years ago

Just to clarify, I very much intend on using TileIterator. The goal is not to replace, but to build on top of.

This is exactly my worries. I'm worried that the changes to add a new IndexStyle, which will be a big change, does not bring any actual improvement on both performance and API(IIUC).

~There's no API for eachindex(::IndexStyle, A...) and~ there is no API for getindex(::IndexStyle, A, inds...), a bunch of such API changes is almost not worth doing if it only marginally improves the usage API. Not to mention that these changes would be very likely to break a lot of codes in downstream Julia packages.

Luapulu commented 3 years ago

It might not be public, but it's there https://github.com/JuliaLang/julia/blob/539f3ce943f59dec8aff3f2238b083f1b27f41e5/base/abstractarray.jl#L266

As far as performance, in the same way that remainder division makes it slow to go from linear index to cartesian index, it is slow to go from cartesian index to tile index. The reverse is faster, especially when you have a buffer that you're filling with values from a larger array. You want to fill the buffer once, do everything you need with it and then fill it with new values for the next chunk / tile / piece.

15648 is very interesting, thanks for the link.

May be hijacking IndexStyle is not the way to go, but there must (or ought to) be some way to specify how an array should be iterated over and indexed into?

mbauman commented 3 years ago

I think you could still use an IndexTiled style here... but the trick would be if you could get a cleverly fast iterator to return a TileIndex(tileid, i, j) that iterated column major. Getting that iterator to be fast in column major order would be very tough though.

Luapulu commented 3 years ago

So, punt on the iteration order part and just tackle the multiplication being faster than remainder division part using IndexStyle?

If that's the case, it isn't much of a solution for something like DIskArrays.jl because iteration order is the key issue there. Unfortunately, that's also the main thing I want to work on :(.

It might help with stuff like virtual concatenation and more complex tiling though.

Luapulu commented 3 years ago

Just found BlockArrays.jl. That might solve my concrete issues at the moment.

I still wonder if an array interface that allows for different iteration order is possible. Not sure what role IndexStyle would have in such an interface.

Luapulu commented 3 years ago

ArrayInterface.jl seems to be dealing with some of these ideas. I’ll close this, since that’s where these sorts of discussions seem to be taking place.

timholy commented 3 years ago

That seems reasonable. There's also https://github.com/timholy/ArrayIteration.jl but the parts of it that haven't already moved to Base are mostly focused on the problem of sparsity.