JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Julep: solving tricky iteration problems #15648

Closed timholy closed 5 years ago

timholy commented 8 years ago

The problem (a brief summary)

In a PR for a blog post, I described several potential new AbstractArray types and, very briefly, the iteration challenges each of them exhibits.

In the blog post, only two concrete algorithms were considered: matrix-vector multiplication and copy!. At the risk of being insufficiently general, let's explore possible APIs in terms of these two functions.

EDIT: note that the thoughts on the API are evolving as a consequence of this discussion. To jump ahead to an updated proposal, see https://github.com/JuliaLang/julia/issues/15648#issuecomment-205306513. There may be further updates even later. The material below is left for reference, but some of it is outdated.

The "raw" API

For the moment, let's worry about correctness and efficiency more than elegance. Correctness trumps all other considerations. Efficiency means that we have to be able to iterate over the matrix in cache-friendly order. That means that the array B has to be "in charge" of the sequence of values and pick its most efficient iterator.

I should emphasize that I haven't taken even the first step towards attempting to implement this API. So this is crystal-ball development, trying to imagine what might work while hoping to shake out likely problems in advance via community feedback. That caveat understood, here's a possible implementation of matrix-vector multiplication:

RB = eachindex(B)
Rv = eachindex(RB[*,:], v)
Rdest = eachindex(RB[:,*], dest)
for (idest, iB, iv) in zip(Rdest, RB, Rv)
    dest[idest] += B[iB] * v[iv]
end

eachindex(iter, obj) should perform bounds-checking, to ensure that the indexes of iter align with obj (even handling things like OffsetArrays). The notation RB[*,:] means "corresponding to the second dimension of RB"; interestingly, this expression parses without error, attempting to dispatch to getindex with argument types Tuple{typeof(RB), Function, Colon}. Hopefully there isn't another pending meaning for indexing with a Function.
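
As a quick check that this really parses and dispatches as described, here's a throwaway sketch (the Probe type is hypothetical, and struct is the current spelling of immutable):

# Hypothetical probe type: indexing it with (*, :) dispatches on
# (typeof(*), Colon), confirming that A[*, :] is legal surface syntax.
struct Probe end
Base.getindex(::Probe, ::Function, ::Colon) = "second dimension"
Probe()[*, :]   # returns "second dimension"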

Because either idest or iv is constant over the "inner" loop (depending on storage order of B), ideally one would like to hoist the corresponding access dest[idest] or v[iv]. I'd argue that's a job for the compiler, and not something we should have to bake into the API.
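
For a column-major B, the hand-hoisted version would look like the following, a sketch of what one hopes the compiler does automatically:

for j in 1:size(B, 2)
    vj = v[j]                  # v[iv] is constant over the inner loop: hoist the load
    for i in 1:size(B, 1)
        dest[i] += B[i, j] * vj
    end
end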

Because zip(X, Y) does not allow Y to see the state of X, it seems quite possible that this could not be made efficient in general (consider the case described for ReshapedArrays, where the efficiently-parametrized iterator for the first dimension depends on the state of the index for the second dimension). If this proves to be the case, we have to abandon zip and write this more like

RB = eachindex(B)
for (iB, idest, iv) in couple(RB, (RB[:,*], dest), (RB[*,:], v))
    dest[idest] += B[iB] * v[iv]
end

where couple is a new exported function.

One interesting thing about this API is that sparse multiplication might be implemented efficiently just by replacing the first line with

RB = eachstored(B)

so that only "stored" elements of B are visited. (@tkelman may have been the first to propose an eachstored iterator.) It should be pointed out that several recent discussions have addressed some of the complications that arise from Inf and NaN, and it remains to be determined whether a version that doesn't ignore these complications can be written with this (or similar) API.
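
To make the payoff concrete, here is a hand-written sketch of what eachstored-style iteration buys for SparseMatrixCSC matrix-vector multiplication, written directly against the type's actual colptr/rowval/nzval fields (the function name is mine):

using SparseArrays  # in current Julia; SparseMatrixCSC lived in Base at the time

function spmatvec!(dest, B::SparseMatrixCSC, v)
    fill!(dest, 0)
    for j = 1:size(B, 2)                       # column j pairs with v[j]
        for k = B.colptr[j]:(B.colptr[j+1]-1)  # visit only the stored entries
            dest[B.rowval[k]] += B.nzval[k] * v[j]
        end
    end
    return dest
end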

Likewise, copy! might be implemented

Rdest = eachindex(dest)
for (idest, isrc) in couple(Rdest, (Rdest, src))
    dest[idest] = src[isrc]
end

At least for the array types described here, this API seems plausibly capable of generating both correct and (fairly) efficient code. One might ideally want to write generic "tiled" implementations, but that seems beyond the scope of this initial effort.

Making it prettier with "sufficiently smart iteration"

This API is still more complicated than naive iteration. One might possibly be able to hide a lot of this with a magical @iterate macro,

@iterate B dest[i] += B[i,j]*v[j]

for matrix-vector multiplication and

@iterate dest dest[i] = src[i]

for copy!. The idea is that the @iterate macro would expand these expressions into their longer forms above. Note that the first argument to @iterate specifies which object is "in charge" of the sequence of operations. @iterate might be too limited for many things, but may be worth exploring. The KernelTools repository may have some useful tidbits.

meggart commented 8 years ago

Thanks a lot for writing the blog post and this issue, I learned a lot from it.

I have a question: You propose

    RB = eachindex(B)
    Rv = eachindex(RB[*,:], v)

If B is an Array, RB will simply be 1:length(B), which loses all the shape information of B and makes it harder to interpret the second eachindex call.

In general, if you can come up with a couple function (which I think would be great), do you need the prior eachindex call at all? Why not simply write for (iB, idest, iv) in couple(B, (B[:,*], dest), (B[*,:], v)), where again B is the array that determines the iteration order?

So if possible I would prefer having a single couple function that could be heavily specialized for different argument types; eachindex and zip would be needed only internally (called by couple).

I am also not sure it is always a good idea to put one iterator in charge of the whole iteration order. How would you, for example, formulate the loop for map!(dest,f,a,b), where the three arrays might have different storage orders? Could one implement a couple function that lets the three arrays dest, a, and b decide via "majority vote" how to iterate?

timholy commented 8 years ago

You're absolutely right; the intent was to keep all the shape information, and I overlooked the problem you pointed out. I was trying to avoid directly indexing arrays with something weird (the function *), since it seems like that could be unpopular. But what I wrote won't work. A possible alternative would be B[?,:], because ? is not defined yet, and so indexing arrays with ? seems like it might be OK.

Your two points also come together very nicely if one thinks about matrix-matrix (rather than matrix-vector) multiplication. First, you pretty much have to have a syntax that addresses the arrays directly, rather than some "master" iterator. Second, you're right that you probably don't want any iterator in charge. Something like this?

for (idest, iA, jdest, jB, kA, kB) in couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
    dest[idest,jdest] += A[idest,kA]*B[kB,jB]
end

which could hopefully be simplified to

@iterate dest[i,j] += A[i,k]*B[k,j]

Here, there's not actually much magic in @iterate (other than parsing the pattern of indexes and array-accesses), all the hard work is being done by couple.

My big worry is that couple could be quite hard to write. Still, it seems like the right place to put the complexity: do a bunch of analysis right at the beginning of the loop to figure out the optimal access pattern.

In some situations it may be difficult/impossible to make it type-stable (though all the ?/: indexing is designed to make this more feasible). As a fallback one could do

function foo!(dest, A, B)
    iters = couple(...)
    _foo!(dest, A, B, iters)
end

@noinline function _foo!(dest, A, B, iters)
    for (idest, iA, jdest, jB, kA, kB) in iters
        ...
    end
end

However, for the @iterate macro this would need to be

function foo!(dest, A, B)
    iters = couple(...)
    @noinline function _foo!(dest, A, B, iters)
        for (idest, iA, jdest, jB, kA, kB) in iters
            ...
        end
    end
    _foo!(dest, A, B, iters)
end

I haven't yet tested whether _foo! could be placed inside foo! and yet solve the type-instability problem. (I suspect so, but worth testing.)
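
For reference, the non-nested version of this pattern is the standard function barrier; a minimal runnable demonstration in current syntax (the names and the deliberately type-unstable iters are mine):

function bar!(dest, src)
    # deliberately type-unstable: iters could be either iterator type
    iters = rand(Bool) ? eachindex(dest) : CartesianIndices(dest)
    _bar!(dest, src, iters)
end

@noinline function _bar!(dest, src, iters)  # specializes on typeof(iters)
    for i in iters
        dest[i] = src[i]
    end
    return dest
end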

eschnett commented 8 years ago


    for (idest, iA, jdest, jB, kA, kB) in couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
        dest[idest,jdest] += A[idest,kA]*B[kB,jB]
    end

Ooooooh. This is now awfully close to an abstract index notation, which allows generalizing this to more than two dimensions:

for magic in joined(dest[:i,:j], A[:i,:k], B[:k,:j])
    dest[magic,:i,:j] += A[magic,:i,:k] * B[magic,:k,:j]
end

Obviously the syntax for the loop body can be improved, maybe with a macro, maybe with clever types (e.g. turning the indices i,j,k into functions that take the array as argument).


timholy commented 8 years ago

@eschnett, what you're suggesting is exactly what the @iterate macro does. Symbols don't have different types, so to do this via functions you have to introduce the appropriate types. But macros can parse the symbols, and emit code that is properly type-stable. In other words, the @iterate macro is effectively a pretty wrapper around couple.
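
To spell out the symbols-vs-types point: a bare :i is just a Symbol at runtime, but wrapping it as Val{:i} lifts it into the type domain, where dispatch (and hence type-stable code generation) can see it. A tiny illustration:

dimname(::Type{Val{:i}}) = "row index"
dimname(::Type{Val{:j}}) = "column index"

dimname(Val{:i})   # resolved by dispatch at compile time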

eschnett commented 8 years ago

I was trying to generalize the notation for : and ?, and then got side-tracked by what the loop body should look like. So instead of writing

couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))

one could write

coupled(dest[:i,:j,:k], A[:i,:j,:l], B[:j,:k,:m], C[:m])

which allows for more than two indices that can be used in arbitrary order, allowing transposition.

mbauman commented 8 years ago

Let's look at it from the other side. It seems there are three things @iterate needs to do for an expression like dest[i,j] += A[i,k]*B[k,j]:

@iterate (i, j, k) begin # specify iteration order
   D[i,j] += A[i,k]*B[k,j]
end

# would expand to:
(sz1, sz2, sz3) = … # get loop lengths and validate array sizes
Ditr = eachindex(D, (1, 2)) # will want Val{(1,2)} or some such
Aitr = eachindex(A, (1, 2))
Bitr = eachindex(B, (2, 1))
Dstate = start(Ditr)
Astate = start(Aitr)
for _i=1:sz1
    Bstate = start(Bitr) # B isn't dependent upon i, @iterate just restarts it every loop
    for _j=1:sz2
        (iD, Dstate) = next(Ditr, Dstate) # D isn't dependent on k, so it hoists it
        Astate′ = Astate # A isn't dependent upon j, so we jump back to where we were
        for _k=1:sz3
            (iB, Bstate) = next(Bitr, Bstate)
            (iA, Astate′) = next(Aitr, Astate′)
            D[iD] += A[iA]*B[iB]
        end
    end
    Astate = Astate′
end

I believe this is sufficiently general that @iterate will always be able to work out where to start, hoist, and hold iteration states. The magic, of course, now happens within eachindex(A, permutation). And while I think that will also be rather challenging to write for more than two dimensions, I think it'd be much simpler than trying to couple iterators together.
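
For a plain dense matrix, the permutation-aware eachindex could be as simple as swapping the nesting order of a generator; a sketch (the name and the Tuple-based spelling are mine, not the proposed Val form):

# Yield (i, j) positions with dimension perm[1] varying fastest.
eachindex_perm(A::AbstractMatrix, perm) =
    perm == (1, 2) ? ((i, j) for j in axes(A, 2) for i in axes(A, 1)) :
                     ((i, j) for i in axes(A, 1) for j in axes(A, 2))

collect(eachindex_perm(ones(2, 2), (2, 1)))  # [(1,1), (1,2), (2,1), (2,2)]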


I'm sure you're aware, Tim, but I'd like to also point to TensorOperations.jl, which I think does some of what @iterate would do (just not the fast iterated non-cartesian part). And although its reason for being is completely different, numpy.einsum is also somewhat similar. Lots of people seem to use it, but I've never been able to sink my teeth into it — I find it quite unreadable.

timholy commented 8 years ago

@eschnett, the reason I think you need to set this up in a type-stable manner is that some array types aren't efficiently indexed by an NTuple of integers. For example, if A is a SparseMatrixCSC, A[i,j] (for integer i, j) does not have impressive performance; a much better strategy is to iterate over an index into rowval/nzval and keep track of the corresponding i, j as you go. ReshapedArrays are another good example, and the one that drove me to think about this. For both of these cases, you'd really like to use a custom iterator specialized to extract the last bit of performance. (Particularly for the sparse case, if you only need to visit filled values the savings are huge.)

@mbauman, I suspect you're right about some of the challenges. What you're describing is basically how the macros in KernelTools work currently; I also wrote a nifty little macro (@time on steroids) that tests all possible loop orderings and reports the time for each, thus giving the user the chance to hard-code the one that works best. I should take a look at TensorOperations, I haven't looked in a long time so probably don't remember all the tools (or they may have changed).

But for generic code (that might have to handle Matrix and MatrixTranspose), it sure would be nice to be able to be flexible without having to write tons of cases out by hand. May be a pipe dream (I am sure that many have tried to do this before...e.g., Polly).

timholy commented 8 years ago

@mbauman, in your eachindex(A, (d1, d2)), am I reading this right that you're specifying that dimension d1 is the "inner" dimension and d2 is the "outer" dimension? Seems like a promising idea!

mbauman commented 8 years ago

Yes, precisely. The core idea is that each array only needs to concern itself about the order of iteration, and the @iterate macro itself handles starting, hoisting, and holding the iteration state at particular points in the expansion to ensure things stay in sync.

timholy commented 8 years ago

I guess the reason I didn't go this way is because I was attracted by the hope of solving the loop-ordering problem. Since the macro can't see the types, I was looking for a way for it to set things up in a form that would be amenable to type analysis.

eschnett commented 8 years ago

@timholy I don't understand the connection between the order in which the indices are traversed and the syntax to describe which indices need to be coupled.

I agree that sparse (or reshaped, or symmetric where only part of the elements are stored, etc.) arrays require special care.

Let's take this loop as example:

for (i, j, k) in coupled(r[:i,:j,:k], a[:i,:j,:k], b[:j,:k,:i], c[:k,:i,:j])
   r[i,j,k] = a[i,j,k] + b[j,k,i] + c[k,i,j]
end

What would this look like in the ?/: notation?

timholy commented 8 years ago

Something like couple(iblock, jblock, kblock) where

iblock = (r[:,?,?], a[:,?,?], b[?,?,:], c[?,:,?])
jblock = (r[?,:,?], a[?,:,?], b[:,?,?], c[?,?,:])
kblock = (r[?,?,:], a[?,?,:], b[?,:,?], c[:,?,?])

I.e., you match the indexes with the :s.

One of the points perhaps worth making is that I was also looking for an analog of for i = 2:n..., and the ?/: syntax might generalize reasonably well there. But bad things happen if there are no ?, because then it would dispatch just to regular getindex.

All this is to simply explain my reasoning; while I'm attracted by solving the loop ordering problem, I recognize that this may not be possible/practical. (I don't know the literature here.)

eschnett commented 8 years ago

@timholy Okay; that's a straightforward transformation from the (:i,:j,:k) syntax I like. There's essentially one block per index, specifying where the index appears for each array.

meggart commented 8 years ago

Determine the optimal loop order. This is hard… and an endless task if we consider BLAS-like partial loop striding. I say we punt, and require the user to specify the order.

I just want to support @timholy's point that (at least partially) solving the loop-ordering problem would be very nice. Otherwise one would have to accept that generic library functions like map! and copy! might be slow for transposed matrices, since the user would not be able to influence the loop order.

timholy commented 8 years ago

Perhaps I'm most interested in a stepwise approach: if possible, pick an API that doesn't prevent us from tackling this someday, but punt on loop-ordering for now. Revamping how we do array iteration is a big problem no matter how we slice it, and we might make faster progress by first choosing the smallest feasible unit and getting that working.

mbauman commented 8 years ago

I think it still may be possible to do something about the for-loop ordering problem, but it's a bear. To do so, we'll need to use a bunch of flat iterators. Let's take a step back from syntax and type-stability, and just look at what kinds of iterators are required for a flat matrix-multiplication loop:

for i=1:ni, j=1:nj, k=1:nk
   D[i,j] += A[i,k]*B[k,j]
end

Consider a vector that's been reshaped into a matrix. How many different iterators does this type need to implement in order to efficiently participate as either A, B, or D in the above expression?

I think this demonstrates that there are two orthogonal concerns here. As far as the individual array goes, it's really only concerned about traversal ordering. couple (or whatever) can wrap those iterators with the required wrapper in order to comply with the structuring of the other loops.

So what kind of information do we need to dynamically construct those iterators with full generality? An eachindex API would just need to know the ordering of dimensions. couple needs to know which indices are used in each array so it can arrange for an ordering and wrap iterators appropriately.

for (iD, iA, iB) in couple(D, (:i,:j), A, (:i,:k), B, (:k,:j))
   D[iD] += A[iA]*B[iB]
end

Now with this API, I can finally begin to imagine how couple would work. It can try to find an ordering of ijk that satisfies the most arrays, then request the ordered index iterator from each array, then wrap it with the required iterator to reflect the chosen for-loop structure. Of course, with three-dimensional arrays, there'd also be partially optimal orderings. And this will almost certainly be dependent upon array size, too. Can we make JuMP a dependency of array iteration?

I think this is the fully general, fully orthogonal design. If we break this orthogonality, I'm afraid that we'll never stop writing different flavors of eachindex iterators for every single array type.

timholy commented 8 years ago

Can we make JuMP a dependency of array iteration?

I was also wondering if we'd need an optimizer. I doubt that's a direction we want to explore, unless it would also be useful for making more rational decisions about automatic inlining (which it surely would be). Punting to Polly, by marking a block or function with a future @polly hint, seems like a better way to handle this problem.

timholy commented 8 years ago

@mbauman, focusing on the reshaped-vector part of your comment above, a lot of the concerns can be resolved if we have an API for iterating along a particular dimension, aka #15459. This is a different form of orthogonality than you're talking about, and it would resolve the need for many different variants of repeat. It might also be easier to write: e.g., for a SparseMatrixCSC, as demonstrated in #15459 the notion of a "column iterator" and "row iterator" are much more straightforward than supporting some kind of repeat iterator that can start and stop anywhere. For ReshapedArrays, single-dimension iterators are annoying but straightforward; as described in the blog post, even when they require a div for construction it greatly reduces the cost of the whole loop.

A second issue is that fully-orthogonal iterators may not generalize well to efficient sparse iteration: you might want your sparse array iterator to "jump" to the next stored entry, and you might need to correspondingly advance any coupled iterators. This kind of thinking is what first suggested to me that something like couple is an important addition to the API: zip provides complete independence, but in zip(eachstored(A), eachstored(B)), if A is sparse and B is dense, the two will rapidly get out-of-sync. In contrast, in couple(eachstored(A), eachstored(B)), the state for B might be updated in a manner "slaved" to the changes in A. It would be nice to make this symmetric, so that no one iterator is "in charge," but I haven't thought that through yet. If it is symmetric, then couple isn't actually expressive enough on its own: in the example above, if A and B have different numbers of stored elements, should it visit the location of only those that are stored in both? Or stored in either one? At a minimum there's an ||/&& problem here.

I suspect that array-iterators need to be a specialized class, supporting more operations than just start, next, and done. Thinking about the sparse case, one candidate is skip, where the skip-value could be positive or negative. This won't be terribly efficient if skip requires calling div (which it would for something like a ReshapedArray). But I think that's OK; we can't have everything. I'll be happy with 90% of everything :smile:.
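
As a strawman for what skip might look like on a sparse column iterator, where the state is a position into rowval (the function and argument names are mine):

# Advance `state` to the first stored entry whose row is >= target.
# (A linear scan for clarity; since rowval is sorted within a column,
# a real implementation could use searchsortedfirst instead.)
function skipto(rowvals::Vector{Int}, state::Int, target::Int)
    while state <= length(rowvals) && rowvals[state] < target
        state += 1
    end
    return state
end

rows = [2, 5, 9, 14]       # stored rows in one CSC column
st = skipto(rows, 1, 7)    # st == 3; rows[st] == 9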

tkelman commented 8 years ago

In my mind you would express couple(eachstored(A), eachstored(B)) as either union or intersect depending on the operation being performed.

timholy commented 8 years ago

Thanks to #13412, I'm noticing that this is also possible: couple(any, x, y) or couple(all, x, y). I'm not sure you'd really use the any and all functions directly, but you can dispatch on them:

@eval foo(::$(typeof(any))) = 1
@eval foo(::$(typeof(all))) = 2

timholy commented 8 years ago

To help see what different APIs "feel like" for writing code, as an alternative to endlessly creating gists I opened ArrayIterationPlayground and invite collaborators. This doesn't (yet) contain any implementations of any API, but there's one file that implements a couple of algorithms using a fictitious API.

I'd recommend we keep discussion mostly here, though, since it's presumably of general interest.

mbauman commented 8 years ago

The design space here is just ginormous even before you begin considering sparse arrays. I'm trying to bring a dose of reality to all the magic that's flying around. :)

My knee-jerk reaction is that trying to extract a single cartesian index out of a given iterator to share with another array will prove to be intractable to code generically. There are two operations that may or may not be possible: eachindex iterators may or may not be able to quickly compute the cartesian indices, and the array that index is shared with may or may not be able to use a cartesian index (especially if you consider OffsetArrays, but I'm still not convinced we should support them without requiring a special offset index type). So you've got to support independent iteration no matter what.

I don't think you want to express this as foo[:, ?], since, to my reading, that'd require foo (be it an array or index iterator) to conjure up a wholly independent iterator object. Better to pass the shared index information separately and let couple sort it out.

As far as eachindex over particular dimensions goes, I'm in support. One possibility to allow for non-column-major access is to support something like eachindex(A, Order{2}(:), 42, Order{1}(2:size(A,3))) — that'd iterate over (1, 42, 2), (1, 42, 3), (1, 42, 4), …, (2, 42, 2). Expressing this with dispatch in a sensible way worries me… as does specializing it for each iterator type. And I don't think it helps couple simplify any of the work it requires, since it's still got to start and stop these iterators at the right points for a flat loop to work.

All that said, I'd be very happy to be proven wrong.

timholy commented 8 years ago

Reality is good. I agree that the challenge here is to define the impactful contribution that's feasible and doesn't prevent more progress by future generations.

The implementation of foo[:,?] I had in mind is just

immutable AnyEntry end
const ? = AnyEntry()

# Typically AA<:AbstractArray, but not sure I yet see a reason to require that
immutable DimSelection{AA,D<:Tuple{Vararg{Union{Colon,AnyEntry}}}}
    A::AA
end

@inline getindex(A, I::Union{Colon,AnyEntry}...) = DimSelection{typeof(A), typeof(I)}(A)

which is just basically a syntax for passing it on, as you say. I'm not sure whether it's a readability-enhancer, however, and the explicit eachindex is definitely on the table. (It's basically a "lazy eachindex", since it doesn't yet commit to a particular iterator type.)

And yes, writing couple is going to be hard. Before trying that (with couple or any other API), I'm beginning to think the best thing is to see how well the API works before trying to implement it. The recent PRs from several kind folks addressing the TODO list in #15434 are a huge asset, since they help highlight "interesting" iteration cases.

timholy commented 8 years ago

To take out a little of the magic, I'm currently imagining something vaguely like this:

immutable CoupledIterator{I,F<:Tuple{Vararg{Function}}}
    iter::I
    itemfuns::F
end

This has a "master" iterator iter, and a tuple-of-functions that produces the item for each of the coupled iterators. next might be defined generally as just

next(iter::CoupledIterator, state) = map(iter.itemfuns, state), next(iter.iter, state)

or something. That map is mapping a tuple of functions over a single object state, which is kinda the opposite of our usual meaning, but hopefully you get the drift. The point is, this produces an item for each of the arrays that went into the iteration. Since it can be of any type, it has at least the potential to allow for efficient access.

As an example, for sparse matrix-vector multiplication A'*v (so we're taking a transpose), I could imagine that couple(all, eachstored(A, (:,j)), v) would have a default fallback behavior of something like

function couple(::typeof(all), iter::SparseCSCColumnIterator, v::AbstractVector)
    CoupledIterator(iter, (state->state, state->state.row))
end

This doesn't work out all the details, but it suggests that couple maybe isn't as hard as feared. What this achieves is "slaving" the index for v to the iterator that only visits the stored items of A.

Now, if v is a ReshapedArray or itself sparse, this won't be optimal. I'm prepared to live with that in exchange for the benefit that one gets from handling the sparse matrix sensibly. I'm hoping that this kind of foundation will (1) give a big improvement over current behavior, (2) may not be an impossible dream, and (3) might be an API that's flexible enough that it won't stand in the way of anyone wanting to tackle more ambitious cases some day.

Does that help? Or did you already read my mind (so this is all old news to you) and then see another few miles ahead to big scary roadblocks?

mbauman commented 8 years ago

It does help some, yes. I had been thinking you were prioritizing reshaped and offset arrays over eachstored.

I think the couple that you propose in that latest example can be more generally expressed as:

couple(iter, ::Union{Val{:row}, Val{:col}}...) # or dim numbers, or in the value domain, doesn't matter

Now you can also get the column index for the destination vector: (ia, iv, idest) = couple(eachstored(A), Val{:row}, Val{:col}). Heck, make those Val arguments tuples and now you can allow coupling A element-wise to another matrix of the same size (or transpose). That's tractable. But only A gets special handling. All other arrays used here get indexed by cartesian indices.

Where I get all fuzzy is how to extend this to algorithms like gemm, where there's no one array in charge of the entire triplet of loops. Or in allowing the slave arrays to have a say in the kind of iterator they're indexed by. It's those cases where I think I can see a combinatorial explosion of methods with crazy ambiguity problems coming over the horizon.

timholy commented 8 years ago

OK, over at ArrayIterationPlayground I've written out a number of algorithms in a proposed API, and that experience seems to suggest we're converging. (It's possible I wasn't consistent, as my API plans evolved as I tried new things.) I found that this API may suffice to implement things as complicated as cholfact!, provided that we only care about arrays with numeric indexes and not arrays indexed as A[:dog, :cat].

Here's a summary of where I am now:

inds(A, d)               # index vector for dimension d (default 1:size(A, d))
inds(A)                  # tuple-of-inds
zeros(Int, -3:3, -1:5)   # creates matrix of size 7-by-7 with the given inds
fill(val, indexes...)    # likewise

a[first+1:(last+first)÷2]          # copy (or view) of the first half of array, skipping the first element

# `index` and `stored` return "indexing hints," the laziest of wrappers
index(A, :, j)          # lazy-wrapper indicating that one wants indexes associated with column j of A
stored(A, :, j)         # just the stored values of A in column j
index(stored(A, :, j))  # just the row-indexes of the stored values in column j
index(A, :, ?)          # row-index iterator for an arbitrary (unknown) column of A
index(A, Dimension{2})  # similar to index(A, ?, :, ?...)

# Likely: (notice this is different from sub, because this keeps original indexes)
index(A, first+1:last)  # would iterate over 2:length(A) for a typical Array
index(A, first+1:last, j)  # like above, but just over indexes for dimension 1

couple(iter1, iter2)    # iterates over iter1, iter2 containers, keeping them in sync
# Do we want/need this? I suspect not (default would be `any`)
couple(any, stored(iter1), stored(iter2))  # visits i if either iter1[i] or iter2[i] has a value
couple(all, stored(iter1), stored(iter2))  # visits i if both iter1[i] and iter2[i] have values

icat(a, b)  # index-preserving concatenation
# For example:
#    icat(7:10, 3)  7:10 is indexed with 1:4, so this creates a vector indexed from 1:5 (numbers aren't tied to an index)
#    icat(3, 7:10)  creates a vector indexed from 0:4
#    icat(5:7, 2:4) is an error, because they have overlapping indexes 1:3
#    icat(5:7, OffsetArray(2:4, 4:6))  indexed from 1:6
#    icat(5:7, OffsetArray(2:4, 5:7))  an error, non-contiguous indexes
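
To give a feel for how these pieces might compose, here is matrix-vector multiplication written against this (entirely unimplemented) API; pairing index(stored(...)) with stored(...) via zip is my reading of the proposal, not settled semantics:

function matvec!(dest, A, v)
    fill!(dest, 0)
    for j in inds(A, 2)
        vj = v[j]
        # row indexes and values of the stored entries in column j
        for (i, a) in zip(index(stored(A, :, j)), stored(A, :, j))
            dest[i] += a * vj
        end
    end
    return dest
end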

Other than the Order type proposed by @mbauman above, I think this incorporates most of the feedback so far. I might prefer to tackle storage order in a second pass, since this is already peering pretty deeply into the crystal ball. And by having iterators for specific dimensions, that may be less relevant.

I've got a lot of teaching to do in the immediate future, so am unlikely to make much further progress for a while. But unless I hear otherwise I may slowly start implementing pieces of this. It's a lot of new exports, so obviously this deserves some thought.

eschnett commented 8 years ago

This looks interesting. I'll probably implement them as well for https://github.com/eschnett/FlexibleArrays.jl.

Painting the bike shed: The name couple is too generic; it doesn't indicate how the iterators are coupled. (When I first read it, it evoked a "truck-and-trailer" coupling, which would sequence the iterators -- also a nice idea.) Given that the function takes two iterators as arguments, it's kind of obvious that it combines them in some way, so that doesn't need to be part of the name. You could call it parallel, par, insync, synced, simultaneously, multi, union, intersect, zip, zipindex, izip, ...

timholy commented 8 years ago

Thanks for reminding me about FlexibleArrays, I knew I had been told about another "offset" array indexing package out there and just forgot what it was called and who wrote it. It's obviously further along than the one I've been referring to.

Anyway, on the bike shed: I agree couple is too vague, and I really like both insync and synced. zip is out because it already has a meaning different from couple: its iterators increment independently of one another, and that's the opposite of the intent here.

vtjnash commented 8 years ago

I'm not sold on the complexity of couple being necessary. But I haven't put anywhere near as much effort into this as you, so I'll just note my concerns. However, it's not clear to me that you would be able to define this functionality in a general way. Even if you could achieve 80% of the speed of a raw array in the basic case, I still think you are giving up the possibility of vectorizing and blocking in the kernel. At that point, I believe it becomes faster to make a copy first. That also means that the kernel library author only needs to ensure their algorithm is optimized (and tested) for regular Arrays and the array-subtype library author only needs to ensure that they have an optimized copy method.

@iterate B dest[i] += B[i,j]*v[j]

I'm fairly certain this is just special syntax for broadcast (with an accumulator?) or more generally, another implementation of @devec?

Likewise, copy! might be implemented

My sense for copy! is that the most sensible definition is:

for x in keys(src)
  dest[x] = src[x]
end

which is specifically not an iteration over dest (except in the common array case where the keys of dest were the same as the keys of src)
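
For what it's worth, this definition is already generic over associative containers, and in current Julia (where keys(::Array) now exists) it covers arrays too; a quick check, with the helper renamed:

function copylike!(dest, src)
    for x in keys(src)
        dest[x] = src[x]
    end
    return dest
end

copylike!(zeros(3), [1.0, 2.0, 3.0])            # array keys: eachindex-style
copylike!(Dict{Symbol,Int}(), Dict(:a => 1))    # dict keys: :a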

timholy commented 8 years ago

The main place you need something like couple is in generalizing sparse matrix algebra. If we don't have a generic iteration framework, we need to implement special methods for each of, e.g., Tridiagonal*Bidiagonal, Bidiagonal*Tridiagonal, Banded*SparseMatrixCSC, etc. Conversion to full is very expensive for such matrices.

Yes: now that I think of arrays as associative containers, I agree that's the right meaning of copy!. In my current thinking, keys(src) is equivalent to index(stored(src)), and could be supported for arrays (which it isn't currently).

mbauman commented 8 years ago

keys is particularly nice in that it doesn't guarantee an order to the iteration, possibly pointing a path towards a row-major solution.

vtjnash commented 8 years ago

If we don't have a generic iteration framework

I'm not against that, I'm only questioning whether couple is better than zip. From the top post, I understood that zip should be able to implement the algorithm, but might do a bit of extra work when computing the iterators? Re the initial motivating example of ReshapedArrays, I think the copy-then-kernel or generate-then-copy approach may be faster (and as a bonus resolves the aliasing question)

timholy commented 8 years ago

I should have clarified that your copy! implementation is what I think of as the right "meaning" of copy!, with one potential caveat: right now we support copy!(dest::Array{T,2}, src::Array{S,3}) as long as length(dest) >= length(src) and convert(T, s::S) works. A strict notion would probably favor forcing this to become something like copy!(vec(dest), vec(src)) or something (where vec creates a view). That would require that the array's indexes can be meaningfully converted to a linear index, which then gives a point of correspondence between the two arrays. Of course, for iterating through them, there will be cases where you might like something more efficient than a linear index. (For example, copy!(sub(A, 1:end-1, 1:end-1), src).) It's also a good example of where "just make a copy, that solves most performance problems" isn't viable: it might work for src, but not for dest.

Regarding couple vs zip, I guess my point isn't that one is better than the other, but that we need both because we want different behaviors in different circumstances. For something like computing ::SparseMatrixCSC * ::AbstractMatrix, I'm struggling to see how to use our regular zip---perhaps you can explain? As I'm seeing it, for zip I think the contract is that each iterator is independent of the other, and so I don't see a good way to skip over the non-stored elements of the first array without correspondingly "advancing" the iterator for the second.

Unless you're proposing that zip get some keyword arguments endowing it with flexible behavior? Something along those lines seems fine to me; I'm not trying to argue we need a new name, just that we need new functionality.

timholy commented 8 years ago

I also edited the top post to try to reduce confusion from the fact that the API proposal is being hammered out here, and hence the top post isn't very representative of my current thinking.

JaredCrean2 commented 8 years ago

I haven't been much involved in various stages of iterator development, but looking at #1032 sparked an idea: what if there was a callback function that each iterator gets passed to before get/setindex? This would abstract away the problem of iterators needing to know about things like whether the array indices start at 1, 0 or some other number. The iterator can say "I point to the 3rd element in the 5th row" and the callback function is responsible for transforming this into a value that get/setindex! will accept.

This could solve several of the problems discussed above.

The generic fallbacks for this callback function would define the "standard" array implementation (1-based indexing etc.), but users could provide specific methods for new AbstractArray subtypes to get different behavior.

mbauman commented 8 years ago

That's basically what the status quo is with cartesian indices. In fact, when you index an Array with cartesian indices, it eventually gets reduced to a linear memory offset. There's a little extra overhead involved compared to just iterating over the linear range 1:length(A), but it's not terrible. Our array infrastructure was built upon the assumption that indexing with cartesian indices is relatively fast — either it's the fastest possible way to index into the array (what we call LinearSlow), or it can be converted to a linear index without much overhead (for LinearFast arrays).

That's not true, though, when ReshapedArrays wrap LinearSlow arrays. Accessing the elements of reshape(sprand(2,3,1.0), 6) as though it's a vector, for example, requires using division to figure out that the fifth element is actually pointing at (1,3) in the parent array. That's dog-slow, especially if you have to do it in every single loop. But if you know ahead of time that you're just going to iterate over all the elements in column major order, then you can be clever. With reshapes, the cleverness is just that the order of the elements is the same, regardless of the shape. So if you ask for an iterator over all indices, then ReshapedArray can just wrap the parent array's iterator, and it'll be just as fast as if it weren't reshaped at all. You, as the user, don't even need to know how it worked this magic: eachindex(A) just does it for you.
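
Concretely, the translation in question, in current syntax (the thread predates the removal of ind2sub):

# Linear index 5 into the reshaped 6-vector maps back to (1, 3)
# in the 2x3 parent; this is the div/rem work a LinearSlow parent forces.
Tuple(CartesianIndices((2, 3))[5])   # == (1, 3)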

The trouble is that you don't always want to iterate over all indices, and not always in column-major order. If you're able to describe the access pattern up-front when you request the index iterator, then it allows for cleverness of this nature, even in more complicated nested loops with multiple arrays. And if we're able to always ask arrays for index iterators, then maybe this can safely allow offset arrays to define non-standard Int indexing.

JaredCrean2 commented 8 years ago

I see, thanks for the clear explanation. ReshapedArrays are a much harder problem than I was envisioning.

timholy commented 8 years ago

ReshapedArrays are a much harder problem than I was envisioning.

Only if you want to avoid performance regressions. If you don't care about that, they're really easy :smile:.

shashi commented 8 years ago

I came across @Jutho's TensorOperations.jl. The correspondence between the APIs proposed here and the one in TensorOperations is fairly clear. As far as the DSL part of the solution to this issue goes, TensorOperations' DSL seems to be good at expressing loop-ridden algorithms without committing to any implementation scheme, or even to what the allowed indices are.

I wanted to point out this excerpt from the README that talks about a clever implementation strategy:

For add!, trace! and the native implementation of contract!, the problem is recursively divided into smaller blocks by slicing along those dimensions which correspond to the largest strides for all of the arrays. When the subproblem reaches a sufficiently small size, it is evaluated by a separate kernel using a set of nested for loops. The implementation depends heavily on metaprogramming and Julia's unique @generated functions to implement this strategy efficiently for any dimensionality. The minimal problem size is a constant which could be tuned depending on the cache size. The modularity of the implementation also allows to easily replace the kernels if better implementations would exist (e.g. when more SIMD features become available).

I think parallel implementations can exploit this divide-and-conquer strategy as well. Food for thought.
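
A minimal 2-D sketch of that recursive strategy (the threshold and the split-the-larger-dimension heuristic are simplifications of what the README describes):

function blockadd!(dest::AbstractMatrix, src::AbstractMatrix, threshold::Int = 4096)
    m, n = size(dest)
    if m * n <= threshold
        @inbounds for j = 1:n, i = 1:m   # small enough: plain kernel
            dest[i, j] += src[i, j]
        end
    elseif m >= n                        # split the larger dimension in half
        h = m ÷ 2
        blockadd!(view(dest, 1:h, :),   view(src, 1:h, :),   threshold)
        blockadd!(view(dest, h+1:m, :), view(src, h+1:m, :), threshold)
    else
        h = n ÷ 2
        blockadd!(view(dest, :, 1:h),   view(src, :, 1:h),   threshold)
        blockadd!(view(dest, :, h+1:n), view(src, :, h+1:n), threshold)
    end
    return dest
end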

aaronsheldon commented 8 years ago

Perhaps I am way off base on this one, but it seems to me that in its most general form this is an attempt to re-invent database joins, especially in the case of sparse matrices. But it has been known for some time that for large numbers of indexes (i.e. rows) it will be hard to beat hash tables on the specific (fields/columns) indexes being joined.

Essentially sparse matrix multiplication is a form of inner join, and sparse matrix addition is a form of outer join.

Put another way, while it may seem you want to march both arrays in some form of "simultaneous" manner, in the most general case, where you give up all prior knowledge of the memory layout, you would want to read one array in the most efficient linear manner possible, and then the other array, while using a pair of hashes that are dual to one another (e.g. hash 1 maps in column major order and hash 2 maps in row major order).
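
Sketching that correspondence with Dicts standing in for the hash tables (keys are the stored indexes; an elementwise product stands in for the inner join; all names mine):

# Sparse addition as an outer join: the union of stored keys.
function joinadd(a::Dict{Int,Float64}, b::Dict{Int,Float64})
    c = copy(a)
    for (k, v) in b
        c[k] = get(c, k, 0.0) + v
    end
    return c
end

# Elementwise multiplication as an inner join: the intersection of keys.
joinmul(a::Dict{Int,Float64}, b::Dict{Int,Float64}) =
    Dict(k => a[k] * b[k] for k in keys(a) if haskey(b, k))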

Sacha0 commented 6 years ago

Ref. https://github.com/JuliaLang/julia/blob/296d5728a2974ac1e3efaf358288c40b05e3abee/stdlib/SparseArrays/src/higherorderfns.jl#L32 for a little work in this direction. (Namely a hacky common interface for iteration over SparseVectors and SparseMatrixCSCs sufficient for the purposes of sparse map and broadcast.) Best!

StefanKarpinski commented 5 years ago

Is this issue still relevant after the revision of the iteration protocol?

vtjnash commented 5 years ago

Skimming quickly, I don't think it had anything to do with the protocol, but parts of it may have been enabled by the changes to the broadcast framework, and it is perhaps motivating future work such as "For Loops 2.0: Index Notation And The Future Of Tensor Compilers" by Peter Ahrens: https://www.youtube.com/watch?v=Rp7sTl9oPNI

I don't think we need an issue open though, if there's nothing to change specifically

tkf commented 5 years ago

(Oops, I've been preparing to post a possible new solution here, and then I noticed that it's closed. But please allow me to continue the discussion. I think "how to use the best/a good-enough iteration strategy for a given set of containers?" is still an issue.)

I'd like to suggest an approach to this problem "dual" to the iterator-based APIs: use foldl (or reduce) and foreach derived from it for complex loops such as the ones discussed here. The basic API I came up with is:

using NDReducibles: ndreducible
using Referenceables: referenceable

# Implementing C += A * B
foreach(
    # Create a "virtual container" (reducible) where the indices are coupled
    # based on the specification below:
    ndreducible(
        referenceable(C) => (:i, :j),  # mark that we want to mutate C
        A => (:i, :k),
        B => (:k, :j),
    ),
) do (c, a, b)
    c[] += a * b
end

This tries to choose the best nesting order of the for loop. At the moment it only supports DenseArray and the like, composed with Transpose/Adjoint and Broadcasted. A few more demos and docs can be found here: https://tkf.github.io/NDReducibles.jl/dev/ (and https://tkf.github.io/Referenceables.jl/dev/ ). Note that, as you can see, I haven't optimized the terseness of the syntax yet.

This API is based on Transducers.jl's version of foldl, which means it is as powerful as the iterator-based protocol in terms of semantics. That is to say, things like early termination ("break") and filtering also work. However, by going completely to a "callback"-based API, the "loop executor" has much more control than with the iterator-based API. For example, it's possible to implement different parallel execution strategies based on the given set of containers (without macro magic). Fancy iterations like the one based on a space-filling curve are also possible. In fact, a callback-based API naturally fits well with the divide-and-conquer approach.

Furthermore, it allows you to define copying-to-Array-first as a valid strategy if the combination of the given custom arrays is too complicated to handle (of course, it may need to do a quick memory estimate and use this path only when the copies are not huge). Similarly, references to an element of an array (such as c[] above) do not have to directly invoke getindex/setindex!. This could be an interesting optimization when none of the indices for the destination array(s) are used in the inner-most loop (i.e., read once in an outer loop, put it in a RefValue, use it in the inner loops, and then write it back in the outer loop). [1]

One point of view is that this API is a generalization of the broadcasting API: copyto! can be derived from foldl or reduce, but not the other way around. So I think it is worth considering formalising foldl/reduce as the basic interface used by broadcasting as well as mapping, filtering, and reduction. Then complex loops like GEMM can also be derived as a foreach (a side-effect-only mapping).

[1] The last few points are actually doable in a zip-like iterator-based interface as well (but not with a foreach-like interface).