Open rupanshusoi opened 4 years ago
Small name change. Trying to call these 'compact' instances instead of 'compressed'.
Ok, I'm going to brainstorm some of the potential issues:
The main thing right now is that Regent's codegen directly emits all accessor code. Certain assumptions are baked into the compiler. The assumptions that I'm aware of are:
base + index * stride
. This is maximally efficient for our current case but won't work for sparse accessors. We could replace this with a C API call for the sparse case, but that's not ideal, partly because we'd like to inline the call (and we can't inline C API calls right now), but also partly because there are domain-specific optimizations that we can perform.The reason that just calling a C API is not ideal is that there are certain cases where the accessor can be optimized away. In particular, if you write code like:
for x in r do
r[x] -- case (a)
r[f(x)] -- case (b)
end
Case (a) can be optimized, since you're always walking things contiguously. In this case we can bake the sparsity into the loop and avoid the sparse accessor completely, thus making the loop "almost" free. However, case (b) still requires the sparse accessor. So you can't just blindly assume everything falls into case (a).
There's an interesting question of what belongs with case (a) and what goes with (b). Originally I was thinking that only r[x]
can go into case (a), but I think this is actually too strict. An expression like r[x+1]
, if it is valid at all, is guaranteed to be inside the same "chunk" of the accessor. Therefore we could optimize it to take the fast path. I don't know how common this is, but I just mention it since it may apply to more than just the simple case of +1
and -1
. It's worth some thought trying to figure out the exact scope of what expressions we'll handle.
That's all I can think of for now but I know others have ideas on this, so @streichler @magnatelee please chime in if I've missed anything.
An expression like r[x+1], if it is valid at all, is guaranteed to be inside the same "chunk" of the accessor.
This statement will hold for 1-D indexing, but not for higher dimensions. And anything more than +/-1 isn't guaranteed to work even for 1-D.
I can see that if the algorithm for choosing the chunks is unconstrained. But I think it would be an interesting to ask whether there's a useful tradeoff here if we allow some constraints on the chunking algorithm in order to permit optimizations on the generated code; I'm willing to believe that the bloat in memory usage may well be worth it for some cases.
But having said that, I think we need some concrete use cases before we attempt to design anything around that. r[x]
is a must, anything beyond that is wishful thinking until someone actually needs it.
For a first pass implementation, I think case (a) should just turn into a doubly nested loop over Legion's SpanIterator
objects, with the inner loop being the current Regent loop over a pointer with a fixed stride (e.g. the SpanIterator
will give you back a Span
object which is just a pointer and a stride like Regent normally assumes). I think case (b) should fall back to using the MultiAffineAccessor
specialization, which will be correct, but slower for now.
I'm going to try to sketch out a basic design for compact sparse instances in Regent, with a couple possible variations (because not everything is nailed down yet). Starting with the least controversial decisions first:
At least when its operating in "sparse" mode, it's obvious that Regent's code generator needs to distinguish between centered and uncentered accesses. That is:
for x in r do
r[x].f = ... -- centered
r[fn(x)].f = ... --uncentered
end
Basically, anything other than the bare loop iterator x
is uncentered. While I noted some (very, very) limited cases of more complex functions that can be assumed to be centered due to the rules of sparsity (see https://github.com/StanfordLegion/legion/issues/878#issuecomment-646891901), I don't think these are worth worrying about now (or probably ever).
Regent's code generator will expand this into something that roughly corresponds to the following pseudocode:
var slow_accessor = get_slow_accessor(r)
for t in r.tiles do
var base = get_base_pointer(t)
var stride = get_stride(t)
for x in t do
*(base + x*stride) = ... -- fast path for the centered access
*legion_accessor_array_get_pointer(slow_accessor, fn(x)) = ... -- slow path for uncentered access
end
end
This should work out of the box with all existing Regent codes, if those codes have their mappers changed to use compact sparse instances. This is nice because it preserves the functional correctness of these codes. However, it will create a bit hit for codes that currently use dense accessors. This impacts basically all of our proxy and production codes (Circuit, Stencil, Pennant, etc.). So I'm not willing to just turn this on for all codes.
I can think of a couple of possible workarounds:
-fsparse 1
which is off by default in Regent. Without this flag, Regent works like it currently does. Support for compact sparse instances is enabled only if the flag is enabled, and it is expected that existing codes will take a hit. However, this is a pretty big hammer since it has to be enabled or disabled at the level of an entire compilation unit.__demand
or __forbid
flag. More fine-grained but now we're littering the code with annotations. I'm not sure I like this option.Regardless of what we pick, it seems obvious to me that we also need better abstractions for dealing with sparsity directly. I can imagine cases in user code where compact sparse instances are desired, but it's still possible to do local operations on dense blocks of data (e.g., matrix multiply operations). This can't happen in the design above with such a hard split between centered and uncentered accesses.
The basic idea here is to add an intermediate object that users can explicitly iterate over. I've got two ideas here, with different pros and cons:
An example code might look like the following. (Note that this is code written by the user, not compiler generated.)
for t in r.tiles do
for x in t do
r[x].f = ... -- centered
r[x+1].f = ... -- also centered, note that user is responsible for boundary conditions
r[0].f = ... -- uncentered, falls back to slow path
end
end
This can be rewritten to use domains
rather than tiles
if we go that route. (Name subject to change.)
You can see this has a pretty close correspondence with what I'm planning to generate in the compiler, which is another plus.
One implementation detail is the type of r.tiles
(or domains
): I think this will be an "iterator" (perhaps literally iterator(rectNd(r))
or similar) because I don't think we currently want to allow users to do anything other than iterate on it.
I'm happy to take any feedback, but I'm especially looking to get comments on the proposal (1) - (3) above and the abstractions of tiles vs domains, and particularly from interested potential users of this feature on whether this works for their use cases.
Some other plausible scenarios that might come up (users in particular, please comment if you think you might see these in your code):
r[x+1]
or r[x+{1,1}]
, etc. Regent could plausibly optimize these by splitting the tile loop to separately iterate interior and exterior points and emitting centered accesses for all interior iterations. Having said that, users would need some way to handle the boundary conditions and it's not 100% obvious to me how they'd do that in the general sparse case.for x in interior
where interior
is not the index space of region r
. Stencil uses this idiom as a way to handle boundary conditions. The code generator should be general enough to handle this, but the bigger question is whether we can optimize it: without understanding that interior
is r.ispace
shrunk by the stencil pattern, I don't know how we'd do that. (And I don't know what analysis would be able to prove that.)r[...].f
, r[...].g
and r[...].h
shouldn't require separate lookups if the ...
is the same for all of them; we should be able to amortize this to do it once per loop.The bigger question being exposed by these is whether there is a larger abstraction that is (a) general enough to handle what users care about, (b) predictable enough that users can use it with confidence, knowing that it will do what they want, and (c) is reasonable to implement in the compiler. Currently the space of things that might plausibly be supported is large enough that it's a bit of challenge to figure out what meets these three requirements.
I'm definitely a fan of having the dense/compact/maybe-other-stuff-in-the-future choice being a per-region thing (i.e. option 3).
I'm not wild about exposing "domains" to the application programmer because they don't really mean anything in the regent semantics. The tiles
iterator makes sense because it allows you to give chunks of work to some ffi'd thing that only works on rectangles (e.g. all blas libraries ever), and I think we could/should just make the tiles iterator do all the tiles in a chunk before going on to the next chunk.
For folks that want to manually optimize affine access for uncentered references in their own code, my preference is for something like what the old get_strides stuff did - in addition to the base pointer and strides you get the rectangle for which those strides are valid and it's on the application programmer to check them and handle boundaries.
Coming back to the original example where regent does most of the work, I do think something like this:
for x in interior do
r[x].df = r[x+1].f - r[x-1].f
end
should be able to automatically turn into something like this:
for domain in interior.domains do
{ df_base, df_strides } = get_strides(df, domain)
{ f_base, f_strides } = get_strides(f, domain)
for t in domain.tiles do
safe_x_lo = max(domain.x.lo + 1, t.x.lo)
safe_x_hi = min(domain.x.hi - 1, t.x.hi)
-- fast interior loop
for x in [safe_x_lo, safe_x_hi] do
*(df_base + x * df_strides) = *(f_base + (x + 1) * f_strides) - *(f_base + (x - 1) * f_strides)
end
-- slower edges, if needed
if(t.x.lo < safe.x.lo) then
for x in [safe_x_lo, safe_x_hi] do
*(df_base + x * df_strides) = r[x + 1].f - r[x - 1].f -- the x+1 is technically safe here, but this is the slow path
end
end
-- same for safe_x_hi side
end
Even if this only works for bounded-integer offsets (see the Halide paper for examples where supporting dynamic offsets with a static bound is useful), I think it's an important optimization that our story should permit even if we don't implement it right away.
I agree with Sean in that fine-grained control over which regions are sparse would be great. (3) sounds ideal, but if it's not going to be implemented right away then (2) seems to be the next best option.
If in the domain
case you are going to group all valid tiles into a single domain, then it might be a better abstraction than tiles
because it would let me do for pt in r.domains
(or similar) which would use compact sparse instances and would be pretty much identical, at least from a user perspective, to for pt in r
(what we do now).
I think the pattern of iterating over a different index space may be quite common for tasks that take multiple region arguments:
for color in r.colors do
foo(r[color], s[color])
end
My codebase also has instances of the 4th bullet -- accessing different fields of the same point.
Elliott, thanks for drafting this design. It looks good from the point of view of the potential implementation in HTR. It would most likely exercise some of the features that you have described:
I think there is another case here that is missing and comes up in practice with some applications like Pennant:
for x in r do
r2[fn(x)].f = g(x)
end
Where r
is "dense", but r2
is sparse which is different than the cases above because the index space of r2
is not being used directly by the loop. In many cases, Regent can still make a fast AffineAccessor
for r2
. In other cases, it will need to make a MultiAffineAccessor
for doing lookups into which piece of r2
is pointed to by fn(x)
. It would be bad to just have everything fall back to the MultiAffineAccessor
. You should be able to do all these tests before the loop so you can amortize the cost over the loop if you don't want to make the user deal with it, although that will mean that Regent will be generating twice as much code as is probably necessary.
Another optimization that would be good to handle: in the case where we do something like:
for x in r do
r[x].f = ... -- centered
end
where all accesses are "centered". It would be good to generate code that uses the span iterator instead of iterating tiles so that we can handle the case where the source index space is very sparse and contains lots of very small tiles.
@rupanshusoi and @mariodirenzo, it would help to get some very, very simple code samples (just a couple lines, like what we've been sharing up above) to describe what you're planning on doing. In particular, I'd like to come to an understanding of exactly how complicated the actual patterns you'll want to be running are.
In particular, a naive r[x+1]
or similar just isn't going to work in general, because it doesn't take into account the boundary conditions. (It would be easy to do an out of bounds access if you're not careful.) So I'm thinking that HTR's case will need something more complicated, like:
for t in interior.tiles do
for x in t do
r[x].f = r[x+1].g + ... -- stencil access pattern is safe because `t` is a shrunk subset of a tile of `r`
end
end
In particular, interior
is r
shrunk by the width of the stencil, so that the interior accesses are safe. (There would presumably have to be a separate exterior loop, but I'm not showing that here.) This is just more complicated for the compiler to handle because it's not immediately obvious that there's any relationship between interior
and r
. That's why I'd like to understand what exact patterns are required, because
@mariodirenzo Please clarify whether the use case described in #1057 is the motivation for requesting this feature, or whether you have any other use cases in addition to that. Thanks.
Please clarify whether the use case described in #1057 is the motivation for requesting this feature, or whether you have any other use cases in addition to that. Thanks.
I've also another use case, which is a periodic domain whose stencil calculation warp around its bounds. Though, I think that #1057 could address both use cases
As @elliottslaughter asked, creating an issue to discuss the design and implementation of Regent support for the newly added compressed sparse instances in Legion and Realm.