StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: support compact sparse instances #878

Open rupanshusoi opened 4 years ago

rupanshusoi commented 4 years ago

As @elliottslaughter asked, creating an issue to discuss the design and implementation of Regent support for the newly added compressed sparse instances in Legion and Realm.

lightsighter commented 4 years ago

Small name change. Trying to call these 'compact' instances instead of 'compressed'.

elliottslaughter commented 4 years ago

Ok, I'm going to brainstorm some of the potential issues:

The main thing right now is that Regent's codegen directly emits all accessor code. Certain assumptions are baked into the compiler. The assumptions that I'm aware of are:

  1. There is one base pointer, and one stride, per field per region.
  2. A consequence of (1) is that initialization for accessors can be done at the top of a task. An "accessor" in Regent is basically just a base pointer and stride per field.
  3. I'm not aware that Regent ever assumes anything about the strides between fields (e.g. that the strides of array fields are equal, so that they only require a single base pointer), but I could be wrong. I'll need @magnatelee to confirm this one. Even if it's not something we do today though, it may be something we want to do in the future.
  4. At the point where a region is accessed, Regent hard-codes the access as base + index * stride. This is maximally efficient for our current case but won't work for sparse accessors. We could replace this with a C API call for the sparse case, but that's not ideal, partly because we'd like to inline the call (and we can't inline C API calls right now), but also partly because there are domain-specific optimizations that we can perform.
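To make (4) concrete, here is a minimal C++ sketch of the access pattern Regent currently hard-codes. The names (AffineField, field_element) are illustrative only and are not actual Regent or Legion API.

#include <cstddef>
#include <cstdint>

// Sketch of the affine accessor Regent assumes today: one base pointer and
// one stride per field per region (assumptions 1 and 2 above).
struct AffineField {
  uint8_t *base;    // base pointer for this field
  ptrdiff_t stride; // bytes between consecutive elements
};

// Assumption (4): every access is emitted as base + index * stride.
inline double &field_element(const AffineField &f, ptrdiff_t index) {
  return *reinterpret_cast<double *>(f.base + index * f.stride);
}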

The reason that just calling a C API is not ideal is that there are certain cases where the accessor can be optimized away. In particular, if you write code like:

for x in r do
  r[x] -- case (a)
  r[f(x)] -- case (b)
end

Case (a) can be optimized, since you're always walking things contiguously. In this case we can bake the sparsity into the loop and avoid the sparse accessor completely, thus making the loop "almost" free. However, case (b) still requires the sparse accessor. So you can't just blindly assume everything falls into case (a).
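To see why case (b) is expensive, here is a rough C++ model of what a sparse accessor has to do for an arbitrary index. The Piece struct and sparse_lookup function are simplified stand-ins, not Legion's real accessor classes, and a real accessor would search the pieces more cleverly than this.

#include <cstddef>
#include <cstdint>
#include <vector>

// A compact sparse instance modeled as a list of affine pieces, each
// covering a contiguous, inclusive range of indices.
struct Piece {
  int64_t lo, hi;   // index range covered by this piece
  uint8_t *base;    // base pointer for the piece
  ptrdiff_t stride; // bytes between consecutive elements in the piece
};

// Case (b): an arbitrary index must first be mapped to its piece before the
// usual affine arithmetic applies.
inline double *sparse_lookup(const std::vector<Piece> &pieces, int64_t x) {
  for (const Piece &p : pieces) {
    if (x >= p.lo && x <= p.hi)
      return reinterpret_cast<double *>(p.base + (x - p.lo) * p.stride);
  }
  return nullptr; // index not backed by the instance
}

In case (a) the loop itself walks the pieces in order, so this per-element search disappears entirely.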

There's an interesting question of what belongs with case (a) and what goes with (b). Originally I was thinking that only r[x] can go into case (a), but I think this is actually too strict. An expression like r[x+1], if it is valid at all, is guaranteed to be inside the same "chunk" of the accessor. Therefore we could optimize it to take the fast path. I don't know how common this is, but I just mention it since it may apply to more than just the simple case of +1 and -1. It's worth some thought trying to figure out the exact scope of what expressions we'll handle.

That's all I can think of for now but I know others have ideas on this, so @streichler @magnatelee please chime in if I've missed anything.

streichler commented 4 years ago

An expression like r[x+1], if it is valid at all, is guaranteed to be inside the same "chunk" of the accessor.

This statement will hold for 1-D indexing, but not for higher dimensions. And anything more than +/-1 isn't guaranteed to work even for 1-D.

elliottslaughter commented 4 years ago

I can see that if the algorithm for choosing the chunks is unconstrained. But I think it would be interesting to ask whether there's a useful tradeoff here if we allow some constraints on the chunking algorithm in order to permit optimizations on the generated code; I'm willing to believe that the bloat in memory usage may well be worth it for some cases.

But having said that, I think we need some concrete use cases before we attempt to design anything around that. r[x] is a must, anything beyond that is wishful thinking until someone actually needs it.

lightsighter commented 4 years ago

For a first pass implementation, I think case (a) should just turn into a doubly nested loop over Legion's SpanIterator objects, with the inner loop being the current Regent loop over a pointer with a fixed stride (e.g. the SpanIterator will give you back a Span object which is just a pointer and a stride like Regent normally assumes). I think case (b) should fall back to using the MultiAffineAccessor specialization, which will be correct, but slower for now.
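As a rough C++ sketch of that doubly nested structure, under the assumption that each span boils down to a pointer, a stride, and an extent (SpanModel is a simplified stand-in, not the actual Legion Span/SpanIterator interface):

#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for what a span provides: a pointer, a stride, and an element count.
struct SpanModel {
  uint8_t *base;
  ptrdiff_t stride;
  size_t count;
};

// Case (a) lowering: outer loop over the spans, inner loop identical to the
// affine loop Regent already generates (pointer plus fixed stride).
inline void fill_centered(const std::vector<SpanModel> &spans, double value) {
  for (const SpanModel &s : spans) {
    uint8_t *ptr = s.base;
    for (size_t i = 0; i < s.count; ++i) {
      *reinterpret_cast<double *>(ptr) = value;
      ptr += s.stride;
    }
  }
}

Case (b) would keep doing a per-element piece lookup (the MultiAffineAccessor role), which is correct but slower.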

elliottslaughter commented 3 years ago

I'm going to try to sketch out a basic design for compact sparse instances in Regent, with a couple possible variations (because not everything is nailed down yet). Starting with the least controversial decisions first:

At least when it's operating in "sparse" mode, it's obvious that Regent's code generator needs to distinguish between centered and uncentered accesses. That is:

for x in r do
  r[x].f = ... -- centered
  r[fn(x)].f = ... -- uncentered
end

Basically, anything other than the bare loop iterator x is uncentered. While I noted some (very, very) limited cases of more complex functions that can be assumed to be centered due to the rules of sparsity (see https://github.com/StanfordLegion/legion/issues/878#issuecomment-646891901), I don't think these are worth worrying about now (or probably ever).

Regent's code generator will expand this into something that roughly corresponds to the following pseudocode:

var slow_accessor = get_slow_accessor(r)
for t in r.tiles do
  var base = get_base_pointer(t)
  var stride = get_stride(t)
  for x in t do
    *(base + x*stride) = ... -- fast path for the centered access
    *legion_accessor_array_get_pointer(slow_accessor, fn(x)) = ... -- slow path for uncentered access
  end
end

This should work out of the box with all existing Regent codes, if those codes have their mappers changed to use compact sparse instances. This is nice because it preserves the functional correctness of these codes. However, it will create a big performance hit for codes that currently use dense accessors. This impacts basically all of our proxy and production codes (Circuit, Stencil, Pennant, etc.). So I'm not willing to just turn this on for all codes.

I can think of a couple of possible workarounds:

  1. Add a flag -fsparse 1 which is off by default in Regent. Without this flag, Regent works like it currently does. Support for compact sparse instances is enabled only if the flag is enabled, and it is expected that existing codes will take a hit. However, this is a pretty big hammer since it has to be enabled or disabled at the level of an entire compilation unit.
  2. Same as (1) but via a __demand or __forbid flag. More fine-grained but now we're littering the code with annotations. I'm not sure I like this option.
  3. Add proper support for variants. This is more long-term, but it's possible that this could let the user specify, e.g., at the level of individual region arguments, which ones should be considered sparse. The Regent optimizer can then take advantage of this information to emit fast code for the ones that aren't.

Regardless of what we pick, it seems obvious to me that we also need better abstractions for dealing with sparsity directly. I can imagine cases in user code where compact sparse instances are desired, but it's still possible to do local operations on dense blocks of data (e.g., matrix multiply operations). This can't happen in the design above with such a hard split between centered and uncentered accesses.

The basic idea here is to add an intermediate object that users can explicitly iterate over. I've got two ideas here, with different pros and cons:

An example code might look like the following. (Note that this is code written by the user, not compiler generated.)

for t in r.tiles do
  for x in t do
    r[x].f = ... -- centered
    r[x+1].f = ... -- also centered, note that user is responsible for boundary conditions
    r[0].f = ... -- uncentered, falls back to slow path
  end
end

This can be rewritten to use domains rather than tiles if we go that route. (Name subject to change.)

You can see this has a pretty close correspondence with what I'm planning to generate in the compiler, which is another plus.

One implementation detail is the type of r.tiles (or domains): I think this will be an "iterator" (perhaps literally iterator(rectNd(r)) or similar) because I don't think we currently want to allow users to do anything other than iterate on it.
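As a rough C++ analogy for what an iterate-only type could look like (everything here is hypothetical, not existing Regent or Legion API):

#include <cstdint>
#include <utility>
#include <vector>

// A 1-D rectangle; higher dimensions would carry one lo/hi pair per dimension.
struct Rect1 {
  int64_t lo, hi; // inclusive bounds
};

// An iterate-only view over the tiles of a compact instance: it exposes
// begin()/end() so it works in a range-for, and nothing else.
class TileRange {
public:
  explicit TileRange(std::vector<Rect1> tiles) : tiles_(std::move(tiles)) {}
  std::vector<Rect1>::const_iterator begin() const { return tiles_.begin(); }
  std::vector<Rect1>::const_iterator end() const { return tiles_.end(); }
private:
  std::vector<Rect1> tiles_;
};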

I'm happy to take any feedback, but I'm especially looking for comments on proposals (1)-(3) above and on the tiles vs. domains abstractions, particularly from interested potential users of this feature, on whether this works for their use cases.

elliottslaughter commented 3 years ago

Some other plausible scenarios that might come up (users in particular, please comment if you think you might see these in your code):

The bigger question exposed by these is whether there is a larger abstraction that is (a) general enough to handle what users care about, (b) predictable enough that users can use it with confidence, knowing that it will do what they want, and (c) reasonable to implement in the compiler. Currently the space of things that might plausibly be supported is large enough that it's a bit of a challenge to figure out what meets these three requirements.

streichler commented 3 years ago

I'm definitely a fan of having the dense/compact/maybe-other-stuff-in-the-future choice be a per-region thing (i.e. option 3).

I'm not wild about exposing "domains" to the application programmer because they don't really mean anything in the Regent semantics. The tiles iterator makes sense because it allows you to give chunks of work to some ffi'd thing that only works on rectangles (e.g. all BLAS libraries ever), and I think we could/should just make the tiles iterator do all the tiles in a chunk before going on to the next chunk.

For folks that want to manually optimize affine access for uncentered references in their own code, my preference is for something like what the old get_strides stuff did - in addition to the base pointer and strides you get the rectangle for which those strides are valid and it's on the application programmer to check them and handle boundaries.
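A small C++ sketch of that kind of query, purely as illustration (AffinePiece and the helpers below paraphrase the idea of the old get_strides interface rather than reproducing it):

#include <cstddef>
#include <cstdint>

// What a get_strides-style query could hand back: a base pointer, a stride,
// and the inclusive index range for which that affine description is valid.
struct AffinePiece {
  uint8_t *base;    // assumed pre-offset so an absolute index can be used,
                    // matching the *(base + x * stride) form in the pseudocode below
  ptrdiff_t stride;
  int64_t valid_lo, valid_hi;
};

// It is then on the application programmer to check the bounds before taking
// the fast affine path, and to handle anything outside them separately.
inline bool in_valid_range(const AffinePiece &p, int64_t x) {
  return x >= p.valid_lo && x <= p.valid_hi;
}

inline double *fast_ptr(const AffinePiece &p, int64_t x) {
  // Caller must have checked in_valid_range(p, x) first.
  return reinterpret_cast<double *>(p.base + x * p.stride);
}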

Coming back to the original example where Regent does most of the work, I do think something like this:

for x in interior do
  r[x].df = r[x+1].f - r[x-1].f
end

should be able to automatically turn into something like this:

for domain in interior.domains do
  { df_base, df_strides } = get_strides(df, domain)
  { f_base, f_strides } = get_strides(f, domain)
  for t in domain.tiles do
    safe_x_lo = max(domain.x.lo + 1, t.x.lo)
    safe_x_hi = min(domain.x.hi - 1, t.x.hi)
    -- fast interior loop
    for x in [safe_x_lo, safe_x_hi] do
      *(df_base + x * df_strides) = *(f_base + (x + 1) * f_strides) - *(f_base + (x - 1) * f_strides)
    end
    -- slower low edge, if needed
    if t.x.lo < safe_x_lo then
      for x in [t.x.lo, safe_x_lo - 1] do
        *(df_base + x * df_strides) = r[x + 1].f - r[x - 1].f  -- the x+1 is technically safe here, but this is the slow path
      end
    end
    -- same for the safe_x_hi side
  end
end

Even if this only works for bounded-integer offsets (see the Halide paper for examples where supporting dynamic offsets with a static bound is useful), I think it's an important optimization that our story should permit even if we don't implement it right away.

rupanshusoi commented 3 years ago

I agree with Sean that fine-grained control over which regions are sparse would be great. (3) sounds ideal, but if it's not going to be implemented right away, then (2) seems to be the next best option.

If, in the domain case, you are going to group all valid tiles into a single domain, then it might be a better abstraction than tiles, because it would let me write for pt in r.domains (or similar), which would use compact sparse instances and would be pretty much identical, at least from a user's perspective, to for pt in r (what we do now).

I think the pattern of iterating over a different index space may be quite common for tasks that take multiple region arguments:

for color in r.colors do
  foo(r[color], s[color])
end

My codebase also has instances of the 4th bullet -- accessing different fields of the same point.

mariodirenzo commented 3 years ago

Elliott, thanks for drafting this design. It looks good from the point of view of the potential implementation in HTR. It would most likely exercise some of the features that you have described:

lightsighter commented 3 years ago

I think there is another case here that is missing and comes up in practice with some applications like Pennant:

for x in r do
  r2[fn(x)].f = g(x)
end

where r is "dense" but r2 is sparse, which is different from the cases above because the index space of r2 is not the one being iterated by the loop. In many cases, Regent can still make a fast AffineAccessor for r2. In other cases, it will need to make a MultiAffineAccessor for doing lookups into which piece of r2 is pointed to by fn(x). It would be bad to just have everything fall back to the MultiAffineAccessor. You should be able to do all these tests before the loop so that you can amortize the cost over the loop if you don't want to make the user deal with it, although that will mean that Regent generates twice as much code as is probably necessary.
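One way to read the "do the tests before the loop" suggestion, as a hedged C++ sketch (AffineView, the optional single-piece result, and slow_lookup are all hypothetical stand-ins; slow_lookup plays the MultiAffineAccessor role):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

struct AffineView {
  uint8_t *base;    // base pointer valid for all of r2
  ptrdiff_t stride; // bytes between consecutive elements
};

// The affine-or-not test happens once, before the loop, and the loop body is
// duplicated for the two outcomes -- the "twice as much code" cost mentioned
// above. `view` is the result of the single-piece test; `targets` stands in
// for fn(x) evaluated for each x in r.
void scatter_into_r2(const std::vector<int64_t> &targets,
                     std::optional<AffineView> view,
                     const std::function<double *(int64_t)> &slow_lookup) {
  if (view) {
    for (int64_t t : targets)  // fast version: plain affine arithmetic
      *reinterpret_cast<double *>(view->base + t * view->stride) = 0.0;
  } else {
    for (int64_t t : targets)  // slow version: look up the right piece per element
      *slow_lookup(t) = 0.0;
  }
}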

lightsighter commented 3 years ago

Another optimization that would be good to handle is the case where we do something like:

for x in r do
  r[x].f = ... -- centered
end

where all accesses are "centered". It would be good to generate code that uses the span iterator instead of iterating tiles so that we can handle the case where the source index space is very sparse and contains lots of very small tiles.

elliottslaughter commented 3 years ago

@rupanshusoi and @mariodirenzo, it would help to get some very, very simple code samples (just a couple lines, like what we've been sharing up above) to describe what you're planning on doing. In particular, I'd like to come to an understanding of exactly how complicated the actual patterns you'll want to be running are.

In particular, a naive r[x+1] or similar just isn't going to work in general, because it doesn't take into account the boundary conditions. (It would be easy to do an out of bounds access if you're not careful.) So I'm thinking that HTR's case will need something more complicated, like:

for t in interior.tiles do
  for x in t do
    r[x].f = r[x+1].g + ... -- stencil access pattern is safe because `t` is a shrunk subset of a tile of `r`
  end
end

In particular, interior is r shrunk by the width of the stencil, so that the interior accesses are safe. (There would presumably have to be a separate exterior loop, but I'm not showing that here.) This is just more complicated for the compiler to handle because it's not immediately obvious that there's any relationship between interior and r. That's why I'd like to understand exactly what patterns are required.

elliottslaughter commented 3 years ago

@mariodirenzo Please clarify whether the use case described in #1057 is the motivation for requesting this feature, or whether you have any other use cases in addition to that. Thanks.

mariodirenzo commented 3 years ago

Please clarify whether the use case described in #1057 is the motivation for requesting this feature, or whether you have any other use cases in addition to that. Thanks.

I also have another use case: a periodic domain whose stencil calculations wrap around its bounds. That said, I think #1057 could address both use cases.