StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
675 stars 145 forks source link

Decoupled Index Space Names and Bounds #707

Closed lightsighter closed 4 years ago

lightsighter commented 4 years ago

In order to deal with tasks that produce unknown sized output, it would be nice if we could create the names for index spaces and logical regions without having to specify their size up front and then later bind the size as part of a task when it is mapped either inline or as a sub-task. The idea would be that index spaces and their associate regions do not actually need bounds until they are mapped for the first time. Doing this could potentially create cycles and therefore is not something that we should take lightly. Assigning @alexaiken to think through an initial design.

magnatelee commented 4 years ago

To make the discussion concrete, let me describe a typical scenario that could be greatly simplified with deferred binding of bounds to regions:

Launching Legion tasks that potentially produce outputs of different sizes involves the following steps:

  1. Computing (or estimating the upper bound of) the size of the output for each parallel task. We often need to launch another set of tasks to do this counting if the counting needs to access data. (e.g., filtering values with a predicate)
  2. Determining bounds of output subregions. This requires a prefix sum of the result from (1)
  3. Creating an output partition using the bounds from (2)
  4. Launching computation tasks in parallel using the partition from (3)

The first three steps are just to set up the output partition and the amount of computation involved is often very light. Furthermore, there can be duplication of work in steps (1) and (4) because counting in (3) often cannot be done without performing some or even all computation in (4).

With proper runtime support, steps (1)-(4) can be simplified to the following two:

a. Launching computation tasks that take subregions with unspecified bounds. Each task allocates elements incrementally in its output subregion. When the tasks are done, each subregion is only locally consistent, i.e., has 0 for its lower bound. b. Shuffling subregion bounds across all tasks to recover global indexing. This is a prefix sum as in step (2) but performed inside the runtime.

This simpler strategy requires more than just deferred binding of bounds. Here are relevant runtime mechanisms:

i. Deferred binding of bounds to index spaces (this issue) ii. Regions that have index spaces with unspecified bounds. These regions are mapped to a special kind of Realm instances in (iii) and provide allocators (similar to those that we used to have for unstructured regions a long time ago). iii. Realm instances providing malloc-like allocators. Region allocators in (ii) will be backed by these Realm allocators. iv. Pending partitions of index spaces having unspecified bounds. Subspaces of this pending partition also have unspecified bounds. v. A prefix sum or something to that effect to recover global indexing as in (b). This is necessary only when there are parallel tasks producing outputs with the pending partition in (iv).

One of the things to be discussed is when we should recover global indexing (step (b)). One way would be to simply let the user initiate the recovery. Another option would be to couple it with the task life cycle, for which we need these simplifying assumptions:

Obviously, I left out a lot of details here, so any comments or questions are welcome.

lightsighter commented 4 years ago

@streichler Here is the problem I was thinking of:

t2 maps out of order with respect to t1 (no region or future dependences), t2 grabs resources t1 needs to run, deadlock. I don't think we want to do dependence analysis on all index spaces and subspaces needed by any given task to look for implicit dependences on prior operation futures, so we need a better solution.

lightsighter commented 4 years ago

Ok, the plan is to implement this and enforce mapping dependences similar to how we do it for dependent partitioning operators today. All operations that come after the operation that produces the future for the index space shape will record a mapping dependent on that prior operation. This is sound but imprecise logical analysis. Physical analysis though will still be precise and will elide any unnecessary dependences.

lightsighter commented 4 years ago

Adding a second aspect to this: create fields from futures that describe the sizes of the fields.

lightsighter commented 4 years ago

We finished the initial work on deferred index space creation from futures, but did not get to support for deferred field creation from futures. Moving this to the 20.06 milestone.

lightsighter commented 4 years ago

I believe that both index space allocation as well as field allocation from future values are now supported in both the master and control_replication branches. Assigning to @magnatelee to confirm that they are working as expected.

magnatelee commented 4 years ago

Index space allocation from future values has been working as I expected. I actually don't have a use case for field allocation from future values at the moment, but a unit test seems to be working fine. I'll close this issue and will open a new one if I find any bug in either of them.