StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Tracking Dependences on Non-Region Resources #915

Open ipdemes opened 4 years ago

ipdemes commented 4 years ago

We would like to be able to set a dependence between a task that creates a partition and a projection functor for a subsequent task that selects (a subspace of) it. @lightsighter: could you please add here the three different options you proposed during the meeting we had today?

lightsighter commented 4 years ago

Legion doesn't track dependences on the creation and mutation of resources like index space trees and field spaces. This can create problems with deferred execution when applications make a new resource like a partition in one sub-task and expect it to exist in another sub-task with no other kinds of dependences (futures, regions, execution fences) between them.
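To make the failure mode concrete, here is a minimal sketch of the pattern being described, using the existing C++ API; the task IDs and names are hypothetical, and the point is only that nothing connects the two child launches:

```cpp
// Hypothetical illustration of the problem above; MAKE_PARTITION_TASK_ID and
// USE_PARTITION_TASK_ID are made-up task IDs.
#include "legion.h"
using namespace Legion;

void parent_task(const Task *task, const std::vector<PhysicalRegion> &regions,
                 Context ctx, Runtime *runtime)
{
  IndexSpace is = runtime->create_index_space(ctx, Rect<1>(0, 1023));

  // Child A creates a partition of 'is' (e.g. via create_equal_partition).
  TaskLauncher make(MAKE_PARTITION_TASK_ID, TaskArgument(&is, sizeof(is)));
  runtime->execute_task(ctx, make);

  // Child B expects to find that partition by color, e.g. via
  // runtime->get_index_partition(is, color). There is no future, region, or
  // fence dependence between the two launches, so under deferred execution
  // child B's lookup may run before child A has created the partition.
  TaskLauncher use(USE_PARTITION_TASK_ID, TaskArgument(&is, sizeof(is)));
  runtime->execute_task(ctx, use);
}
```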

There are three potential solutions that I can think of at the moment:

  1. We can design a whole system of allocation/mutation privileges for index space trees and field spaces to ensure such dependences are captured by the runtime system. I'm not sure of what the semantics of this system would be, but I have a few rough ideas. It would probably require a formal treatment before I would be comfortable implementing it.
  2. We could resurrect the old index space and field space requirements that we used to have on task launches and require applications to make all their dependences on resources like index spaces and field spaces explicit to the runtime.
  3. We could modify the runtime calls that get access to things like partitions through colors to take an optional parameter to block waiting for the result until the application task that produces them actually runs (see the sketch after this list). There are two risks to this approach: first, if the application makes a mistake and never actually produces the resource, the application will hang, and second, there is a risk of resource deadlock where the task waiting for the resource is consuming resources that the producer task needs to run.
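To illustrate the shape of option (3) at a call site, a purely hypothetical sketch is below; the blocking parameter does not exist in the current API and is shown only to make the proposal concrete:

```cpp
// Purely hypothetical: get_index_partition() does not currently take a
// blocking flag; this only sketches what option (3) might look like.
IndexPartition ip =
    runtime->get_index_partition(parent_is, color, /*block_on_producer=*/true);
// If the sub-task that creates this partition has not executed yet, the call
// would block until it does. The risks noted above apply: a hang if the
// partition is never created, and resource deadlock if the waiting task holds
// resources the producer needs in order to run.
```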

I'm not going to bias the discussion with my personal preference yet. I want to hear what others think.

@ipdemes to comment on FleCSI
@syamajala to comment on S3D since he's complained about this before
@magnatelee to comment on Legate use cases which he has also complained about
@elliottslaughter to comment on which approaches are tractable for Regent

elliottslaughter commented 4 years ago

I'm not sure how (1) and (2) are different.

Regent used to do (2) (and may still? I'm not sure), so that's definitely tractable. It's not pretty but is at least consistent with the rest of the model. And most users can probably forget about it since usually these will be dominated by region requirements.

lightsighter commented 4 years ago

(1) and (2) are differentiated by whether they behave more like region privileges or like futures with our current data dependences. So (1) would work more like region privileges where you would have to declare privileges on parts of index spaces trees or on field spaces to be able to modify them as well as to read them. Those privileges would of course apply transitively down the trees like with regions. With (2), things would work more like futures where you would have to pass explicit names for all the kinds of dependences that you would want to track. There would be no "transitive analysis" performed by the runtime and everything would be explicitly done with names. That is a little hand-wavy because I don't have clear semantics in my head for them, but that should give you a feel for the semantic difference between the approaches.

> Regent used to do (2) (and may still? I'm not sure)

Pretty sure that it does, since I tried removing them at one point and all kinds of things broke.

> And most users can probably forget about it since usually these will be dominated by region requirements.

I agree this is mostly true, but clearly people are creating examples where it is not true anymore.

ipdemes commented 4 years ago

(3) would be the easiest for us to adopt, but we are fine with the other two options as well.

lightsighter commented 4 years ago

The risk of resource deadlock that can occur with (3) is pretty concerning to me, especially given all the problems that @magnatelee is trying to fix with it right now. I want to make sure we don't introduce additional cases that can contribute to resource deadlock. I'm open to the group saying that they are willing to deal with it, but I think it would have to be pretty much unanimous for me to accept (3).

opensdh commented 4 years ago

Since we're interested in "resizing" partitions by destroying and recreating them, we also want to establish a subsequent dependency between using a partition (i.e., an ordinary task operating on one of its fields) and destroying it—and reusing its color. The former may be automatic in that a call to destroy_index_partition will merely queue the deletion until no task is using it, but that then opens the possibility for the latter conflict. Using AUTO_GENERATE_ID doesn't seem to work to avoid that: since we would use it on every shard concurrently, we would need to use a field to store the colors chosen, and that can't be given to the projection functor. (We could try to use a global variable instead, but aside from the new synchronization problems that would introduce I don't think we can control on which shard each call to the projection functor is made.)

elliottslaughter commented 4 years ago

@opensdh Just FYI, you can attach "projection args" to region requirements, which could e.g. contain colors. So no global variables are necessary to pass colors into projection functors. Personally, that would be the approach I would use rather than attempting to destroy and re-create the same color over and over.

https://github.com/StanfordLegion/legion/blob/master/runtime/legion.h#L1048
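For readers following along, a rough sketch of the suggestion is below. It assumes the projection-args hook on RegionRequirement that the link above points at (set_projection_args); the task ID, projection functor ID, and field ID are hypothetical, and exact names should be checked against legion.h:

```cpp
// Sketch: attach a color to a region requirement as projection args so the
// projection functor can select the right partition. WORKER_TASK_ID,
// MY_PROJECTION_ID, and FID_DATA are made-up names; set_projection_args is
// assumed to be the hook referenced by the legion.h link above.
Color color = partition_color;   // assumed: color of the partition for this launch
IndexTaskLauncher launcher(WORKER_TASK_ID, launch_domain,
                           TaskArgument(), ArgumentMap());
RegionRequirement req(upper_partition, MY_PROJECTION_ID,
                      READ_WRITE, EXCLUSIVE, parent_region);
req.add_field(FID_DATA);
req.set_projection_args(&color, sizeof(color));   // assumed API
launcher.add_region_requirement(req);
runtime->execute_index_space(ctx, launcher);
// Inside the registered projection functor, the project() overload that
// receives these args can read the color and call, e.g.,
// runtime->get_logical_partition_by_color(...) on its upper bound.
```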

opensdh commented 4 years ago

> Just FYI, you can attach "projection args" to region requirements, which could e.g. contain colors.

Thanks, but how would that allow passing a different such argument to each invocation of the projection functor?

> Personally, that would be the approach I would use rather than attempting to destroy and re-create the same color over and over.

Or are you suggesting that we would neither reuse colors nor use AUTO_GENERATE_ID by having every shard's subpartition color advanced through some predefined sequence in lockstep?

elliottslaughter commented 4 years ago

Using the method I linked above, you can attach a different set of args to each region requirement. That means every time you execute a new task or index launch, you can potentially include a new set of args. So if you are e.g. running a time step loop and you use a different color partition on every iteration through the loop, the iteration at timestep t can include the color for the partition at timestep t in the projection args for the region requirement.

Does that make sense?
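In code, that loop might look roughly like the sketch below (a sketch only; STEP_TASK_ID, FID_U, and the colors array are hypothetical, and it relies on the projection-args hook referenced earlier):

```cpp
// Sketch: use a different partition color each timestep by attaching it as
// projection args to the region requirement of that iteration's launch.
for (int t = 0; t < num_steps; t++) {
  Color c = colors[t];   // color of the partition created for timestep t
  IndexTaskLauncher step(STEP_TASK_ID, launch_domain,
                         TaskArgument(&t, sizeof(t)), ArgumentMap());
  RegionRequirement req(upper_partition, MY_PROJECTION_ID,
                        READ_WRITE, EXCLUSIVE, parent_region);
  req.add_field(FID_U);
  req.set_projection_args(&c, sizeof(c));   // assumed projection-args API
  step.add_region_requirement(req);
  runtime->execute_index_space(ctx, step);
}
```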

opensdh commented 4 years ago

Of course each launch (and each requirement for that launch) can have a different argument. The problem here is that, without any control over the per-shard colors chosen by AUTO_GENERATE_ID, we would need to provide an argument per index point of the launch. We could of course pass one large argument to all the copies that contained all the colors they need, letting each invocation select the color appropriate to its index point, but that puts us back at the point of having to do an all-gather (with its N² total workload) every time the layout changes (as if we were using create_partition_by_domain with no projection functor at all).

elliottslaughter commented 4 years ago

I guess I'm confused. DCR requires every shard to do the same work (i.e., do the same partition calls, with the same values). If you're not doing this, you're not doing valid DCR and your program will already be broken.

But if your program is valid for DCR, then that means every shard has all the colors. Now admittedly, AUTO_GENERATE_ID may choose different colors for every child partition. I realize that may inflate the size of the argument that must be attached. But it's not a matter of communicating that information, just a matter of allocating and storing it locally. (Each shard will presumably use its local copy of the projection args and not require communication.)

But anyway, if you want to ensure an O(1) size for projection args then yes, you will need a scheme for choosing colors so that they are all the same. An incrementing counter ought to do it. You may get lucky with AUTO_GENERATE_ID too, but that will be inherently more brittle so it's probably best not to rely on it.
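A minimal sketch of that counter scheme (names hypothetical): every shard advances the same deterministic counter and passes it as the explicit color, so the colors match across shards without relying on AUTO_GENERATE_ID.

```cpp
// Sketch: deterministic colors via a counter that every shard advances in
// lockstep; parent_is and color_space are assumed to exist already.
Color next_color = 0;                 // identical starting value on all shards
for (int step = 0; step < num_resizes; step++) {
  Color c = next_color++;             // same value on every shard at this step
  IndexPartition ip =
      runtime->create_equal_partition(ctx, parent_is, color_space,
                                      /*granularity=*/1, /*color=*/c);
  // ... launch the tasks that use 'ip', passing 'c' in projection args ...
  // ... destroy the old partition once it is no longer needed ...
}
```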

opensdh commented 4 years ago

The various calls to create the subpartitions were made in a prior index launch, so they had the means to differ despite DCR rules—and the (replicable task) caller had no way of learning the colors except via examining a field later (return values in our case being reserved for the user). (I also note that the incrementing counter would require its own barriers to make sure the correct value was read by potentially concurrent projection functor calls.)

elliottslaughter commented 4 years ago

When you say that you are creating partitions inside an index launch, does that mean that each point task within the launch creates at most one partition (with a different subregion as the parent), or are you potentially creating more than one partition per point task?

One option is to have the main task choose the colors in advance and pass them down to the child task. Even if the child task doesn't necessarily allocate a partition in all cases, that's not necessarily a problem as long as there is at most one partition created per child task. That would both keep the colors in sync and also ensure that main knows what those colors are.
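A hedged sketch of that option (task IDs and helper names are hypothetical): the main task picks the colors, hands one to each point task through the argument map, and each point task uses the color it was given when creating its single partition.

```cpp
// Sketch: main chooses the colors up front, so it always knows them.
// MAKE_SUBPARTITION_TASK_ID is a made-up task ID; a single common color passed
// in the global TaskArgument would also work, since each point task partitions
// a different parent subspace.
std::vector<Color> colors(num_points);
for (unsigned i = 0; i < num_points; i++)
  colors[i] = i;                               // any deterministic choice works
ArgumentMap per_point;
for (unsigned i = 0; i < num_points; i++)
  per_point.set_point(DomainPoint(Point<1>(i)),
                      TaskArgument(&colors[i], sizeof(Color)));
IndexTaskLauncher make(MAKE_SUBPARTITION_TASK_ID, launch_domain,
                       TaskArgument(), per_point);
runtime->execute_index_space(ctx, make);

// In each point task:
//   Color c = *static_cast<const Color *>(task->local_args);
//   runtime->create_equal_partition(ctx, my_subspace, my_color_space, 1, c);
```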

Otherwise perhaps you should describe more about what you're doing since I'm still not fully sure I understand.

opensdh commented 4 years ago

Yes, each point task creates exactly one partition, in a different subspace from any other point task, which itself has but one subspace. Certainly we could supply the common color to use everywhere as a task argument (rather than having the task read a global), and that could reduce the number of possible synchronization issues. I don't morally like having a forever-incrementing counter because it's intrinsically incorrect (even if in a fashion that has no practical importance); if I can bound the number of concurrently-used colors, it shouldn't be hard to limit it to 1…

lightsighter commented 4 years ago

> I don't morally like having a forever-incrementing counter because it's intrinsically incorrect (even if in a fashion that has no practical importance)

A 5 GHz processor incrementing a counter once every cycle will take decades to count to 2^64 (or 2^63 if you're using signed integers). If you're running tasks in between the creation of these partitions, it should take you a few hundred years to overflow that counter, and if you're using an unsigned type, then everything will still be well defined. The amount of memory you would need to keep around O(2^64) tasks in Legion is probably several exabytes of data, so you'll run out of memory before you hit any problems with wrap-around.
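(For reference, the arithmetic behind that claim: 2^64 increments at 5×10^9 increments per second is about 3.7×10^9 seconds, roughly 117 years, and that is the best case with no task execution in between.)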

opensdh commented 4 years ago

> A 5 GHz processor incrementing a counter once every cycle will take decades to count to 2^64 (or 2^63 if you're using signed integers).

Indeed; I didn't provide such a calculation because I didn't think "has no practical importance" needed one.

> if you're using an unsigned type, then everything will still be well defined.

That's not actually true: if (in this fantastic scenario where the machine runs for decades) it wraps around, then without the dependences in question in this issue, there's no guarantee that Legion ever got around to releasing, say, color #5 even if it freed all the others created since (to avoid the EB of RAM problem).

Obviously that's not a "real problem" either, but it's unfortunate to have to prune the space of considerations of the careful programmer (without losing track of rather more plausible issues like reading the counter on the wrong side of the increment).

lightsighter commented 2 years ago

We have a partial solution to this after a discussion in the Legion meeting. In the case of queries from the application, the runtime can insert "upward facing" execution fences into the stream of operations to wait until all outstanding execution has finished and ensure that the runtime isn't observing an invalid state of the index/region trees.
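As an application-level analogue of what that would do, a task can already guard such a query with an explicit execution fence; a minimal sketch is below (issue_execution_fence is the existing API, though its return type has differed across Legion versions, and parent_is/color are assumed names):

```cpp
// Application-level analogue of the "upward facing" fence described above:
// wait for all outstanding execution before observing the index tree.
Future fence = runtime->issue_execution_fence(ctx);
fence.wait();   // block this task until all prior operations have executed
// It is now safe to observe resources produced by earlier sub-tasks:
IndexPartition ip = runtime->get_index_partition(parent_is, color);
```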

There are still two outstanding problems after adopting this approach:

  1. We need to make sure any deletion effects have propagated through the tree to ensure that we can't observe any parts of the index/region trees that might be deleted before we reach our point in program execution. There's no obvious way to guarantee that all deletions have propagated up the region tree other than introducing full execution fences and waiting for them, which seems excessively expensive.
  2. We need to figure out what to do about queries that originate outside the application, specifically this includes things like projection functors and mappers. Since these queries occur outside the application, there's no obvious way to insert synchronization into the stream of operations in a way consistent with sequential execution, so we need to think about how to resolve such cases.