StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
686 stars 145 forks source link

Support for Realm Graph Generation Operations in Legion #865

Open senanayake-reservoir opened 4 years ago

senanayake-reservoir commented 4 years ago

I would like to supply static dependences (based on the commit from @lightsighter) for all tasks and ideally place all execute_task calls out of order and specify incorrect permissions in the RegionRequirements (to make automatically generating Legion code simpler).

I was hoping that wrapping all my task launches in begin_static_trace/end_static_trace and setting TaskLauncher.static_dependences would accomplish this. I tried modifying the privileges and custom_mapper tutorials to work like this, but based on my runs I am not convinced that dependences from region requirements are being ignored so am afraid I might not be following the api correctly.

Using legion spy, it appears that the dataflow graph is being correctly constructed, but the event graph is incorrect and dependences seem to only be set by region requirements. I am currently using legion-18.09.0, but can try a newer version if something relevant has changed. Any advice is appreciated, thank you.

lightsighter commented 4 years ago

Without knowing more details, I think the crux of this issue will come down to this statement:

ideally place all execute_task calls out of order and specify incorrect permissions in the RegionRequirements (to make automatically generating Legion code simpler)

Legion doesn't know that the operations you are specifying are "out of order", it just sees the order that you specify them dynamically via calls into the runtime and assumes that those operations define the program order. All the tracing interfaces including the specification of static dependences inside of a trace are still designed around the concept of having this program order. The static tracing interface just lets you specify dependences between operations in that program order.

There isn't a way today to specify operations that are "out-of-order" to the runtime. I think it might be helpful to get some more information about what you're trying to do so that we can see if there is a way to accomplish what you're trying to do.

senanayake-reservoir commented 4 years ago

Thanks for the response, I can provide more context on what I am trying to accomplish.

I have a representation that already includes all statically-determined dependencies between tasks. I would like to generate legion code from this representation with a primary goal of reducing any overhead from the runtime needing to redetermine this dependency graph.

There is also a less important secondary goal of not requiring our compiler to determine the privileges required for each datablock that will yield the same dependency graph as this process can sometimes introduce additional false dependencies.

I am less familiar with Regent, but this seems like a functionality that it might also benefit from leveraging. We can continue to provide the operations in-order as required, but I also appreciate any advice on how to best provide Legion with this statically-determined information (and what type of benefit we might see from doing so).

lightsighter commented 4 years ago

We can continue to provide the operations in-order as required

I think this is probably going to continue to be a requirement for the time being. The runtime just doesn't have notion of "out-of-order" creation of tasks from the client program at the moment. It knows how to execute things out-of-order, but not to get them out-of-order. If we were going to do such a thing we would need a more absolute way of naming tasks (and other operations) so you can refer to things without the concept of relative ordering (e.g. like the relative dependence specifications for static tracing today). Unfortunately there's just no facilities for naming tasks like that in Legion today.

There is also a less important secondary goal of not requiring our compiler to determine the privileges required for each datablock that will yield the same dependency graph

I have a couple of questions as we have discussed a different Legion feature that might be pertinent to what you're trying to do. How sophisticated is your representation of the program? Specifically does it represent individual point tasks for each processor, or is it machine agnostic and thinks about programs like index space task launches abstractly without the index space being specified? If the former, have you made mapping decisions and know where each task is going to run, or do you leave that floating for a dynamic mapper that you've written? Lastly, do you have the full program represented statically, or just pieces of it that you know at compile time?

I am less familiar with Regent, but this seems like a functionality that it might also benefit from

Regent tends to use the dynamic tracing facilities in Legion as opposed to the fully static ones at the moment. I'm pretty sure we though that Regent has a strong notion of program order from what I know of it's current static analyses. I think it's transformations tend to keep things in program order. @elliottslaughter would have to say for sure.

senanayake-reservoir commented 4 years ago

The runtime just doesn't have notion of "out-of-order" creation of tasks from the client program at the moment.

This is something that we can easily do. I said out-of-order because I was hoping that we could fully specify the dependency graph rather than the runtime incurring the overhead to dynamically analyze the program order (and therefore the program order is unimportant for the behavior of the program). I see now that in Legion the program order is important for other things than just the dependencies, sorry for the confusion.

Specifically does it represent individual point tasks for each processor, or is it machine agnostic and thinks about programs like index space task launches abstractly without the index space being specified?

We start with a polyhedral representation of the individual operations, but then perform transformations so that we are left with a representation that has a DAG of task launches where each task launch has a corresponding index space, many operations, and datablocks. For shared memory task-based runtimes, we also allow the use of atomic operations to synchronize the launching of tasks with multiple dependencies (which avoids having a single main task, which potentially becomes a sequential bottleneck). For Legion, we currently constrain this DAG to a tree due to the region permissioning and use a main task, but would also be interested if the legion team has any thoughts on how to specify legion programs where the main task is a bottleneck.

If the former, have you made mapping decisions and know where each task is going to run, or do you leave that floating for a dynamic mapper that you've written?

We would like to allow the user or an autotuner to specify mapping decisions, but we plan to do this through Legion's mapper interface rather than doing all the mapping statically. We have not yet implemented this mapper so are open to other suggestions as well.

Lastly, do you have the full program represented statically, or just pieces of it that you know at compile time?

We represent all parts of the program that the user has specified is computationally expensive and therefore should be mapped through legion. There may be some external functions and glue code that is not represented.

More details about our approach can be found in our HPEC 19 paper: https://www.reservoir.com/wp-content/uploads/2019/09/Auto-Parallelization-to-Asynchronous-Task-Based-Runtimes.pdf

elliottslaughter commented 4 years ago

So reading through this thread, I really wonder if you should be using Realm directly. Realm is lower-level in some ways vs. what you need (i.e. you'll have to do the mapping yourself) but in other ways it looks like a good fit. You can specify the DAG of tasks directly, and data blocks do not need to be exactly tree-shaped. It will also be lower-overhead than Legion.

senanayake-reservoir commented 4 years ago

We looked at targeting Realm directly and agree with your assessment that it may be a better fit, but we need to target Legion so that we can integrate with other Legion code and so that the generated code can be understood and modified by those familiar with writing Legion code. This is why tracing allowing the selective specification of dependencies within a set of operations was of particular interest.

lightsighter commented 4 years ago

My questions were trying to figure out what the best approach was for you guys to follow. I think there are really three options, two of which you could follow today as-is and one of which would require a new feature in Legion that we have discussed but haven't implemented yet. Which one you pick I think is a function of how you want to think about expressing programs and mapping them. Here are my thoughts on the three approaches:

  1. You can continue doing everything in Legion. I think this is good if you're just doing basic parallelization of individual loop nests into index space launches (e.g. like Regent's auto-parallelizer) and you know either none or some of the dependences between loop nests, or alternatively there is "glue" code you don't want to analyze between the loop nests. Finally, this option will allow you to experiment with an auto-tuning mapper.

  2. As @elliottslaughter suggested, you could also imagine targeting Realm directly. This is probably the best option if you are doing whole program optimization and you can analyze literally all the dependences in the program, so there really is no need for any dependence analysis to be done by Legion. You can probably build something analogous to a "static mapping interface" to control the placement onto real machines and explore different mapping strategies for program representations without too much trouble.

  3. Lastly, as a hybrid approach, we have discussed at various points adding a feature to Legion where you could launch an operation that simply created its own graph of Realm operations to be stitched into the Realm graph of operations being generated by Legion. This would allow you to control the creation of scheduling and mapping at the Realm level for loop nests that your polyhedral compiler can analyze, and then let Legion deal with all the glue code and things that can't be statically analyzed. We don't have support at this for the moment, but we know some parts of how it would work and have just been waiting for a driving example to motivate the remaining design and implementation.

Let us know which of those three paths you might be interested in pursuing and we can help guide you along.

senanayake-reservoir commented 4 years ago

Thank you we really appreciate your help, this list was very helpful! After discussing with the team, we are indeed very interested in option 3 and believe that we could effectively target it, once support is added in Legion. We would be happy to discuss further with you to try to share thoughts on possible driving examples, if that would be helpful for you.

We will plan on continuing with option 1 until such support is added. Within option 1, do you have any suggestions for how we can reduce runtime overhead by utilizing some of the analysis we have already done?

lightsighter commented 4 years ago

Within option 1, do you have any suggestions for how we can reduce runtime overhead by utilizing some of the analysis we have already done?

I'm not sure that there's anything you can do with option 1 at the moment due to our limited API for expressing dependences, but one thing I can think of is that we can add support for our dynamic event graph capture to work with static tracing of the logical dependence analysis as well so that we can capture the resulting Realm event graphs and replay them more efficiently. I think this should be straight forward, but let's get @magnatelee's opinion as well. This should give you a considerable performance improvement whenever you are strong scaling with Legion and your computations are executing repeatedly.

After discussing with the team, we are indeed very interested in option 3 and believe that we could effectively target it, once support is added in Legion.

With your permission I'm going to rename this issue and turn it into the feature request for adding support for these kinds of "Realm operations" to Legion's API. The rough interface for these operations is that they will look a little bit like task launches in that they will pass region requirements to Legion for the data that they are going to access and their regions will be mapped just like a task and we'll invoke some kind of function that you register with Legion to construct the Realm graph. However, unlike a task, Legion will give you access to the Realm events corresponding to when the input data is ready, and then you will hand back events for when the output data is ready. In preparation for this, it might be good for you guys to start looking at the Realm API and figuring out how to have your compiler generate a program that will construct a Realm graph. If you like I can provide pointers for how to get started with this. You should also know that you're not the first group to go down this path. There are groups at Amazon and NVIDIA that are both targeting Realm directly now as well.

magnatelee commented 4 years ago

I also think that the dynamic event graph capture can be made working with static tracing pretty easily.

senanayake-reservoir commented 4 years ago

Ok thank you for this information, we will continue looking more at the Realm API. I believe we have a fairly straightforward path towards embedding a Realm graph within a legion program.

I agree with modifying the issue and also very much appreciate any updates (on both "Realm operations" and the modified dynamic event graph capture) that can be provided along the way so that we have a chance to plan accordingly as both of these features are of interest.

lightsighter commented 4 years ago

I've consolidated the tracing APIs behind the single begin_trace/end_trace calls so that both static and dynamic traces can capture Realm event graphs and replay them as directed by the mapper. First, you'll notice that the existing begin_static_trace/end_static_trace calls are deprecated: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L6588-6610 Next, you'll see that begin_trace now has new calls to it to say whether it is a static trace or not and, if so, which region trees are being handled statically: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L6553-6583 You should be able to use that for now to create static traces which can also be memoized by the mapper.

I'm now renaming this issue to describe our planned future enhancement which is to add support for Realm graph construction operations inside of Legion.

BlakeGerard commented 3 years ago

Hello. I was hoping to follow up on the status of this issue, and to check that my current understanding of the existing API for expressing static dependencies is complete.

Operating under the same representation as @senanayake-reservoir has described above, I am attempting to supply static dependence information to Legion through the begin_trace/end_trace calls with the static_trace=true, and the TaskLauncher.static_dependences field. All tasks are launched from one main task. As of now, all tasks hold region requirements with exclusive coherence over their data dependent regions. However, through experimentation it seems that we should be specifying relaxed coherence over data dependence regions in order for the static_dependences to be observed, forgoing the need for dynamic dependence analysis. Are these three components sufficient for the Legion runtime to solely rely on the static dependence information provided? If not, would you please provide some direction as to the correct program structure/API calls needed to fully utilize this static tracing feature? A sample program or test would be useful for comparison, if available.

For example, I attempted to modify the privilege tutorial code by inserting begin_trace/end_trace delimiters around the task launches in the top level task, and modeled the Legion Spy-generated event dependence graph by manually specifying the StaticDependence objects to reconstruct the appropriate edges. I did this with and without relaxed coherence on all region requirements, but no specification of DependenceType seemed to affect the outcome of the program. However, if I specified dependences between the init tasks and the daxpy task, and did not specify a dependence between the daxpy and check task, then the runtime would inform me that the check task was operating on uninitialized data, seemingly communicating that the check and daxpy tasks were scheduled to run simultaneously.

In the practical kernel we are looking at, if we enforce exclusive coherence over each task’s data blocks while also supplying StaticDependences within a trace, it appears that the dynamic data dependence makes the static dependences redundant. When I specify relaxed coherence on all data blocks and include StaticDependences, then we see a relevant speedup. Is this an indication that the runtime is taking advantage of StaticDependences?

Thanks!

lightsighter commented 3 years ago

However, through experimentation it seems that we should be specifying relaxed coherence over data dependence regions in order for the static_dependences to be observed, forgoing the need for dynamic dependence analysis.

The use of relaxed coherence shouldn't matter at all for how you specify static dependences. The two concepts are orthogonal. If you choose to specify static dependences, then the runtime is only going to abide by the dependences that you specify. That is true regardless of the coherence modes.

Are these three components sufficient for the Legion runtime to solely rely on the static dependence information provided?

All you need to do for static dependences is have the begin/end static trace calls and to fill in all the static dependence fields for the tasks that fall in that trace. Nothing else is required.

A sample program or test would be useful for comparison, if available.

There was one about 5 years ago when this feature was developed. It bit-rotted and was removed from the repository. Most users at the moment find that they get better performance using dynamic tracing than static tracing ever provided because static tracing was only tracing the logical dependences in the program, whereas the most expensive part of the analysis is in the physical analysis which is not amortized by static tracing. The feature extension to allow users to express Realm event graphs for parts of the program would actually address this.

However, if I specified dependences between the init tasks and the daxpy task, and did not specify a dependence between the daxpy and check task, then the runtime would inform me that the check task was operating on uninitialized data, seemingly communicating that the check and daxpy tasks were scheduled to run simultaneously.

This is indicative of your code missing dependences that are essential for the correctness of the code. If you build Legion with detailed support for Legion Spy and then produce a log from a run, then you can use the -l flag to verify that all the needed logical dependences are present. Legion Spy will report the ones that you are missing.

When I specify relaxed coherence on all data blocks and include StaticDependences, then we see a relevant speedup. Is this an indication that the runtime is taking advantage of StaticDependences?

Without a profile, it is impossible to say why your program is or is not speeding up. I suspect that the speedup you are getting is due to changes in the physical analysis (likely unsound ones since relaxed coherence in the physical analysis is probably not what you actually want if your dependences are actually need to be exclusive). I doubt that your use of static tracing is actually having much bearing on performance unless you look very closely at the profile and observe noticeable reductions in the size of runtime meta-tasks for doing logical analysis.

BlakeGerard commented 3 years ago

Most users at the moment find that they get better performance using dynamic tracing than static tracing ever provided because static tracing was only tracing the logical dependences in the program, whereas the most expensive part of the analysis is in the physical analysis which is not amortized by static tracing. The feature extension to allow users to express Realm event graphs for parts of the program would actually address this.

Thank you for this. We will keep exploring the benefits of dynamic tracing and comparing against static tracing.