StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
678 stars 145 forks source link

Limitations of the Legion Prof+Spy integration #1328

Open elliottslaughter opened 2 years ago

elliottslaughter commented 2 years ago

I've been digging into the Legion Prof+Spy integration so that I can replicate it in Rust, and in the process have found some (at least to me) surprising limitations. I figure it's worth documenting these so that others are aware—and it may be worth having a discussion about whether it may be worth lifting any of these (or not).

Dependencies in Legion Prof are tracked only for tasks and meta tasks. Though Legion Prof uses Legion Spy to compute a full event graph, Legion Prof interacts with the graph in a way that extracts only the tasks (and to a limited degree, meta tasks).

This means that dependencies in Legion Prof do not track copies, or fills, or even a variety of operations on processors like runtime tasks or mapper calls.

Here is a perfect example of how the current code is too limited to even represent dependencies on operations that do not live on a processor. Note how the coding of a dependence "node" includes the processor ID:

https://github.com/StanfordLegion/legion/blob/2370b143b701fc714a99da6a2c7ab98dc0784d0a/tools/legion_prof.py#L1309

The case of meta tasks is particularly hacky. Legion Prof is able to associate a meta tasks via an initiation_op to a task. But Legion Prof has no concept of dependencies on meta tasks. Instead, Legion Prof compares the stop times of the task vs each associated meta task to determine which one "goes first".

https://github.com/StanfordLegion/legion/blob/2370b143b701fc714a99da6a2c7ab98dc0784d0a/tools/legion_prof.py#L437-L440

Notably, this code attaches the meta task only to the corresponding task and not to any other operation in the event graph. Therefore, we cannot trace dependencies through meta tasks in a meaningful sense.

A result of all of this is that the "critical path" being computed by Legion Prof at the moment is at best a critical path considering only application tasks. It does not consider meta tasks or copies, etc. This seems like a pretty big omission right now because many use cases become bottlenecked on these resources.

At the moment I'm trying to replicate this logic in Rust, but there is a question how useful it'll even be when I complete it, given the state this is in.

eddy16112 commented 2 years ago

The case of meta tasks is particularly hacky. Legion Prof is able to associate a meta tasks via an initiation_op to a task. But Legion Prof has no concept of dependencies on meta tasks. Instead, Legion Prof compares the stop times of the task vs each associated meta task to determine which one "goes first".

I do not quite understand this. https://github.com/StanfordLegion/legion/blob/2370b143b701fc714a99da6a2c7ab98dc0784d0a/tools/legion_prof.py#L424 self can also be MapperCall, Copy, etc.

lightsighter commented 2 years ago

A result of all of this is that the "critical path" being computed by Legion Prof at the moment is at best a critical path considering only application tasks. It does not consider meta tasks or copies, etc. This seems like a pretty big omission right now because many use cases become bottlenecked on these resources.

I agree that the critical path analysis should definitely include copies, fills, depparts, and other application-generated Realm operations as part of its computation. I'm more skeptical that we should include the meta-tasks in the critical path analysis (at least for now). There's not much mappers can do to improve those meta-tasks, whereas there's plenty that mappers can do to alter the shape of the Realm event graph for the actual application tasks/copies/fills/depparts. It also has the benefit of being simpler: for completeness (I've mentioned this to @elliottslaughter already), the meta-task dependences won't look like a normal Realm event graph because they're effectively implementing a continuation-passing-style semantics, so dependences are implicit: one task depends on a prior one if the earlier one launched the later one. It's not that you can't analyze that, but it's not going to be as straight-forward as just looking at the Realm event graph for the application and traversing it to look for the critical path.

At the moment I'm trying to replicate this logic in Rust, but there is a question how useful it'll even be when I complete it, given the state this is in.

I would argue for not duplicating it, but just doing the basic critical path analysis on the application graph (including copies/fills/depparts) and ignoring meta-tasks.