Open lightsighter opened 3 months ago
This approach has the benefit of incurring the least amount of overhead when profiling and doing the most analysis offline, but it is prone to running out of memory on larger runs where we need to load lots of logs upfront to be able to do the analysis.
Just to be clear, the OOM we are getting is because of the profiler and not because of storing/grabbing profiling info from the reduction results?
It would be good if Realm could provide support for barrier profiling. In particular it would be good to know the following information:
I made a crude attempt to get some of this information while manually profiling barriers a while ago and agree that we should probably have an "automated" way to do that. Unclear at which point we are going to want "the profiling support" given that you already have a solution that gets you to a certain point. However, considering that we have an implementation to scale barrier arrivals/broadcast with p2p active messages... perhaps we may want to have the profiling support for that first before we move forwards with it.
Users can request a profiling response for a specific barrier generation of the barrier at any time on any node
What if the barrier on another node has already passed the generation when we request the profiling response on our node?
Just to be clear, the OOM we are getting is because of the profiler and not because of storing/grabbing profiling info from the reduction results?
The OOM is occurring during post-processing of the logfiles by Legion Prof and not during the execution of the program. The problem is that the size of the graph needed to represent the Realm event graph is too big to fit in memory.
Unclear at which point we are going to want "the profiling support" given that you already have a solution that gets you to a certain point.
Right, I have a work-around for now which relies on the barrier reduction mechanism.
perhaps we may want to have the profiling support for that first before we move forwards with it.
I would actually probably prefer that we get that done first and then maybe add this barrier profiling support on top of that once it is ready, especially since we already have a work-around for the moment (assuming the work-around continues to work at scale).
What if the barrier on another node has already passed the generation when we request the profiling response on our node?
I'm assuming that the implementation will store the profiling responses for all the generations indefinitely, similar to how it stores the reduction results of the barrier indefinitely. Yes, this is inefficient, but it's something the user opts into with an understanding of the costs, similar to how they opt into using a reduction operator with a barrier.
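To make the retention idea concrete, here is a minimal C++ sketch of a per-barrier store that keeps one profiling record per generation indefinitely, so a request that arrives after the generation has already passed can still be answered. Everything in it is hypothetical and for illustration only: `GenerationProfile`, its fields, and `BarrierProfileStore` are not part of Realm's API, and a real implementation would also have to deal with distribution across nodes and concurrency.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical profiling record for one barrier generation.
// The fields are illustrative assumptions, not Realm's actual data.
struct GenerationProfile {
  std::uint64_t trigger_time_ns;  // when this generation triggered
  int last_arrival_node;          // node that performed the final arrival
};

// Hypothetical per-barrier store. Records are never evicted, mirroring the
// "store indefinitely" assumption; memory grows with the number of
// profiled generations, which is the cost the user opts into.
class BarrierProfileStore {
public:
  // Save the record for a generation once it has triggered.
  void record(std::uint64_t generation, const GenerationProfile &profile) {
    profiles_[generation] = profile;
  }

  // Look up a generation, possibly long after it has passed.
  // Returns std::nullopt if that generation was never recorded.
  std::optional<GenerationProfile> query(std::uint64_t generation) const {
    auto it = profiles_.find(generation);
    if (it == profiles_.end())
      return std::nullopt;
    return it->second;
  }

  // Number of generations currently retained.
  std::size_t retained() const { return profiles_.size(); }

private:
  std::map<std::uint64_t, GenerationProfile> profiles_;
};
```

Because nothing is ever erased, a late `query` for an old generation succeeds as long as that generation was recorded. This is the same trade-off described above for reduction results.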
@eddy16112
In the process of adding critical path analysis support in Legion, it's become apparent that barriers are very difficult to profile in a scalable way. With critical path support, Legion currently supports two ways of profiling barriers:
-lg:prof_all_critical_arrivals
flag on the command line to opt for this version. This approach has the benefit of incurring the least amount of overhead when profiling and doing the most analysis offline, but it is prone to running out of memory on larger runs where we need to load lots of logs upfront to be able to do the analysis.
It would be good if Realm could provide support for barrier profiling. In particular it would be good to know the following information:
One thing that will be hard about this is defining a model for profiling responses that allows users to request it in a way that allows them to control how many profiling responses they get and on which nodes. I suspect the following model might be a good one:
Assigning to @apryakhin for triaging and delegation.