StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

legion_prof crash #1546

Open syamajala opened 1 year ago

syamajala commented 1 year ago

I am seeing a crash with legion_prof:

(/pscratch/sd/s/seshu/cunumeric-cuda) seshu@perlmutter:login06:/pscratch/sd/s/seshu/2023/09/07/095158> legion_prof legate_*.prof
Reading log file "legate_0.prof"...
Reading log file "legate_2.prof"...
Reading log file "legate_3.prof"...
Reading log file "legate_1.prof"...
Matched 55611 objects
Matched 49128 objects
Matched 48719 objects
Matched 49847 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
Generating interactive visualization files in directory "legion_prof"
thread '<unnamed>' panicked at 'Error: Copy can not find src_inst:0x8000004000500037', src/backend/common.rs:915:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
(/pscratch/sd/s/seshu/cunumeric-cuda) seshu@perlmutter:login06:/pscratch/sd/s/seshu/2023/09/07/095158> legion_prof --archive legate_*.prof
Reading log file "legate_0.prof"...
Reading log file "legate_2.prof"...
Reading log file "legate_1.prof"...
Reading log file "legate_3.prof"...
Matched 55611 objects
Matched 49128 objects
Matched 48719 objects
Matched 49847 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
Created output directory "legion_prof.1"
Writing level 0 with 1 tiles
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
Rayon: detected unexpected panic; aborting
Rayon: detected unexpected panic; aborting
Aborted

These are the two modes I care about right now.

The logs are here: http://sapling2.stanford.edu/~seshu/rixs/logs/

I believe I am using this commit of Legion:

commit 3fbed8ed8771f931e463400390edcd9484fa8537 (HEAD -> shardrefine, origin/shardrefine)
Author: Mike Bauer <mike@lightsighter.org>
Date:   Tue Sep 5 00:48:39 2023 -0700

    legion: remove advisement operations since they are no longer meaningful with the new refinement data structures and replace them with reset operations
eddy16112 commented 1 year ago

I dug into the log files, and it turns out that instance 0x8000004000500037 is never logged, which is why the copy complains that the instance does not exist. Here is the CopyInstInfo that references it:

src_mem: 0x1e0000000000000a  dst_mem: 0x1e0001000000000a  src_fid: 0  dst_fid: 0  src_inst: 0x8000004000500037  dst_inst: 0x8000801000a00050  fevent: 0x800080100de00018  num_hops: 1  indirect: False
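
For context, the panic at src/backend/common.rs:915 is an instance lookup failing: the copy record references a source instance with no entry in the profiler's instance table. A minimal sketch of that failure mode in Rust, with illustrative names (this is not the actual profiler code):

// Sketch only: a copy record whose src_inst was never logged has no entry in
// the instance table, so the lookup panics. Field names mirror the
// CopyInstInfo dump above; everything else is illustrative.
use std::collections::BTreeMap;

struct CopyInstInfo {
    src_inst: u64,
    dst_inst: u64,
}

fn resolve_src(instances: &BTreeMap<u64, String>, copy: &CopyInstInfo) -> String {
    instances
        .get(&copy.src_inst)
        .unwrap_or_else(|| panic!("Error: Copy can not find src_inst:{:#x}", copy.src_inst))
        .clone()
}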

I think we need @lightsighter to take a look at this issue.

lightsighter commented 1 year ago

Run with the latest control replication first.

syamajala commented 1 year ago

With the latest control replication, the default mode of legion_prof works, but archive mode still crashes:

Created output directory "legion_prof.1"
Writing level 0 with 1 tiles
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
Rayon: detected unexpected panic; aborting
Rayon: detected unexpected panic; aborting
Aborted

I think @elliottslaughter needs to look at this. The logs are here: http://sapling2.stanford.edu/~seshu/rixs/logs/

syamajala commented 1 year ago

I did another run on control_replication and hit the missing-copy error again.

commit 61a919f8b2dce55619b6318de19b63aa5c1c56d9 (HEAD -> control_replication, origin/control_replication)
Author: Mike <mebauer@cs.stanford.edu>
Date:   Wed Sep 6 17:02:41 2023 -0700

    legion: make sure to clear references to the initial future when all-reduce operations are deactivated

Not sure if they are useful, but the profile logs are available here: http://sapling2.stanford.edu/~seshu/rixs/logs/missing_copy/

lightsighter commented 1 year ago

Are we sure that we are doing all the detach operations on external allocations? Any instances that Legion allocates must be deleted, and their profiling responses must come back to the runtime, before it can shut down; otherwise it will hang. All that leaves are the instances made for external allocations via attach/detach.

If you want me to look at it, I will need a reproducer on sapling.

elliottslaughter commented 1 year ago

I think I found a root cause for https://github.com/StanfordLegion/legion/issues/1546#issuecomment-1711981596, but there is a design decision that needs to be made here.

When you have a mix of dynamically and statically sized memories, the profiler currently sums only the static memory sizes to compute an upper bound. (This is for the sysmem overview across all nodes.) That is obviously wrong. However, what is the correct bound?

At the point we are in the code, it would be easy to compute a dynamic bound on all memories. However, that may be inaccurate because the static memories aren't necessarily fully used.

I can't see an easy way to split it out (e.g., sum all static memory sizes and then add the dynamic memory allocations on top of that). If we really want that as an answer, I'll have to re-architect the code to achieve it.

Edit: Another alternative would be to take the max of the static and dynamic capacities. That's a bit odd, but capacity for dynamic memories is odd anyway.

Edit 2: One more option is to split out the utilization overview for static vs dynamic memories. After all, "dynamic" memories are really for external instances, and arguably they're basically a different memory kind. That might give users a more accurate view of where memory is going, and it certainly makes it easier to calculate usage in the profiler. (To be clear, this is entirely a profiler-backend change. No changes in Legion are implicated here.)

Edit 3: @eddy16112 is asking if we need overviews for memories at all. I personally think they're useful, but one option is to remove them.

lightsighter commented 1 year ago

@syamajala Apply this patch to Legion and then try to generate the profile logs again. When it crashes, report the backtrace with line numbers:

diff --git a/runtime/legion/runtime.cc b/runtime/legion/runtime.cc
index 00db22fb5..d221d793d 100644
--- a/runtime/legion/runtime.cc
+++ b/runtime/legion/runtime.cc
@@ -8252,6 +8252,13 @@ namespace Legion {
                                                              constraints,
                                                              dim_order);
       Realm::ProfilingRequestSet no_requests;
+      if (runtime->profiler != NULL)
+      {
+        const RtUserEvent unique = Runtime::create_rt_user_event();
+        Runtime::trigger_event(unique);
+        runtime->profiler->add_inst_request(no_requests,
+            implicit_provenance, unique);
+      }
       Realm::RegionInstance::create_instance(eager_pool_instance,
                                              memory,
                                              layout,
@@ -8272,9 +8279,7 @@ namespace Legion {
     //--------------------------------------------------------------------------
     MemoryManager::~MemoryManager(void)
     //--------------------------------------------------------------------------
-    {
-      if (eager_pool_instance.exists())
-        eager_pool_instance.destroy();
+    { 
       delete eager_allocator;
     }

@@ -8413,6 +8418,8 @@ namespace Legion {
         free_legion_instance(it->first, it->second);
       pending_collectables.clear();
 #endif
+      if (eager_pool_instance.exists())
+        eager_pool_instance.destroy();
     }

     //--------------------------------------------------------------------------
diff --git a/runtime/realm/inst_impl.cc b/runtime/realm/inst_impl.cc
index 33f959907..c3665d9ec 100644
--- a/runtime/realm/inst_impl.cc
+++ b/runtime/realm/inst_impl.cc
@@ -435,6 +435,7 @@ namespace Realm {
                                                     const ProfilingRequestSet& prs,
                                                     Event wait_on)
     {
+      assert(!prs.empty());
       return RegionInstanceImpl::create_instance(inst, memory, ilg, 0,
                                                 prs, wait_on);
     }
@@ -446,6 +447,7 @@ namespace Realm {
                                                              const ProfilingRequestSet& prs,
                                                              Event wait_on)
     {
+      assert(!prs.empty());
       return RegionInstanceImpl::create_instance(inst, memory, ilg, &res,
                                                 prs, wait_on);
     }
@@ -777,6 +779,7 @@ namespace Realm {
                                                         const ProfilingRequestSet& prs,
                                                         Event wait_on)
     {
+      assert(!prs.empty());
       MemoryImpl *m_impl = get_runtime()->get_memory_impl(memory);
       RegionInstanceImpl *impl = m_impl->new_instance();
       // we can fail to get a valid pointer if we are out of instance slots
lightsighter commented 1 year ago

> At the point we are in the code, it would be easy to compute a dynamic bound on all memories. However, that may be inaccurate because the static memories aren't necessarily fully used.

@elliottslaughter I would compute the utilization of the static memories as normal, based on their total size. For dynamic memories, I would compute utilization based on the total size of all live instances in them at any point. So 100% utilization of a dynamic memory is the point(s) where it hits its high-water mark.
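
Concretely, that dynamic capacity is just the peak of the running sum of live instance sizes. A short sketch of the computation (illustrative names, not profiler code):

// Sketch of the high-water-mark idea for a dynamic memory. Each event is
// (timestamp, signed size delta): +size when an instance is created, -size
// when it is destroyed. The peak of the running sum is 100% utilization.
fn dynamic_capacity(mut events: Vec<(u64, i64)>) -> i64 {
    events.sort_by_key(|&(t, _)| t);
    let (mut live, mut peak) = (0i64, 0i64);
    for (_, delta) in events {
        live += delta;
        peak = peak.max(live);
    }
    peak
}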

elliottslaughter commented 1 year ago

Maybe I'm not explaining myself clearly.

The profile provided by @syamajala has 4 nodes. It has 8 sysmems total. Of those, 4 (1 per node) are normal sysmem (what I called "statically sized" above, because it has a fixed size). Then there are 4 more external sysmems (what I called "dynamically sized", because the memory pool isn't fixed).

When we generate the utilization plot for a single memory, everything works. The logic in the profiler today is correct.

The problem is rendering the overview utilization plot. This covers the entire machine, so all 8 sysmems (both static and dynamic). To generate this plot we pretend the machine has a single aggregate memory and add up the sizes of the individual memories. Because the aggregate memory size is not 0 (it contains at least 1 static memory), it does not take the dynamic memory code path. So we successfully add up the sizes of the static memories, but if there are enough dynamic allocations, the measured usage exceeds that capacity and trips the sample_util <= 1.0 assertion.

You can see the problematic code here:

https://github.com/StanfordLegion/legion/blob/7b5ff2fb9974511c28aec8d97b942f26105b5f6d/tools/legion_prof_rs/src/backend/common.rs#L367-L376
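
In rough pseudocode, the behavior described above looks like this (a sketch of the described logic, not the linked code):

// Sketch only: the aggregate takes the static path whenever any member memory
// has a fixed size, so instances living in dynamic memories are counted
// against a capacity that never included them, and sample_util exceeds 1.0.
fn aggregate_capacity_today(static_sizes: &[u64], dynamic_peak: u64) -> u64 {
    let static_total: u64 = static_sizes.iter().sum();
    if static_total != 0 {
        static_total // dynamic allocations sit on top of this and overflow it
    } else {
        dynamic_peak
    }
}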

To recap, I can see a few different options:

  1. Calculate the aggregate size as the static capacities + the maximum size of instances in dynamic memories. This is the theoretically optimal approach, but the code isn't designed to do it, and it would impose overheads in either computation or memory relative to what we do now.
  2. Pretend that the entire aggregate memory is dynamic and just calculate its size based on used instances across the entire machine. This is easy to do today but not accurate, because we throw away information about the sizes of the static memories.
  3. Take the max of the static size and the dynamic size. A bit odd, but arguably a better approximation than (2); see the sketch after this list.
  4. Split the overview into 2 parts. Now we'll have a "sysmem" and a "sysmem external" utilization plot in the overview. Arguably might be better than (1) since users might care about this detail.
  5. Stop providing overview utilization for memories for the entire machine. The problem goes away, at the cost of this feature.
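
A minimal sketch of option (3), reusing the illustrative helpers from above:

// Option (3), sketched: cap the aggregate at the larger of the static total
// and the dynamic high-water mark. Still an approximation, but it never
// undercounts whichever class of memory dominates.
fn aggregate_capacity_option3(static_sizes: &[u64], dynamic_peak: u64) -> u64 {
    let static_total: u64 = static_sizes.iter().sum();
    static_total.max(dynamic_peak)
}
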
lightsighter commented 1 year ago

Option (1) was exactly what I was proposing. You can decide if you think it is worth it or not to do that.

lightsighter commented 1 year ago

I suppose I wouldn't mind doing option (5) either, since I feel the overview is misleading anyway: some memories (like sysmems) are usually not full, while other memories (GPU framebuffers) run very close to capacity.

syamajala commented 1 year ago

I am not hitting any assertions when I run with @lightsighter's runtime patch, but I am still hitting the dynamic-memory issue Elliott is looking at.

lightsighter commented 1 year ago

@syamajala Do the profiles you generate still fail with this error message when running with the patch?

thread '<unnamed>' panicked at 'Error: Copy can not find src_inst:0x8000004000500037', src/backend/common.rs:915:21
syamajala commented 1 year ago

No. They are failing with: thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13

lightsighter commented 1 year ago

And that is true over repeated runs?

elliottslaughter commented 1 year ago

You can try cherry-picking https://gitlab.com/StanfordLegion/legion/-/merge_requests/920 to work around the dynamic memory capacity issue, on the off chance that helps expose the other bug.

syamajala commented 1 year ago

The workaround for the dynamic memory issue works for me, and I am able to generate profiles now.

lightsighter commented 1 year ago

I pushed a fix for the second failure mode to the control replication branch. @syamajala please pull and try again.