Open syamajala opened 1 year ago
I dug into the log files, and it turns out the instance 0x8000004000500037 is never logged; that is why the copy complains that there is no such instance. Here is the CopyInstInfo that contains this instance:
src_mem: 0x1e0000000000000a dst_mem: 0x1e0001000000000a src_fid: 0 dst_fid: 0 src_inst: 0x8000004000500037 dst_inst: 0x8000801000a00050 fevent: 0x800080100de00018 num_hops: 1 indirect: False
I think we need @lightsighter to take a look at this issue.
Run with the latest control replication first.
With the latest control replication I still see a crash with archive mode of legion_prof, but the default mode works.
Created output directory "legion_prof.1"
Writing level 0 with 1 tiles
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
Rayon: detected unexpected panic; aborting
Rayon: detected unexpected panic; aborting
Aborted
I think @elliottslaughter needs to look at this. The logs are here: http://sapling2.stanford.edu/~seshu/rixs/logs/
I did another run and hit the missing copy error again on control_replication.
commit 61a919f8b2dce55619b6318de19b63aa5c1c56d9 (HEAD -> control_replication, origin/control_replication)
Author: Mike <mebauer@cs.stanford.edu>
Date: Wed Sep 6 17:02:41 2023 -0700
legion: make sure to clear references to the initial future when all-reduce operations are deactivated
Not sure if they are useful but profile logs are available here: http://sapling2.stanford.edu/~seshu/rixs/logs/missing_copy/
Are we sure that we are doing all the detach operations on external allocations? Any instances that Legion allocates must be deleted, and their profiling responses must come back to the runtime, before it can shut down; otherwise it will hang. All that leaves are the instances made for external allocations with attach/detach.
If you want me to look at it I will need a reproducer on sapling.
I think I found a root cause for https://github.com/StanfordLegion/legion/issues/1546#issuecomment-1711981596, but there is a design decision that needs to be made here.
When you have a mix of dynamically and statically-sized memories, the profiler currently takes the sum of the static memories to compute an upper bound. (This is for the sysmem overview across all nodes.) That is obviously wrong. However, what is the correct bound?
At the point we are in the code, it would be easy to compute a dynamic bound on all memories. However, that may be inaccurate because the static memories aren't necessarily fully used.
I can't see an easy way to split it out (e.g., sum all static memory sizes and then add the dynamic memory allocations on top of that). If we really want that as an answer, I'll have to re-architect the code to achieve it.
Edit: Another alternative would be to take the max of the static and dynamic capacities. That's a bit odd, but capacity for dynamic memories is odd anyway.
Edit 2: One more option is to split out the utilization overview for static vs dynamic memories. After all, "dynamic" memories are really for external instances, and arguably they're basically a different memory kind. That might give users a more accurate view of where memory is going, and it certainly makes it easier to calculate usage in the profiler. (To be clear, this is entirely a profiler-backend change. No changes in Legion are implicated here.)
Edit 3: @eddy16112 is asking if we need overviews for memories at all. I personally think they're useful, but one option is to remove them.
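To make the trade-offs concrete, here is a minimal Rust sketch of two of the capacity bounds discussed above: the static-only sum (the current, incorrect behavior for mixed machines) and a split bound that adds dynamic high-water marks on top of the static sum. All names here are hypothetical illustrations, not the actual legion_prof types.

```rust
// Hypothetical model of the two memory kinds discussed above.
// `Static` has a fixed capacity; `Dynamic` (external instances) only has a
// meaningful "capacity" in the sense of its high-water mark.
#[derive(Clone, Copy)]
enum MemKind {
    Static { capacity: u64 },
    Dynamic { peak_usage: u64 },
}

// The behavior described as wrong above: sum only the static capacities,
// ignoring dynamic memories entirely.
fn static_only_bound(mems: &[MemKind]) -> u64 {
    mems.iter()
        .map(|m| match m {
            MemKind::Static { capacity } => *capacity,
            MemKind::Dynamic { .. } => 0,
        })
        .sum()
}

// The "split it out" option: static capacities plus dynamic high-water
// marks, so aggregate utilization can never exceed 100%.
fn split_bound(mems: &[MemKind]) -> u64 {
    mems.iter()
        .map(|m| match m {
            MemKind::Static { capacity } => *capacity,
            MemKind::Dynamic { peak_usage } => *peak_usage,
        })
        .sum()
}

fn main() {
    // One 4 GiB sysmem plus one external memory that peaked at 6 GiB.
    let mems = [
        MemKind::Static { capacity: 4 << 30 },
        MemKind::Dynamic { peak_usage: 6 << 30 },
    ];
    let usage: u64 = 9 << 30; // total live footprint at some sample point
    // Against the static-only bound this sample exceeds 1.0.
    println!("static-only util: {:.2}", usage as f64 / static_only_bound(&mems) as f64);
    // Against the split bound it stays <= 1.0.
    println!("split util: {:.2}", usage as f64 / split_bound(&mems) as f64);
}
```

The "max of static and dynamic capacities" alternative would just replace the sums with a `max`; the sketch only covers the two bounds that keep every sample below 100%.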
@syamajala Apply this patch to Legion and then try to generate the profile logs again. When it crashes, report the backtrace with line numbers:
diff --git a/runtime/legion/runtime.cc b/runtime/legion/runtime.cc
index 00db22fb5..d221d793d 100644
--- a/runtime/legion/runtime.cc
+++ b/runtime/legion/runtime.cc
@@ -8252,6 +8252,13 @@ namespace Legion {
constraints,
dim_order);
Realm::ProfilingRequestSet no_requests;
+ if (runtime->profiler != NULL)
+ {
+ const RtUserEvent unique = Runtime::create_rt_user_event();
+ Runtime::trigger_event(unique);
+ runtime->profiler->add_inst_request(no_requests,
+ implicit_provenance, unique);
+ }
Realm::RegionInstance::create_instance(eager_pool_instance,
memory,
layout,
@@ -8272,9 +8279,7 @@ namespace Legion {
//--------------------------------------------------------------------------
MemoryManager::~MemoryManager(void)
//--------------------------------------------------------------------------
- {
- if (eager_pool_instance.exists())
- eager_pool_instance.destroy();
+ {
delete eager_allocator;
}
@@ -8413,6 +8418,8 @@ namespace Legion {
free_legion_instance(it->first, it->second);
pending_collectables.clear();
#endif
+ if (eager_pool_instance.exists())
+ eager_pool_instance.destroy();
}
//--------------------------------------------------------------------------
diff --git a/runtime/realm/inst_impl.cc b/runtime/realm/inst_impl.cc
index 33f959907..c3665d9ec 100644
--- a/runtime/realm/inst_impl.cc
+++ b/runtime/realm/inst_impl.cc
@@ -435,6 +435,7 @@ namespace Realm {
const ProfilingRequestSet& prs,
Event wait_on)
{
+ assert(!prs.empty());
return RegionInstanceImpl::create_instance(inst, memory, ilg, 0,
prs, wait_on);
}
@@ -446,6 +447,7 @@ namespace Realm {
const ProfilingRequestSet& prs,
Event wait_on)
{
+ assert(!prs.empty());
return RegionInstanceImpl::create_instance(inst, memory, ilg, &res,
prs, wait_on);
}
@@ -777,6 +779,7 @@ namespace Realm {
const ProfilingRequestSet& prs,
Event wait_on)
{
+ assert(!prs.empty());
MemoryImpl *m_impl = get_runtime()->get_memory_impl(memory);
RegionInstanceImpl *impl = m_impl->new_instance();
// we can fail to get a valid pointer if we are out of instance slots
> At the point we are in the code, it would be easy to compute a dynamic bound on all memories. However, that may be inaccurate because the static memories aren't necessarily fully used.
@elliottslaughter I would compute the utilization of the static memories as normal based on their total size. For dynamic memories, I would compute their utilization based on the total size of all the live instances in them at any point. So 100% utilization of a dynamic memory is the point(s) where it hits its high water mark.
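A minimal sketch of that proposal (hypothetical names, not the actual profiler code): replay the allocation/deallocation timeline once to find the high-water mark, then report each sample as a fraction of that peak, so a dynamic memory reads 100% exactly where it peaks.

```rust
// Sketch: utilization of a dynamically sized memory relative to its
// high-water mark. `deltas` is a time-ordered list of signed footprint
// changes (+size on instance creation, -size on destruction).
fn dynamic_utilization(deltas: &[i64]) -> Vec<f64> {
    // First pass: running live footprint at each sample.
    let mut live = 0i64;
    let mut footprint = Vec::with_capacity(deltas.len());
    for d in deltas {
        live += d;
        footprint.push(live);
    }
    // The peak is the denominator; guard against an all-empty timeline.
    let peak = footprint.iter().copied().max().unwrap_or(0).max(1);
    // Second pass: each sample as a fraction of the peak, so the
    // high-water mark reads exactly 1.0 (100%).
    footprint.iter().map(|&f| f as f64 / peak as f64).collect()
}

fn main() {
    let util = dynamic_utilization(&[4, 4, -2, 2]); // footprints: 4, 8, 6, 8
    println!("{:?}", util); // peak is 8, so samples are 0.5, 1.0, 0.75, 1.0
}
```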
Maybe I'm not explaining myself clearly.
The profile provided by @syamajala has 4 nodes. It has 8 sysmems total. Of those, 4 (1 per node) are normal sysmem (what I called "statically sized" above, because it has a fixed size). Then there are 4 more external sysmems (what I called "dynamically sized", because the memory pool isn't fixed).
When we generate the utilization plot for a single memory, everything works. The logic in the profiler today is correct.
The problem is rendering the overview utilization plot. This covers the entire machine, so all 8 sysmems (both static and dynamic). To generate this plot, we pretend that the whole machine has a single aggregate memory, and we add up the sizes of the individual memories. Because the aggregate size is nonzero (it contains at least 1 static memory), it does not take the dynamic-memory code path. So we successfully add up the sizes of the static memories, but then if there are enough dynamic allocations, the measured usage pushes the computed utilization over 100%.
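In other words (a condensed hypothetical form of the check, not the actual data_source.rs code), the overview path ends up doing something like:

```rust
// Condensed sketch of how the overview utilization overflows.
// The denominator counts only fixed-size memories, while the numerator
// counts live instances in *all* memories, including external ones.
fn overview_sample_util(static_capacity_sum: u64, live_bytes_all_mems: u64) -> f64 {
    live_bytes_all_mems as f64 / static_capacity_sum as f64
}

fn main() {
    // 4 GiB of fixed sysmem, but 5 GiB live once external instances count.
    let sample_util = overview_sample_util(4 << 30, 5 << 30);
    println!("{}", sample_util); // 1.25
    // This is what trips `assert!(sample_util <= 1.0)` in the backend.
    assert!(sample_util > 1.0);
}
```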
You can see the problematic code here:
To recap, I can see a few different options:
Option (1) was exactly what I was proposing. You can decide if you think it is worth it or not to do that.
I suppose I wouldn't mind doing option 5 as well, since the aggregate view is misleading anyway: some memories (like sysmems) are usually not full, while other memories (GPU framebuffers) run very close to capacity.
I am not hitting any assertion when I run with @lightsighter's runtime patch, but I am hitting the issue Elliott is looking at with dynamic memories.
@syamajala Do the profiles you generate still fail with this error message when running with the patch?
thread '<unnamed>' panicked at 'Error: Copy can not find src_inst:0x8000004000500037', src/backend/common.rs:915:21
No. They are failing with:
thread '<unnamed>' panicked at 'assertion failed: sample_util <= 1.0', src/backend/data_source.rs:600:13
And that is true over repeated runs?
You can try cherry-picking https://gitlab.com/StanfordLegion/legion/-/merge_requests/920 to work around the dynamic memory capacity issue, on the off chance that helps expose the other bug.
The workaround for the dynamic memory issue works for me, and I am able to generate profiles now.
I pushed a fix for the second failure mode to the control replication branch. @syamajala please pull and try again.
I am seeing a crash with legion_prof:
These are the two modes I care about right now.
The logs are here: http://sapling2.stanford.edu/~seshu/rixs/logs/
I should be using this commit of legion: