StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: Unable to find entry for index partition #1277

Closed: syamajala closed this issue 2 years ago

syamajala commented 2 years ago

After updating to the latest control_replication (af833c94b086) on Summit, I'm seeing this:

[2 - 20004c33f890] 9.102109 {5}{runtime}: [error 482] LEGION ERROR: Unable to find entry for index partition 18. This is definitely a runtime bug. (from file /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/region_tree.cc:4661)

Here is a stack trace:

[10] Thread 8 (Thread 0x20048104f890 (LWP 1709290)):
[10] #0  0x00002000009e9ca0 in waitpid () from /lib64/power9/libc.so.6
[10] #1  0x0000200000954340 in do_system () from /lib64/power9/libc.so.6
[10] #2  0x00002000008c8ec8 in system_compat () from /lib64/power9/libpthread.so.0
[10] #3  0x0000200007088420 in gasneti_system_redirected () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/language/build/lib/librealm.so.1
[10] #4  0x0000200007088b8c in gasneti_bt_gdb () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/language/build/lib/librealm.so.1
[10] #5  0x000020000708d9b4 in gasneti_print_backtrace () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/language/build/lib/librealm.so.1
[10] #6  0x000020000708df7c in _gasneti_print_backtrace_ifenabled () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/language/build/lib/librealm.so.1
[10] #7  0x00002000060a3398 in gasneti_defaultSignalHandler () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/language/build/lib/librealm.so.1
[10] #8  <signal handler called>
[10] #9  0x0000200000943618 in raise () from /lib64/power9/libc.so.6
[10] #10 0x0000200000923a2c in abort () from /lib64/power9/libc.so.6
[10] #11 0x00002000049ec994 in Legion::Internal::Runtime::report_error_message (id=482, file_name=0x2000052896d0 "/gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/region_tree.cc", line=4661, message=0x20048101fca8 "Unable to find entry for index partition 18. This is definitely a runtime bug.") at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/runtime.cc:30991
[10] #12 0x0000200004857544 in Legion::Internal::RegionTreeForest::get_node (this=0x1f4f8250, part=..., defer=0x0, can_fail=false, first=true, local_only=false) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/region_tree.cc:4661
[10] #13 0x0000200004845100 in Legion::Internal::RegionTreeForest::is_index_partition_disjoint (this=0x1f4f8250, p=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/region_tree.cc:1344
[10] #14 0x00002000049b0238 in Legion::Internal::Runtime::is_index_partition_disjoint (this=0x1f530050, p=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/runtime.cc:18254
[10] #15 0x000020000436b7d0 in Legion::Runtime::is_index_partition_disjoint (this=0x1f4f8210, p=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/legion.cc:5421
[10] #16 0x000020000411ff70 in legion_index_partition_is_disjoint (runtime_=..., handle_=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/legion_c.cc:1687
[10] #17 0x0000200000bcee38 in $<main> () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/build/C1A2/libregent_tasks.so
[10] #18 0x0000200000bae7e8 in $__regent_task_main_primary () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/build/C1A2/libregent_tasks.so
[10] #19 0x00002000066d3e1c in Realm::LocalTaskProcessor::execute_task (this=0x20f9b230, func_id=4123, task_args=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/proc_impl.cc:1135
[10] #20 0x0000200006784508 in Realm::Task::execute_on_processor (this=0x20048806fc30, p=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/tasks.cc:302
[10] #21 0x00002000067895c8 in Realm::KernelThreadTaskScheduler::execute_task (this=0x20f9b550, task=0x20048806fc30) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/tasks.cc:1366
[10] #22 0x0000200006787ff0 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x20f9b550) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/tasks.cc:1105
[10] #23 0x00002000067886ec in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x20f9b550) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/tasks.cc:1217
[10] #24 0x0000200006794e2c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x20f9b550) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/threads.inl:97
[10] #25 0x00002000067a9bec in Realm::KernelThread::pthread_entry (data=0x1f595bc0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/threads.cc:774
[10] #26 0x00002000008b8ae0 in start_thread () from /lib64/power9/libpthread.so.0
[10] #27 0x0000200000a2e7c8 in clone () from /lib64/power9/libc.so.6
lightsighter commented 2 years ago

@syamajala Are we sure this isn't a duplicate of #915?

syamajala commented 2 years ago

S3D has used a fence for a while now between creating a partition and its first use by a projection functor. That has not changed recently in the application code.

This issue only seems to appear on Summit with the new commit; going back to an older Legion commit (ffca5154), everything works. I was able to run on Blaze with the newer commit without any problems. There have been no application changes between the two Legion versions.
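For reference, the pattern described above looks roughly like the following C++ sketch. This is not the actual S3D/Regent code: the task ID, region handles, color space, and projection functor ID are all made up for illustration.

#include "legion.h"
using namespace Legion;

// Hedged sketch of the partition-then-fence pattern; all names here
// (parent_is, parent_lr, color_space, task_id, functor id 1) are illustrative.
void partition_then_launch(Context ctx, Runtime *runtime,
                           IndexSpace parent_is, LogicalRegion parent_lr,
                           IndexSpace color_space, TaskID task_id)
{
  // Create a partition of the parent index space and its logical view.
  IndexPartition ip = runtime->create_equal_partition(ctx, parent_is, color_space);
  LogicalPartition lp = runtime->get_logical_partition(ctx, parent_lr, ip);

  // Execution fence between creating the partition and its first use by a
  // projection functor, so the partition is known before it is looked up.
  runtime->issue_execution_fence(ctx);

  // Index launch whose region requirement goes through a projection functor.
  Domain launch_domain = runtime->get_index_space_domain(ctx, color_space);
  IndexTaskLauncher launcher(task_id, launch_domain, TaskArgument(), ArgumentMap());
  launcher.add_region_requirement(
      RegionRequirement(lp, /*projection functor id*/ 1,
                        READ_WRITE, EXCLUSIVE, parent_lr));
  runtime->execute_index_space(ctx, launcher);
}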

lightsighter commented 2 years ago

This issue only seems to appear on Summit with the new commit; going back to an older Legion commit (https://github.com/StanfordLegion/legion/commit/ffca51544500f3149169d4f0b59651b063f23e77), everything works.

Ok, can you bisect to find exactly which commit causes the failure?

syamajala commented 2 years ago

I bisected on Summit and the first bad commit is https://github.com/StanfordLegion/legion/commit/be62515f19143779a53c4c1a6d679515c2196a14

lightsighter commented 2 years ago

@elliottslaughter I'm not sure it's safe to go full context-free right now with control replication.

elliottslaughter commented 2 years ago

What do you want to do?

This was a recent C API change, but the context-free C++ API had been around for a while. There could be C++ clients relying on it.

I could put in a context-ful C API, just to work around this in Regent. It will have some knock-on effects on Regent's analysis (which is why I wanted it context-free in the first place), but we could probably live with that in the short term. But are we going to get back to the context-free API eventually, or are we expecting this to continue to be a problem?
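For concreteness, the context-free vs. context-ful distinction looks roughly like this in the C++ API (a sketch, not a recommendation; these are the same two get_logical_partition overloads that the patch further down switches between):

#include "legion.h"
using namespace Legion;

// Illustrative only: the two call forms under discussion.
LogicalPartition lookup(Runtime *runtime, Context ctx,
                        LogicalRegion parent, IndexPartition handle)
{
  // Context-free form: callable from places without a task context
  // (e.g. projection functors), but relies on the runtime resolving the
  // query on its own.
  LogicalPartition lp_free = runtime->get_logical_partition(parent, handle);

  // Context-ful form: ties the lookup to the calling task's context.
  LogicalPartition lp_ctx = runtime->get_logical_partition(ctx, parent, handle);

  (void)lp_free;  // unused in this sketch
  return lp_ctx;
}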

lightsighter commented 2 years ago

It depends on what the group decides we're going to do about #915, which we've discussed in the last two Legion meetings. It's looking more and more like the context-free API is going to be deprecated in a lot of places (although not all). I don't see a way to support context-free calls in all the places they might be invoked and still be sound. I'm probably willing to keep the context-free API intact, but it will become a fatal error if the runtime can't find an implicit context.

elliottslaughter commented 2 years ago

What about inside projection functors? I think that's the reason I originally pushed for the context-free version of this API. Also, it's not obvious to me what it would mean to issue a fence (or whether that's even desirable) inside a projection functor.
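To make the projection-functor case concrete, here is a rough C++ sketch of a functor that performs this kind of lookup with no task context available. The class name, nested_color, and the particular project() overload are invented for illustration; the exact virtual signatures and the depth/functional semantics vary between Legion versions, so treat this as illustrative rather than canonical.

#include "legion.h"
using namespace Legion;

// Hedged sketch: a projection functor that walks into a nested partition
// using only context-free runtime queries, since no task Context exists
// inside a projection functor. Check legion.h for the exact signatures.
class NestedPartitionFunctor : public ProjectionFunctor {
public:
  NestedPartitionFunctor(Runtime *rt, Color subcolor)
    : ProjectionFunctor(rt), nested_color(subcolor) { }

  using ProjectionFunctor::project;
  virtual LogicalRegion project(LogicalPartition upper_bound,
                                const DomainPoint &point,
                                const Domain &launch_domain)
  {
    // Only the context-free query forms are usable here.
    LogicalRegion subregion =
        runtime->get_logical_subregion_by_color(upper_bound, point);
    LogicalPartition nested =
        runtime->get_logical_partition_by_color(subregion, nested_color);
    return runtime->get_logical_subregion_by_color(nested, point);
  }

  virtual bool is_functional(void) const { return true; }
  virtual unsigned get_depth(void) const { return 1; }  // descends one extra level

private:
  const Color nested_color;
};

Such a functor would typically be registered under a ProjectionID (e.g. via Runtime::preregister_projection_functor) before the runtime starts.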

elliottslaughter commented 2 years ago

@syamajala can you try this workaround?

diff --git a/bindings/regent/regent_partitions.cc b/bindings/regent/regent_partitions.cc
index cf800b737..2b50fb613 100644
--- a/bindings/regent/regent_partitions.cc
+++ b/bindings/regent/regent_partitions.cc
@@ -641,11 +641,13 @@ legion_terra_index_cross_product_get_subpartition_by_color_domain_point(
   legion_domain_point_t color_)
 {
   Runtime *runtime = CObjectWrapper::unwrap(runtime_);
+  assert(Runtime::has_context());
+  Context ctx = Runtime::get_context();
   IndexPartition partition = CObjectWrapper::unwrap(prod.partition);
   DomainPoint color = CObjectWrapper::unwrap(color_);

-  IndexSpace is = runtime->get_index_subspace(partition, color);
-  IndexPartition ip = runtime->get_index_partition(is, prod.other_color);
+  IndexSpace is = runtime->get_index_subspace(ctx, partition, color);
+  IndexPartition ip = runtime->get_index_partition(ctx, is, prod.other_color);
   return CObjectWrapper::wrap(ip);
 }

diff --git a/runtime/legion/legion_c.cc b/runtime/legion/legion_c.cc
index 3293a6052..b51fbe556 100644
--- a/runtime/legion/legion_c.cc
+++ b/runtime/legion/legion_c.cc
@@ -2275,10 +2275,12 @@ legion_logical_partition_create(legion_runtime_t runtime_,
                                 legion_index_partition_t handle_)
 {
   Runtime *runtime = CObjectWrapper::unwrap(runtime_);
+  assert(Runtime::has_context());
+  Context ctx = Runtime::get_context();
   LogicalRegion parent = CObjectWrapper::unwrap(parent_);
   IndexPartition handle = CObjectWrapper::unwrap(handle_);

-  LogicalPartition r = runtime->get_logical_partition(parent, handle);
+  LogicalPartition r = runtime->get_logical_partition(ctx, parent, handle);
   return CObjectWrapper::wrap(r);
 }
syamajala commented 2 years ago

This patch works.

elliottslaughter commented 2 years ago

So I guess the question for @lightsighter is whether this is a solution we want to go with, at least long enough to be worth committing into the repo?

lightsighter commented 2 years ago

What about inside projection functors? I think that's the reason I originally pushed for the context-free version of this API. Also, it's not obvious to me what it would mean to issue a fence (or whether that's even desirable) inside a projection functor.

All the same issues I brought up in the Legion meeting two weeks ago. :)

So I guess the question for @lightsighter is whether this is a solution we want to go with, at least long enough to be worth committing into the repo?

Seems like it's probably worth a more detailed conversation in this week's Legion meeting, now that we have more context and an actual example.

lightsighter commented 2 years ago

I looked at this code again, and I can't explain why it changes any functionality in the current implementation. I'm not convinced that it actually fixes the original problem. It might just be perturbing the timing.

syamajala commented 2 years ago

I don't think this issue is relevant anymore.