StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

legion: freeze launching stencil in S3D #1624

Closed (syamajala closed 10 months ago)

syamajala commented 10 months ago

I have updated S3D to the latest control_replication and changed the mapper to use replicate_task, but I'm seeing a freeze when running with multiple ranks. A single-rank run seems to work.

Here are stack traces from a 2-rank run:

Thread 13 (Thread 0x1548fbfff000 (LWP 188272)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1548fbffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1548fbffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7ae96a0, old_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7ae94e0, old_work_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7ae94e0, old_work_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7ae94e0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7ae94e0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7ae94e0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548f4000cf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x154909fff000 (LWP 188271)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x154909ffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x154909ffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x45d3e20, old_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x45d3c60, old_work_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x45d3c60, old_work_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x45d3c60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x45d3c60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x45d3c60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x154902500d90) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x1555204ce000 (LWP 188269)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1555204cdaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1555204cdaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7aea470, old_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7aea2b0, old_work_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7aea2b0, old_work_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7aea2b0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7aea2b0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7aea2b0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0xb137d50) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x154b8dfff000 (LWP 188268)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x154b8dffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x154b8dffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7ae9e30, old_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7ae9c70, old_work_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7ae9c70, old_work_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7ae9c70) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7ae9c70) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7ae9c70) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0xb137880) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x15550b07d000 (LWP 188265)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x15550b07caf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x15550b07caf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x512e390, old_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x512e1d0, old_work_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x512e1d0, old_work_counter=12) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x512e1d0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x512e1d0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x512e1d0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0xb0f6ef0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x1555205d4000 (LWP 188250)):
#0  0x00001555496a15d2 in Realm::UnfairMutex::trylock (this=0x45a5c70) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:186
#1  0x00001555496982a3 in Realm::GASNetEXPoller::do_work (this=0x45a5c20, work_until=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2808
#2  0x0000155549488aa8 in Realm::BackgroundWorkManager::Worker::do_work (this=0x1555205cf9d0, max_time_in_ns=-1, interrupt_flag=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/bgwork.cc:599
#3  0x00001555494867dd in Realm::BackgroundWorkThread::main_loop (this=0x7ae7f20) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/bgwork.cc:103
#4  0x0000155549489e7a in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x7ae7f20) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#5  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x4762e30) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#6  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#7  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x15553c14c000 (LWP 188249)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x15553c14baf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x15553c14baf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549486955 in Realm::BackgroundWorkThread::main_loop (this=0x7ae7ff0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/bgwork.cc:144
#4  0x0000155549489e7a in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x7ae7ff0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#5  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x7ae8090) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#6  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#7  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x1555207d5000 (LWP 188242)):
#0  0x000015555283cf21 in poll () from /lib64/libc.so.6
#1  0x00001555473d8b89 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#2  0x000015554747fd7b in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#3  0x00001555473d3b98 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x1555209d6000 (LWP 188240)):
#0  0x000015555283cf21 in poll () from /lib64/libc.so.6
#1  0x00001555473d8b89 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#2  0x000015554747fd7b in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#3  0x00001555473d3b98 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x15553ec19000 (LWP 188226)):
#0  0x00001555528481b7 in epoll_wait () from /lib64/libc.so.6
#1  0x00001555459bb0a9 in ucs_event_set_wait () from /lib64/libucs.so.0
#2  0x00001555459aa13c in ucs_async_thread_func () from /lib64/libucs.so.0
#3  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#4  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x1555447e4000 (LWP 188219)):
#0  0x00001555528481b7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000155546440ba1 in epoll_dispatch (base=0x413c950, tv=<optimized out>) at epoll.c:407
#2  0x0000155546443d7d in opal_libevent2022_event_base_loop (base=0x413c950, flags=1) at event.c:1630
#3  0x000015554652fa6e in progress_engine () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libopen-pal.so.40
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x1555449e5000 (LWP 188216)):
#0  0x000015555283cf21 in poll () from /lib64/libc.so.6
#1  0x000015554644be6d in poll_dispatch (base=0x40eca50, tv=<optimized out>) at poll.c:165
#2  0x0000155546443d7d in opal_libevent2022_event_base_loop (base=0x40eca50, flags=1) at event.c:1630
#3  0x00001555463ee83e in progress_engine () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libopen-pal.so.40
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x155555516000 (LWP 188214)):
#0  0x0000155552b1c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00001555496c9016 in Realm::KernelCondVar::wait (this=0x8b9a920) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:1089
#2  0x00001555494929d3 in Realm::GenEventImpl::external_wait (this=0x8b9a770, gen_needed=2, poisoned=@0x7fffffff9a6f: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:1697
#3  0x000015554948c518 in Realm::Event::external_wait_faultaware (this=0x7fffffff9cd8, poisoned=@0x7fffffff9a6f: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:281
#4  0x000015554948c2bd in Realm::Event::external_wait (this=0x7fffffff9cd8) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:253
#5  0x000015554c9890dc in Legion::Internal::LegionHandshakeImpl::ext_wait_on_legion (this=0x44aad10) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/runtime.cc:6696
#6  0x000015554c537040 in Legion::LegionHandshake::ext_wait_on_legion (this=0x44aaae0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion.cc:2232
#7  0x00001555547a0f50 in Legion::MPILegionHandshake::mpi_wait_on_legion (this=0x44aaae0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion.h:4378
#8  0x000015555479e96c in S3DRank::complete_configuration (this=0x44aaa40) at s3d_rank_mpi.cc:358
#9  0x000015555479c8f4 in complete_legion_configure_ () at rhst_fortran.cc:164
#10 0x00000000005c9a90 in solve_driver (io=6) at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/solve_driver.f90:227
#11 0x00000000005c9382 in s3d () at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/main.f90:131
#12 0x0000000000404f4d in main (argc=<optimized out>, argv=<optimized out>) at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/main.f90:8
#13 0x000015555276f6a3 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000404f8e in _start () at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/main.f90:8

Thread 17 (Thread 0x155505fff000 (LWP 188303)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x155505ffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x155505ffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7af26e0, old_counter=2834) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7af2520, old_work_counter=2834) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7af2520, old_work_counter=2834) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af2520) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af2520) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af2520) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548dc054230) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x1548d5ff3000 (LWP 188302)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1548d5ff2af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1548d5ff2af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7af35d0, old_counter=1557) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7af3410, old_work_counter=1557) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7af3410, old_work_counter=1557) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548fe74adc0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x1548d6ff9000 (LWP 188301)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1548d6ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1548d6ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x00001555496c8f10 in Realm::FIFOCondVar::wait (this=0x1548d6fefd30) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:1029
#4  0x000015554960993b in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x7af3410, switch_to=0x1548fe74adc0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1469
#5  0x000015554960790f in Realm::ThreadedTaskScheduler::thread_blocking (this=0x7af3410, thread=0x1548ee50e960) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:963
#6  0x000015554949cde7 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x1548d6ff1357: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:218
#7  0x000015554948c15c in Realm::Event::wait_faultaware (this=0x1548d6ff1358, poisoned=@0x1548d6ff1357: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:242
#8  0x000015554c4a0c1d in Legion::Internal::LgEvent::wait_faultaware (this=0x1548d6ff1358, poisoned=@0x1548d6ff1357: false, from_app=true) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_types.h:3364
#9  0x000015554c4a0918 in Legion::Internal::ApEvent::wait_faultaware (this=0x1548d6ff1358, poisoned=@0x1548d6ff1357: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_types.h:2926
#10 0x000015554c52f99c in Legion::PhaseBarrier::wait (this=0x44b4a60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion.cc:715
#11 0x000015554c989156 in Legion::Internal::LegionHandshakeImpl::legion_wait_on_ext (this=0x44b4a30) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/runtime.cc:6718
#12 0x000015554c5370d4 in Legion::LegionHandshake::legion_wait_on_ext (this=0x44b4800) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion.cc:2252
#13 0x00001555547b75ce in Legion::MPILegionHandshake::legion_wait_on_mpi (this=0x44b4800) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion.h:4389
#14 0x00001555547b5212 in legion_wait_on_mpi () at s3d_rank_wrapper.cc:329
#15 0x00001555521af9a4 in $<AwaitMPITask> () from /lustre/scratch/vsyamaj/legion_s3d_test//build/hept/libregent_tasks.so
#16 0x00001555521ae5b9 in $__regent_task_AwaitMPITask_1_primary () from /lustre/scratch/vsyamaj/legion_s3d_test//build/hept/libregent_tasks.so
#17 0x000015554958ec4a in Realm::LocalTaskProcessor::execute_task (this=0x7af30a0, func_id=304, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/proc_impl.cc:1176
#18 0x0000155549605852 in Realm::Task::execute_on_processor (this=0x1548fe5cf7a0, p=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:326
#19 0x000015554960977e in Realm::KernelThreadTaskScheduler::execute_task (this=0x7af3410, task=0x1548fe5cf7a0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1421
#20 0x00001555496085fd in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1160
#21 0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#22 0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#23 0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548ee50e960) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#24 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#25 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x1548c2ff9000 (LWP 188300)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1548c2ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1548c2ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x5137910, old_counter=2856) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x5137750, old_work_counter=2856) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x5137750, old_work_counter=2856) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x5137750) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x5137750) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x5137750) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548bc009d70) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x1548c3fff000 (LWP 188299)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1548c3ffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1548c3ffeaf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7af2e40, old_counter=2848) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7af2c80, old_work_counter=2848) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7af2c80, old_work_counter=2848) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af2c80) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af2c80) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af2c80) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548c80097f0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x1548e6ff9000 (LWP 188287)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x1548e6ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x1548e6ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x00001555496c8f10 in Realm::FIFOCondVar::wait (this=0x1548e6ff2f00) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:1029
#4  0x000015554960993b in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x7af2520, switch_to=0x1548dc054230) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1469
#5  0x000015554960790f in Realm::ThreadedTaskScheduler::thread_blocking (this=0x7af2520, thread=0x1548fe737fd0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:963
#6  0x000015554949cde7 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x1548e6ff449f: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:218
#7  0x000015554948c15c in Realm::Event::wait_faultaware (this=0x1548e6ff4768, poisoned=@0x1548e6ff449f: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:242
#8  0x000015554948bd69 in Realm::Event::wait (this=0x1548e6ff4768) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:194
#9  0x000015554e267dde in Legion::Internal::LgEvent::wait (this=0x1548e6ff4768) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_types.h:3283
#10 0x000015554c7c7248 in Legion::Internal::TraceReplayOp::trigger_dependence_analysis (this=0x1548fabbfed0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_trace.cc:1219
#11 0x000015554c584c6c in Legion::Internal::Operation::execute_dependence_analysis (this=0x1548fabbfed0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_ops.cc:1631
#12 0x000015554c45118a in Legion::Internal::InnerContext::process_dependence_stage (this=0x1548ee506d60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_context.cc:8448
#13 0x000015554c460768 in Legion::Internal::InnerContext::handle_dependence_stage (args=0x1548dc41e930) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_context.cc:12305
#14 0x000015554c9e806b in Legion::Internal::Runtime::legion_runtime_task (args=0x1548dc41e930, arglen=12, userdata=0xb0b3720, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/runtime.cc:32209
#15 0x000015554958ec4a in Realm::LocalTaskProcessor::execute_task (this=0x7af21d0, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/proc_impl.cc:1176
#16 0x0000155549605852 in Realm::Task::execute_on_processor (this=0x1548dc41e7b0, p=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:326
#17 0x000015554960977e in Realm::KernelThreadTaskScheduler::execute_task (this=0x7af2520, task=0x1548dc41e7b0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1421
#18 0x00001555496085fd in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af2520) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1160
#19 0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af2520) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#20 0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af2520) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#21 0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548fe737fd0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#22 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#23 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x154907ff3000 (LWP 188270)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x154907ff2af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x154907ff2af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x00001555496c8f10 in Realm::FIFOCondVar::wait (this=0x154907fe9d30) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:1029
#4  0x000015554960993b in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x7af3410, switch_to=0x1548ee50e960) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1469
#5  0x000015554960790f in Realm::ThreadedTaskScheduler::thread_blocking (this=0x7af3410, thread=0x1548fa505ce0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:963
#6  0x000015554949cde7 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x154907feb357: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:218
#7  0x000015554948c15c in Realm::Event::wait_faultaware (this=0x154907feb358, poisoned=@0x154907feb357: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:242
#8  0x000015554c4a0c1d in Legion::Internal::LgEvent::wait_faultaware (this=0x154907feb358, poisoned=@0x154907feb357: false, from_app=true) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_types.h:3364
#9  0x000015554c4a0918 in Legion::Internal::ApEvent::wait_faultaware (this=0x154907feb358, poisoned=@0x154907feb357: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_types.h:2926
#10 0x000015554c52f99c in Legion::PhaseBarrier::wait (this=0x44b4a60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion.cc:715
#11 0x000015554c989156 in Legion::Internal::LegionHandshakeImpl::legion_wait_on_ext (this=0x44b4a30) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/runtime.cc:6718
#12 0x000015554c5370d4 in Legion::LegionHandshake::legion_wait_on_ext (this=0x44b4800) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion.cc:2252
#13 0x00001555547b75ce in Legion::MPILegionHandshake::legion_wait_on_mpi (this=0x44b4800) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion.h:4389
#14 0x00001555547b5212 in legion_wait_on_mpi () at s3d_rank_wrapper.cc:329
#15 0x00001555521af9a4 in $<AwaitMPITask> () from /lustre/scratch/vsyamaj/legion_s3d_test//build/hept/libregent_tasks.so
#16 0x00001555521ae5b9 in $__regent_task_AwaitMPITask_1_primary () from /lustre/scratch/vsyamaj/legion_s3d_test//build/hept/libregent_tasks.so
#17 0x000015554958ec4a in Realm::LocalTaskProcessor::execute_task (this=0x7af30a0, func_id=304, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/proc_impl.cc:1176
#18 0x0000155549605852 in Realm::Task::execute_on_processor (this=0x1548fe6720d0, p=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:326
#19 0x000015554960977e in Realm::KernelThreadTaskScheduler::execute_task (this=0x7af3410, task=0x1548fe6720d0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1421
#20 0x00001555496085fd in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1160
#21 0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#22 0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#23 0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x1548fa505ce0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#24 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#25 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x15550ad76000 (LWP 188264)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x15550ad75af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x15550ad75af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549606a17 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x7af3c10, old_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:719
#4  0x0000155549608d0d in Realm::ThreadedTaskScheduler::wait_for_work (this=0x7af3a50, old_work_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1294
#5  0x0000155549609bb3 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x7af3a50, old_work_counter=1) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1528
#6  0x0000155549608b60 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af3a50) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1260
#7  0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af3a50) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#8  0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af3a50) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#9  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0xb0b2d00) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#10 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#11 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x154908ff9000 (LWP 188263)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x154908ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x154908ff8af0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x00001555496c8f10 in Realm::FIFOCondVar::wait (this=0x154908fc5880) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:1029
#4  0x000015554960993b in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x7af3410, switch_to=0x1548fa505ce0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1469
#5  0x0000155549607885 in Realm::ThreadedTaskScheduler::thread_blocking (this=0x7af3410, thread=0xb100a40) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:951
#6  0x000015554949cde7 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x154908fc6e1f: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:218
#7  0x000015554948c15c in Realm::Event::wait_faultaware (this=0x154908fc70e0, poisoned=@0x154908fc6e1f: false) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:242
#8  0x000015554948bd69 in Realm::Event::wait (this=0x154908fc70e0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/event_impl.cc:194
#9  0x000015554e267dde in Legion::Internal::LgEvent::wait (this=0x154908fc70e0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_types.h:3283
#10 0x000015554c457541 in Legion::Internal::InnerContext::is_replaying_physical_trace (this=0x1548ee506d60) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_context.cc:10223
#11 0x000015554c4537e8 in Legion::Internal::InnerContext::register_new_child_operation (this=0x1548ee506d60, op=0x1548fae387d0, resolved=..., dependences=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_context.cc:9129
#12 0x000015554c5847fa in Legion::Internal::Operation::initialize_operation (this=0x1548fae387d0, ctx=0x1548ee506d60, track=true, regs=5, prov=0x0, dependences=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_ops.cc:1533
#13 0x000015554c58d7d6 in Legion::Internal::PredicatedOp::initialize_predication (this=0x1548fae387d0, ctx=0x1548ee506d60, track=true, regions=5, dependences=0x0, p=..., provenance=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_ops.cc:4487
#14 0x000015554c76255c in Legion::Internal::TaskOp::initialize_base_task (this=0x1548fae38620, ctx=0x1548ee506d60, track=true, dependences=0x0, p=..., tid=9, prov=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_tasks.cc:654
#15 0x000015554c7864dd in Legion::Internal::IndexTask::initialize_task (this=0x1548fae38620, ctx=0x1548ee506d60, launcher=..., launch_sp=..., provenance=0x0, track=true, outputs=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_tasks.cc:8944
#16 0x000015554c4493bf in Legion::Internal::InnerContext::execute_index_space (this=0x1548ee506d60, launcher=..., outputs=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion_context.cc:6911
#17 0x000015554c9bbfd7 in Legion::Internal::Runtime::execute_index_space (this=0x8b8d020, ctx=0x1548ee506d60, launcher=..., outputs=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/runtime.cc:18855
#18 0x000015554c54705e in Legion::Runtime::execute_index_space (this=0x8ba2a00, ctx=0x1548ee506d60, launcher=..., outputs=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/legion/legion.cc:6320
#19 0x00001555547b8798 in S3DTask<CalcStencilTask, 3>::launch (this=0x154908fca4f0, ctx=0x1548ee506d60, runtime=0x8ba2a00) at s3d_task.h:241
#20 0x00001555547b65eb in launch_stencil1 (ctx_=..., runtime_=..., launch_space_=..., lr_=..., lp_rank_=..., src_field=0x154908fcc650, dst_field=0x154908fcc654, num_fields=1, dim=1, shift=2, lr_scale_=..., lp_scale_rank_=..., scale_field=1070) at s3d_rank_wrapper.cc:587
#21 0x0000155552254299 in $<main> () from /lustre/scratch/vsyamaj/legion_s3d_test//build/hept/libregent_tasks.so
#22 0x0000155552214d3d in $__regent_task_main_primary () from /lustre/scratch/vsyamaj/legion_s3d_test//build/hept/libregent_tasks.so
#23 0x000015554958ec4a in Realm::LocalTaskProcessor::execute_task (this=0x7af30a0, func_id=346, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/proc_impl.cc:1176
#24 0x0000155549605852 in Realm::Task::execute_on_processor (this=0x1548ee509ee0, p=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:326
#25 0x000015554960977e in Realm::KernelThreadTaskScheduler::execute_task (this=0x7af3410, task=0x1548ee509ee0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1421
#26 0x00001555496085fd in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1160
#27 0x0000155549608c20 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/tasks.cc:1272
#28 0x000015554960fd5c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x7af3410) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#29 0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0xb100a40) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#30 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#31 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x15550ae7c000 (LWP 188251)):
#0  0x00001555528426ed in syscall () from /lib64/libc.so.6
#1  0x00001555496c7ab5 in Realm::Doorbell::wait_slow (this=0x15550ae7baf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.cc:265
#2  0x0000155549488e80 in Realm::Doorbell::wait (this=0x15550ae7baf0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/mutex.inl:81
#3  0x0000155549486955 in Realm::BackgroundWorkThread::main_loop (this=0x7af1240) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/bgwork.cc:144
#4  0x0000155549489e7a in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x7af1240) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#5  0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x7af12e0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#6  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#7  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x155520117000 (LWP 188248)):
#0  0x000015553f3648b2 in mlx5_poll_cq_v1 () from /usr/lib64/libmlx5.so.1
#1  0x0000155549c3f0b2 in gasnetc_poll_rcv_hca () from /lustre/scratch/vsyamaj/legion_s3d_test//legion/language/build/lib/librealm.so.1
#2  0x0000155549c3f98c in gasnetc_poll_rcv_all () from /lustre/scratch/vsyamaj/legion_s3d_test//legion/language/build/lib/librealm.so.1
#3  0x0000155549c3fbd4 in gasnetc_do_poll () from /lustre/scratch/vsyamaj/legion_s3d_test//legion/language/build/lib/librealm.so.1
#4  0x0000155549c313d3 in gasnetc_AMPoll () from /lustre/scratch/vsyamaj/legion_s3d_test//legion/language/build/lib/librealm.so.1
#5  0x000015554969877c in _gasneti_AMPoll () at /lustre/scratch/vsyamaj/legion_s3d_test/gasnet/release/include/gasnet_help.h:1298
#6  _gasnet_AMPoll () at /lustre/scratch/vsyamaj/legion_s3d_test/gasnet/release/include/gasnet_help.h:1511
#7  Realm::GASNetEXPoller::do_work (this=0x45af340, work_until=...) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2903
#8  0x0000155549488aa8 in Realm::BackgroundWorkManager::Worker::do_work (this=0x1555201129d0, max_time_in_ns=-1, interrupt_flag=0x0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/bgwork.cc:599
#9  0x00001555494867dd in Realm::BackgroundWorkThread::main_loop (this=0x7af13f0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/bgwork.cc:103
#10 0x0000155549489e7a in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x7af13f0) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.inl:97
#11 0x000015554961c0e3 in Realm::KernelThread::pthread_entry (data=0x7af1490) at /lustre/scratch/vsyamaj/legion_s3d_test/legion/runtime/realm/threads.cc:831
#12 0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#13 0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x15550b07d000 (LWP 188241)):
#0  0x000015555283cf21 in poll () from /lib64/libc.so.6
#1  0x00001555473d8b89 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#2  0x000015554747fd7b in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#3  0x00001555473d3b98 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x155520318000 (LWP 188237)):
#0  0x000015555283cf21 in poll () from /lib64/libc.so.6
#1  0x00001555473d8b89 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#2  0x000015554747fd7b in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#3  0x00001555473d3b98 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x15553ec19000 (LWP 188227)):
#0  0x00001555528481b7 in epoll_wait () from /lib64/libc.so.6
#1  0x00001555459bb0a9 in ucs_event_set_wait () from /lib64/libucs.so.0
#2  0x00001555459aa13c in ucs_async_thread_func () from /lib64/libucs.so.0
#3  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#4  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x1555447e4000 (LWP 188218)):
#0  0x00001555528481b7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000155546440ba1 in epoll_dispatch (base=0x413c950, tv=<optimized out>) at epoll.c:407
#2  0x0000155546443d7d in opal_libevent2022_event_base_loop (base=0x413c950, flags=1) at event.c:1630
#3  0x000015554652fa6e in progress_engine () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libopen-pal.so.40
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x1555449e5000 (LWP 188217)):
#0  0x000015555283cf21 in poll () from /lib64/libc.so.6
#1  0x000015554644be6d in poll_dispatch (base=0x40eca50, tv=<optimized out>) at poll.c:165
#2  0x0000155546443d7d in opal_libevent2022_event_base_loop (base=0x40eca50, flags=1) at event.c:1630
#3  0x00001555463ee83e in progress_engine () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libopen-pal.so.40
#4  0x0000155552b162de in start_thread () from /lib64/libpthread.so.0
#5  0x0000155552847e83 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x155555516000 (LWP 188215)):
#0  0x00001555463e8dd0 in opal_progress () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libopen-pal.so.40
#1  0x0000155553a17545 in ompi_request_default_wait () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libmpi.so.40
#2  0x0000155553a8200c in ompi_coll_base_barrier_intra_two_procs () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libmpi.so.40
#3  0x0000155553a2ea08 in PMPI_Barrier () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libmpi.so.40
#4  0x0000155553e7b503 in pmpi_barrier__ () from /shared/openmpi-4.0.5/gcc-9.2.0/lib/libmpi_mpifh.so.40
#5  0x00000000005c9ad4 in solve_driver (io=6) at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/solve_driver.f90:277
#6  0x00000000005c9382 in s3d () at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/main.f90:131
#7  0x0000000000404f4d in main (argc=<optimized out>, argv=<optimized out>) at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/main.f90:8
#8  0x000015555276f6a3 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000404f8e in _start () at /lustre/scratch/vsyamaj/legion_s3d_test2/s3d/source/drivers/main.f90:8

I'm using this commit:

commit 5c8b346011fa36cc9d7305502aca4474568e7309 (HEAD -> control_replication, origin/control_replication)
Merge: 742542c36 2832cae66
Author: Mike <mebauer@cs.stanford.edu>
Date:   Mon Jan 22 01:18:37 2024 -0800

    Merge branch 'master' into control_replication

lightsighter commented 10 months ago

Make me a reproducer on sapling with hanging processes.

lightsighter commented 10 months ago

Please do this today, you're now blocking the control replication merge to master.

syamajala commented 10 months ago

I'm building on sapling right now. Having some issues though.

lightsighter commented 10 months ago

What issues?

syamajala commented 10 months ago

Missing symbols when I run: /scratch2/seshu/legion_s3d_nscbc/Ammonia_Cases/pwave_x_1_hept/s3d.x: symbol lookup error: /scratch2/seshu/legion_s3d_nscbc//build/hept/libsum_tasks.so: undefined symbol: hijackCudaRegisterFatBinary

lightsighter commented 10 months ago

Make sure you rebuild your full S3D. That symbol no longer exists in the Realm CUDA hijack.

syamajala commented 10 months ago

This was a fresh checkout.

lightsighter commented 10 months ago

That suggests that you are building against one version of Legion and then dynamically loading a different one at runtime.

syamajala commented 10 months ago

I just built without cuda instead. There are processes on c0001: 481927 and 481928.

elliottslaughter commented 10 months ago

For what it's worth, the hijack bits moved into the Regent bindings: https://gitlab.com/StanfordLegion/legion/-/blob/master/bindings/regent/regent_cudart_hijack.cc

lightsighter commented 10 months ago

This looks like you haven't updated the mapper to use the new interface for replicating tasks because the top-level task is not control replicated. You started just one top-level task on node 0 and as a result there is no shard on node 1 to synchronize with MPI.
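
As a rough illustration of the diagnosis above, the sketch below models which ranks are left without a shard of the top-level task. The function name `ranks_without_shard` and its inputs are hypothetical and are not part of Legion; the point is only that a rank with no shard has nothing on the Legion side to complete the MPI handshake with.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model (not Legion API): given the number of MPI ranks and the
// set of ranks that host a shard of the top-level task, return the ranks that
// have no shard. Such a rank would wait forever in its MPI-side handshake.
std::vector<int> ranks_without_shard(int num_ranks,
                                     const std::set<int> &shard_ranks) {
  std::vector<int> missing;
  for (int rank = 0; rank < num_ranks; ++rank) {
    if (shard_ranks.count(rank) == 0) {
      missing.push_back(rank);
    }
  }
  return missing;
}
```

With two ranks and a single unreplicated top-level task on rank 0, the function reports rank 1 as missing, which matches the observed freeze. With one rank nothing is missing, which is consistent with the single-rank run working.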

syamajala commented 10 months ago

I implemented replicate_task in the mapper, are more changes needed?

syamajala commented 10 months ago

I can see select_task_options is marking the task as replicable, but it doesn't look like replicate_task is ever getting called?

lightsighter commented 10 months ago

Where is the implementation of your mapper?

syamajala commented 10 months ago

I have not checked them in yet, so you will have to look at them on sapling here: /scratch2/seshu/legion_s3d_no_cuda/rhst/rhst_mapper.cc

I mostly just moved the parts that seemed relevant from map_replicate_task to replicate_task.

lightsighter commented 10 months ago

It looks like you're not calling Runtime::set_top_level_task_mapper_id to ensure your mapper gets called for mapping the top-level task.

syamajala commented 10 months ago

How are IDs associated with mappers? I'm calling replace_default_mapper, and I only see that add_mapper takes both an ID and a mapper.

lightsighter commented 10 months ago

That should be fine. How do I run your code?
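
A minimal sketch of why replacing the default mapper can be enough, assuming the default mapper is registered under ID 0 and the top-level task uses mapper ID 0 unless told otherwise. `MapperRegistrySketch` is a hypothetical stand-in that borrows the method names mentioned in this thread; it is not the Legion runtime, and mappers are represented by plain strings.

```cpp
#include <cassert>
#include <map>
#include <string>

using MapperID = unsigned;

// Stand-in registry (NOT the Legion runtime). Assumes the default mapper
// lives at ID 0 and that the top-level task is mapped by ID 0 by default.
struct MapperRegistrySketch {
  std::map<MapperID, std::string> mappers{{0u, "default"}};
  MapperID top_level_mapper_id = 0;

  // Swap the mapper stored under the default ID; no new ID is needed.
  void replace_default_mapper(const std::string &name) { mappers[0] = name; }

  // Register an additional mapper under an explicit ID.
  void add_mapper(MapperID id, const std::string &name) { mappers[id] = name; }

  // Choose which registered mapper handles the top-level task.
  void set_top_level_task_mapper_id(MapperID id) { top_level_mapper_id = id; }

  // Name of the mapper that would be asked to map the top-level task.
  const std::string &top_level_mapper() const {
    return mappers.at(top_level_mapper_id);
  }
};
```

Under these assumptions, a mapper installed with `replace_default_mapper` is picked up for the top-level task automatically, whereas one installed with `add_mapper` under a different ID would also need `set_top_level_task_mapper_id`.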

syamajala commented 10 months ago

cd /scratch2/seshu/legion_s3d_no_cuda/Ammonia_Cases
salloc -N 1 -p cpu --exclusive
./ammonia_job.sh

lightsighter commented 10 months ago

You're setting options.map_locally = true, which disables replication because you're not allowed to map the shards of a replicated task locally. You would be seeing a warning about this if you weren't suppressing warnings from Legion. I recommend that you stop suppressing warnings from Legion so you can actually see messages like this.
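
The interaction described above can be sketched as follows. `TaskOptionsSketch` is a stand-in carrying only the two flags discussed here; it is not Legion's real task options struct, and both helper functions are hypothetical. The sketch only models the rule stated in this comment, not the runtime's actual code path.

```cpp
#include <cassert>
#include <iostream>

// Stand-in for the two relevant mapper options (NOT the real Legion struct).
struct TaskOptionsSketch {
  bool map_locally = false;
  bool replicate = false;
};

// Hypothetical check: a replication request only takes effect if the task is
// not also asked to map locally. A warning is printed when the two conflict.
bool replication_takes_effect(const TaskOptionsSketch &options) {
  if (options.replicate && options.map_locally) {
    std::cerr << "warning: map_locally=true disables replication\n";
    return false;
  }
  return options.replicate;
}

// Hypothetical mapper-side fix: when requesting replication for the
// top-level task, make sure map_locally is cleared.
void select_task_options_sketch(bool is_top_level,
                                TaskOptionsSketch &options) {
  options.replicate = is_top_level;
  if (options.replicate) {
    options.map_locally = false;
  }
}
```

Setting both flags leaves the task unreplicated in this model, which corresponds to replicate_task never being called. Clearing `map_locally` for the replicated task restores the expected behavior.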

syamajala commented 10 months ago

Ok. After fixing that issue it looks like it's working. Thanks!