Closed: syamajala closed 1 year ago
What is the did of the DistributedCollectable in frame 8 of thread 11? Also, what commit of shardrefine are you on?
To be very clear: this is not a hang, it is a livelock, and your stack traces should continue to change.
The commit is:
commit c4ff5e0d1bb1e01b1b481bb934b4a8b15d36513e (HEAD -> shardrefine, origin/shardrefine)
Author: Mike Bauer <mike@lightsighter.org>
Date: Sat Aug 19 18:07:19 2023 -0700
legion: fixes for logical analysis of refinements
Running it again, the stack traces do appear to be changing. I don't see any threads with a DistributedCollectable when I run it again.
I can't seem to run S3D on sapling right now; I see processes dying at startup every time I run, and then the node goes into a drained state in slurm and I have to reboot to run again. This problem has been intermittent on sapling.
I was able to get it to run on sapling. It only starts to appear at 8 ranks.
There are some processes here on c0001: 11846, 11847, 11848, 11849, 11850, 11851, 11852, 11853.
This is not hanging the same way that the backtraces above are. What is the output of running with -level shutdown=2?
It looks like rank 0 shuts down but the others don't? Here are the last 20 lines from each log:
==> run_0.log <==
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[0 - 7f2737b8cc40] 19.244778 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 21008, provenance: launch.rg:143) in parent task main (UID 24) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (272,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[0 - 7f2737b8cc40] 78.553474 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 7f2737b8cc40] 78.557566 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 7f2737b8cc40] 78.557580 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 7f2737b8cc40] 78.558263 {2}{shutdown}: SHUTDOWN PHASE 2 SUCCESS!
[0 - 7f2737b8cc40] 78.605108 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.058921 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40] 79.309960 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.318945 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40] 79.319043 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.319764 {2}{shutdown}: SHUTDOWN PHASE 3 SUCCESS!
[0 - 7f2737b8cc40] 79.319776 {2}{shutdown}: Received notification on node 0 for phase 4
[0 - 7f2737b8cc40] 79.321480 {2}{shutdown}: SHUTDOWN PHASE 4 SUCCESS!
[0 - 7f2737b8cc40] 79.321491 {2}{shutdown}: SHUTDOWN SUCCEEDED!
==> run_1.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[1 - 7fc1ea841c40] 21.735396 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 22793, provenance: launch.rg:143) in parent task main (UID 1) is using uninitialized data for field(s) 140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156 of logical region (265,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[1 - 7fc1ea841c40] 78.554181 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 7fc1ea841c40] 78.556380 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 7fc1ea841c40] 78.603794 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 78.665194 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.666287 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.667397 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.668509 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.669613 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.670706 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.671813 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.672910 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 79.315712 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 79.317701 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 79.318430 {2}{shutdown}: Received notification on node 1 for phase 4
==> run_2.log <==
[2 - 7f3fda49ac40] 19.264086 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20954, provenance: launch.rg:143) in parent task main (UID 2) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (226,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[2 - 7f3fda49ac40] 78.553070 {2}{shutdown}: Received notification on node 2 for phase 1
[2 - 7f3fda49ac40] 78.555253 {2}{shutdown}: Received notification on node 2 for phase 2
[2 - 7f3fda49ac40] 78.602783 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 78.646867 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.647954 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.649048 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.650144 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.651251 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.652348 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.653438 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.654536 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 79.307822 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 79.310020 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 79.316688 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 79.317413 {2}{shutdown}: Received notification on node 2 for phase 4
==> run_3.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[3 - 7f010f3dfc40] 19.262353 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20963, provenance: launch.rg:143) in parent task main (UID 3) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (227,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[3 - 7f010f3dfc40] 78.555201 {2}{shutdown}: Received notification on node 3 for phase 1
[3 - 7f010f3dfc40] 78.557366 {2}{shutdown}: Received notification on node 3 for phase 2
[3 - 7f010f3dfc40] 78.604905 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 78.651860 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.652946 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.654041 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.655147 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.656237 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.657323 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.658416 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.659510 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 79.309802 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 79.318817 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 79.319548 {2}{shutdown}: Received notification on node 3 for phase 4
==> run_4.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[4 - 7f62a7988c40] 19.263172 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20900, provenance: launch.rg:143) in parent task main (UID 4) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (228,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[4 - 7f62a7988c40] 78.554426 {2}{shutdown}: Received notification on node 4 for phase 1
[4 - 7f62a7988c40] 78.556610 {2}{shutdown}: Received notification on node 4 for phase 2
[4 - 7f62a7988c40] 78.604149 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 78.648378 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.649463 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.650560 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.651630 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.652726 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.653818 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.654918 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.656023 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 79.309021 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 79.318052 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 79.318783 {2}{shutdown}: Received notification on node 4 for phase 4
==> run_5.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[5 - 7f349e602c40] 19.255487 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20933, provenance: launch.rg:143) in parent task main (UID 5) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (229,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[5 - 7f349e602c40] 78.554366 {2}{shutdown}: Received notification on node 5 for phase 1
[5 - 7f349e602c40] 78.556555 {2}{shutdown}: Received notification on node 5 for phase 2
[5 - 7f349e602c40] 78.604089 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 78.648025 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.649135 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.650239 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.651332 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.652416 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.653509 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.654599 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.655686 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 79.308970 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 79.317994 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 79.318728 {2}{shutdown}: Received notification on node 5 for phase 4
==> run_6.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[6 - 7fe783654c40] 19.282687 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20966, provenance: launch.rg:143) in parent task main (UID 6) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (230,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[6 - 7fe783654c40] 78.554743 {2}{shutdown}: Received notification on node 6 for phase 1
[6 - 7fe783654c40] 78.556919 {2}{shutdown}: Received notification on node 6 for phase 2
[6 - 7fe783654c40] 78.604456 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 78.648362 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.649463 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.650548 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.651650 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.652738 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.653835 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.654937 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.656037 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 79.309339 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 79.318367 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 79.319098 {2}{shutdown}: Received notification on node 6 for phase 4
==> run_7.log <==
[7 - 7f0d18789c40] 19.269394 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20943, provenance: launch.rg:143) in parent task main (UID 7) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (231,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[7 - 7f0d18789c40] 78.553494 {2}{shutdown}: Received notification on node 7 for phase 1
[7 - 7f0d18789c40] 78.555683 {2}{shutdown}: Received notification on node 7 for phase 2
[7 - 7f0d18789c40] 78.603223 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 78.647859 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.648946 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.650048 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.651143 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.652244 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.653349 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.654432 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.655527 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 79.308241 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 79.309373 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 79.317121 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 79.317851 {2}{shutdown}: Received notification on node 7 for phase 4
FWIW, this is really strange: it looks like the shutdown process finished in Legion and something is just not shutting down afterwards, but we've at least called Realm shutdown at this point. This is definitely very different from the other shutdown "hang" that is referenced at the beginning of the issue.
Do we have backtraces for this new form of hang?
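For intuition about the repeated "FAILED SHUTDOWN PHASE n! Trying again..." lines in the logs above, here is a toy sketch (a hypothetical simplification, not Legion's actual protocol) of a phased shutdown in which any node with a pending message fails the current phase and forces a retry:

```python
# Toy model of a retrying, phased shutdown. A phase succeeds only when no
# node reports a pending message; otherwise the phase is retried. If some
# node keeps generating new messages, the retries never end, which is a
# livelock rather than a deadlock: every thread stays busy the whole time.
def phased_shutdown(num_phases, num_nodes, pending):
    """pending(node, attempt) -> True if that node has messages in flight."""
    log, attempts, phase = [], 0, 1
    while phase <= num_phases:
        attempts += 1
        if any(pending(node, attempts) for node in range(num_nodes)):
            log.append(f"FAILED SHUTDOWN PHASE {phase}! Trying again...")
        else:
            log.append(f"SHUTDOWN PHASE {phase} SUCCESS!")
            phase += 1
    return log

# Node 2 has messages in flight during the first two attempts:
print(phased_shutdown(2, 4, lambda node, attempt: node == 2 and attempt <= 2))
```

In this toy model the retries eventually succeed once the pending messages drain; the livelock case is when they never do.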
It could be that we are seeing two different issues, sapling vs blaze. The original stack traces above were from blaze and everything since then has been on sapling.
@lightsighter to run it yourself do:
salloc -N 1 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh
I will try -level shutdown=2 on blaze and see what that looks like.
On blaze I'm seeing a lot of stuff like this.
run_0.log:
[0 - 15550859ec80] 60.375449 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.428166 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.428358 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.446234 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.446248 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.452600 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.452736 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.452792 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.452821 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.470200 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.470388 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.486770 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80] 60.486781 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.494913 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.495045 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.495094 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.495120 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.508011 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.508106 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.522047 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.522057 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.528060 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.529228 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.529278 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80] 60.529305 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.539523 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.539537 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.545758 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.545877 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.545925 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80] 60.545943 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.560360 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.560370 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80] 60.565228 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.566399 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.566448 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085b6c80] 60.566473 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.580877 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.580888 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.585870 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.585983 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.586031 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.586055 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.601841 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085aac80] 60.601921 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.616143 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.616153 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80] 60.621079 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.621206 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.621255 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.621275 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.635582 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80] 60.635593 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.641514 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.641642 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.641692 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
...
run_1.log:
[1 - 15550859ec80] 60.401550 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.434349 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.446780 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.450632 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.450682 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.451850 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.455888 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.473159 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.487330 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80] 60.491942 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.493038 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.494186 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.497934 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 15550859ec80] 60.510917 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.522602 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.527231 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.528330 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.528419 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.531905 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.540241 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80] 60.544910 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.546011 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.546103 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.548511 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.560907 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.564389 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.565490 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.565585 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.569012 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.581421 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80] 60.585062 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.586136 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.586233 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.588593 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.604447 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.616687 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80] 60.620228 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.621329 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.621453 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.623816 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.636127 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.639635 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.640734 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.640816 {2}{shutdown}: Outstanding message on node 1
...
Do we have backtraces for this new form of hang?
I looked at a hanging run on sapling and there were no interesting backtraces. The main thread in each process was just blocked waiting on Realm::wait_for_shutdown. I'll try poking at it again.
On blaze I'm seeing a lot of stuff like this.
That is consistent with the backtraces at the beginning of this issue; those are the ones where we need to figure out what kind of distributed collectable is not being collected, using the instructions I gave above.
Heres what I see:
>>> where
#0 Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b0684d420, owner=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1 0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b0684d420, owner=12, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2 0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3 0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4 0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1aa612e0, num_messages=1, runtime=0xce14cb0, remote_address_space=12, args=0x154aa18e46e0 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5 0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1aa612e0, args=0x154aa18e46c4, arglen=52, runtime=0xce14cb0, remote_address_space=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6 0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1a96d300, args=0x154aa18e46c0, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7 0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154aa18e46bc, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8 0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154aa18e46b0, arglen=68, userdata=0xce2e710, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9 0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd24a390, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154aa18e4190, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0x4fe3e50, task=0x154aa18e4190) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$3 = 216172782113786540
>>> where
#0 Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b067f8db0, owner=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1 0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b067f8db0, owner=4, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2 0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3 0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4 0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1a7f7290, num_messages=1, runtime=0xce14cb0, remote_address_space=4, args=0x154a9ddd5e90 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5 0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1a7f7290, args=0x154a9ddd5e74, arglen=52, runtime=0xce14cb0, remote_address_space=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6 0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1827c9f0, args=0x154a9ddd5e70, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7 0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154a9ddd5e6c, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8 0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154a9ddd5e60, arglen=68, userdata=0xce2e490, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9 0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd249fa0, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154a9ddd5940, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0xae35bc0, task=0x154a9ddd5940) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$4 = 216172782113786612
The hang on sapling has to do with profiling; it exists in the master and control replication branches and doesn't have anything to do with shardrefine.
I will need to investigate why index partition distributed collectables are not being collected.
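As background for the counters inspected later in this thread (total_sent_references and total_received_references), the downgrade check is essentially distributed reference counting. Here is a toy sketch (hypothetical simplification, not Legion's implementation; NodeState and can_downgrade are invented names) of the invariant: a collectable can only be downgraded once the references sent between nodes balance the references received, so a node that keeps manufacturing new references stalls the downgrade indefinitely.

```python
# Toy sketch of distributed reference counting for a collectable: the
# owner may downgrade only when, summed over all nodes, every reference
# that was sent has also been received (none are still in flight).
from dataclasses import dataclass

@dataclass
class NodeState:
    total_sent_references: int
    total_received_references: int

def can_downgrade(nodes):
    sent = sum(n.total_sent_references for n in nodes)
    received = sum(n.total_received_references for n in nodes)
    return sent == received

print(can_downgrade([NodeState(3, 3), NodeState(2, 2)]))  # balanced
print(can_downgrade([NodeState(3, 2), NodeState(2, 2)]))  # reference in flight
```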
I pushed a fix for the hang on sapling.
Please pull and try the latest shardrefine on blaze. If it is still live-locking in the same way then break at legion_replication.cc:1008 on any node and print out the did of what you hit, compute (did & 0xfff) % NUMBER_OF_NODES, go to that node, break on garbage_collection.cc:1188 conditioned on the did being the same as the one you had before, and when you hit it print out current_state, total_sent_references, and total_received_references.
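A quick sketch of that owner-node arithmetic, applied to the two dids printed from gdb earlier in this thread (reading the expression as (did & 0xfff) % NUMBER_OF_NODES; that the low 12 bits encode the owner is inferred from the expression, not confirmed elsewhere):

```python
# Maps a DistributedID to the node given by (did & 0xfff) % num_nodes,
# per the expression above; the low-12-bit mask comes from that expression.
def owner_node(did, num_nodes):
    return (did & 0xFFF) % num_nodes

# The two dids printed from gdb above, on an 8-node run:
for did in (216172782113786540, 216172782113786612):
    print(hex(did), "->", owner_node(did, 8))  # both map to node 4
```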
It is shutting down on sapling now, but not on blaze.
On blaze I'm still seeing the live-lock but I never hit the conditioned breakpoint on the second node. I was able to reduce the problem to 8 nodes.
Computing (did & 0xfff) % 8 I see 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7, but then none of the nodes 4, 5, 6, 7 ever hit the conditional garbage_collection.cc:1188 breakpoint, or a breakpoint I set on garbage_collection.cc:1008, until I continue nodes 0, 1, 2, 3.
When you break on legion_replication.cc:1008, instead try printing downgrade_owner and then go to that node and set a conditional breakpoint on legion_replication.cc:1188 with the did.
I was able to reproduce on sapling. Unfortunately the smallest size was 16 ranks on 2 nodes.
There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.
I can't seem to run 16 ranks on 1 node, we end up with processes in the D state on startup, the node gets drained in slurm, and then I have to reboot.
If you want to run it yourself do the following:
salloc -N 2 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh
I will have to kill my slurm job first in order for you to run it.
There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.
The processes seem to be gone and it looks like your job is over.
I can't seem to run 16 ranks on 1 node, we end up with processes in the D state on startup, the node gets drained in slurm, and then I have to reboot.
That needs to be reported to action@cs. It's a failure of NFS.
Pull and try again with the most recent shardrefine.
It works on 16 nodes on blaze and 24 nodes on perlmutter. I'd like to try to see if we can scale to the full machine.
I'm seeing a shutdown hang with shardrefine. Here are some stack traces: