StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

cannonMM-cuda hang on Lassen 8 nodes #1377

Open Anjiang-Wei opened 1 year ago

Anjiang-Wei commented 1 year ago

The application is cannonMM-cuda, a GPU-based GEMM algorithm from the DISTAL paper. It hangs when running on 8 nodes on Lassen.

The backtraces can be accessed both on Sapling (/scratch2/anjiang/exp/taco/hanglog) and at https://github.com/Anjiang-Wei/taco/tree/distal-pldi-2022/hanglog

The application code is quite simple: https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/legion/cannonMM/main.cpp and https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/legion/cannonMM/taco-generated.cu More specifically, the binary is invoked with the TacoMapper (rather than the DSL mapper I am working on): https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/legion/src/taco_mapper.cpp

The script used to generate the backtraces is https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/scripts/dump_all_backtraces.sh (which invokes https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/scripts/dump_node_backtraces.sh)

The submission script for the job is: https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/build/cannonMM-cuda/hang8.lsf Note that the submission script includes -tm:enable_backpressure -tm:backpressure_max_in_flight 1 -ll:defalloc 0 -gex:obcount 4096 -ll:force_kthreads, which enables backpressure (a feature supported by the TacoMapper) and sets a very large obcount. The job hangs both with and without -ll:force_kthreads.

The Legion library is built with RelWithDebInfo (debugging symbols plus optimizations) at commit c9db93715e8f0c17983f97e6fd0da7a2fb199ceb (Date: Tue Jan 10 19:32:56 2023 -0800, "legion: make sure we have unique events for all future instances when profiling"), and the application itself is built in release mode. I cannot reproduce the bug in a DEBUG build, nor in the RelWithDebInfo build with the profiler enabled (-lg:prof 8 -lg:prof_logfile prof_%.gz).
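
For reference, a minimal sketch of a build configuration matching the description above; only the commit hash and CMAKE_BUILD_TYPE=RelWithDebInfo are taken from this report, while the CMake option names (Legion_USE_CUDA, Legion_NETWORKS) and paths are assumptions about a typical Legion build, not the actual Lassen script:

git -C legion checkout c9db93715e8f0c17983f97e6fd0da7a2fb199ceb
# RelWithDebInfo = optimizations plus debugging symbols, as used for the hanging runs
cmake -S legion -B legion-build \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DLegion_USE_CUDA=ON \
      -DLegion_NETWORKS=gasnetex
cmake --build legion-build -j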

Let me know if I need to share the compilation/installation script for this application and the full steps to reproduce the hanging bug on Lassen.

lightsighter commented 1 year ago

Whatever is happening in these three backtraces from this file is probably the culprit (@streichler):

Thread 4 (Thread 0x20002ffef8b0 (LWP 58492)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x20002fff0350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e039e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e039e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e039e0, task=0x2035100784b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e038c0, func_id=<optimized out>, args=0x20002ffee6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b464a8, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20002ffee6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14dc1f10) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e06ef0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14dc1f10, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035100499f0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13f12630, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x203510049870, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f0a5b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f0a5b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2034f1e9d640) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20002ffef8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
Thread 7 (Thread 0x20350c67f8b0 (LWP 58478)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x20350c680350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e039e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e039e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e039e0, task=0x20350410c790) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e038c0, func_id=<optimized out>, args=0x20350c67e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b46b20, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20350c67e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14dc1f10) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e075e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14dc1f10, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035040d19f0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13ef6950, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x2035040d1870, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13ef2a70) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13ef2a70) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2034d53e51c0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20350c67f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
Thread 11 (Thread 0x20257fd3f8b0 (LWP 58473)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x20257fd40350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e039e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e039e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e039e0, task=0x20354c01fd00) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e038c0, func_id=<optimized out>, args=0x20257fd3e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b46d48, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20257fd3e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14dc1f10) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e06800) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14dc1f10, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x20354c023a40, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13f530c0, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x20354c0238c0, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f16250) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f16250) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2035082e5bb0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20257fd3f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
rohany commented 1 year ago

This isn't a startup hang, right? It looks like the application is at least running some tasks.

lightsighter commented 1 year ago

That's right, it's not a start-up hang because stuff is running.

Anjiang-Wei commented 1 year ago

The pumma-cuda algorithm from the DISTAL benchmark also gets stuck on 8 nodes. The following log may be interesting: https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/pumma_8node/bt_lassen222_167201.log#L382_L419 More specifically, the frames
#6  Realm::TaskQueue::enqueue_task (this=0x14e31640, task=0x2035480929e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
and
#6  Realm::TaskQueue::get_best_task (queues=std::vector of length 2, capacity 2 = {...}, task_priority=@0x20005cadeb80: -2147483647) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:470
might be informative, respectively.

Thread 13 (Thread 0x20005e74f8b0 (LWP 168425)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x20005e750350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e31640) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e31640) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e31640, task=0x2035480929e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e31520, func_id=<optimized out>, args=0x20005e74e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b122c8, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20005e74e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14e10010) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e35280) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14e10010, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x20354808ec40, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13f10060, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x20354808eac0, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f07fe0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f07fe0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x20352c0c2ce0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20005e74f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
Thread 12 (Thread 0x20005cadf8b0 (LWP 168426)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x20005cae0350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e31640) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006042788 in lock (this=0x14e31640) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::get_best_task (queues=std::vector of length 2, capacity 2 = {...}, task_priority=@0x20005cadeb80: -2147483647) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:470
#7  0x000020000604aa18 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f0bec0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1068
#8  0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f0bec0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#9  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x20353803b6b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#10 0x00002000000f8cd4 in start_thread (arg=0x20005cadf8b0) at pthread_create.c:309
#11 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
Anjiang-Wei commented 1 year ago

Another algorithm, summa-cuda from the DISTAL benchmark, also gets stuck on 8 nodes. The following log may be interesting: https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/summa_8node/bt_lassen391_99882.log#L398 More specifically, it is again related to Realm::TaskQueue::enqueue_task

Thread 10 (Thread 0x20005cf3f8b0 (LWP 105455)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x20005cf40350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e11870) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e11870) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e11870, task=0x203538003f90) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e11750, func_id=<optimized out>, args=0x20005cf3e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b391f0, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20005cf3e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14df0010) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e154e0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14df0010, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x203538071a10, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13f105b0, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x203538071890, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f08530) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f08530) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2034d813d900) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20005cf3f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

Another similar backtrace is here: https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/summa_8node/bt_lassen546_28372.log#L424

Anjiang-Wei commented 1 year ago

The last algorithm that hangs is solomonik-cuda from the DISTAL benchmark, and it hangs on 128 nodes. The following logs may be interesting, but note that I did not inspect all of the logs, so I may have missed important information.

https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/solomonik_128node/bt_lassen121_150206.log#L246 (compare_exchange)

Thread 15 (Thread 0x2000598ef8b0 (LWP 150446)):
#0  0x00002000060cebb4 in compare_exchange_strong (__m2=std::memory_order_acquire, __m1=std::memory_order_acq_rel, __i2=0, __i1=<synthetic pointer>, this=0x101e8da8) at /usr/tce/packages/gcc/gcc-7.3.1/rh/usr/include/c++/7/bits/atomic_base.h:477
#1  compare_exchange (newval=0, expected=<synthetic pointer>, this=0x101e8da8) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/atomics.inl:154
#2  unlock (this=0x101e8da8) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:194
#3  Realm::GASNetEXPoller::do_work (this=0x101e8d70, work_until=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2812
#4  0x0000200005f3eafc in Realm::BackgroundWorkManager::Worker::do_work (this=0x2000598eeb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
#5  0x0000200005f3f3f0 in Realm::BackgroundWorkThread::main_loop (this=0x15a12620) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
#6  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x15a126f0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#7  0x00002000000f8cd4 in start_thread (arg=0x2000598ef8b0) at pthread_create.c:309
#8  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/solomonik_128node/bt_lassen271_166811.log#L199 (_gasnet_AMPoll ())

Thread 21 (Thread 0x200058fcf8b0 (LWP 167042)):
#0  0x00002000066532fc in gasnetc_snd_reap () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
#1  0x000020000663d540 in gasnetc_AMPoll () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
#2  0x00002000060cea44 in _gasneti_AMPoll () at /g/g92/wei8/release_exp/taco/deps-install/legion-build/embed-gasnet/install/include/gasnet_help.h:1298
#3  _gasnet_AMPoll () at /g/g92/wei8/release_exp/taco/deps-install/legion-build/embed-gasnet/install/include/gasnet_help.h:1511
#4  Realm::GASNetEXPoller::do_work (this=0x101e8d70, work_until=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2904
#5  0x0000200005f3eafc in Realm::BackgroundWorkManager::Worker::do_work (this=0x200058fceb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
#6  0x0000200005f3f3f0 in Realm::BackgroundWorkThread::main_loop (this=0x15954110) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
#7  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x1594fcd0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#8  0x00002000000f8cd4 in start_thread (arg=0x200058fcf8b0) at pthread_create.c:309
#9  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/solomonik_128node/bt_lassen114_77175.log#L381 (Realm::TaskQueue::enqueue_task)

https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/solomonik_128node/bt_lassen114_77175.log#L406 (Realm::TaskQueue::get_best_task)

Anjiang-Wei commented 1 year ago

All the logs can be accessed from both Sapling /scratch2/anjiang/exp/taco/hanglog and https://github.com/Anjiang-Wei/taco/tree/distal-pldi-2022/hanglog

Many thanks!

streichler commented 1 year ago

@Anjiang-Wei can you try adding #define REALM_NO_USE_FUTEX to runtime/realm/realm_config.h and see if anything changes? You've got processes in which multiple threads are waiting for a mutex that nobody seems to be holding, which could be a bug in the mutex code or memory corruption that's clobbering the mutex state. Either way, changing the mutex implementation (which is what that define does) might result in illuminating behavior.
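
For concreteness, one way to apply that change (editing the header by hand is equivalent; the sed placement below is only illustrative):

# put the define near the top of the Realm config header so every translation unit sees it,
# then rebuild Realm/Legion and the application before re-running the experiment
sed -i '1i #define REALM_NO_USE_FUTEX' runtime/realm/realm_config.h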

Anjiang-Wei commented 1 year ago

I added the line to realm_config.h, recompiled everything, and reran the cannon 8-node experiment. It still hangs, and the backtraces seem different. I have uploaded them: https://github.com/Anjiang-Wei/taco/tree/distal-pldi-2022/hanglog/cannon_8node_NO_USE_FUTEX

Some interesting traces may include: https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/cannon_8node_NO_USE_FUTEX/bt_lassen315_21420.log#L103

Thread 33 (Thread 0x20002cfcf8b0 (LWP 21599)):
#0  0x00002000060ce7dc in fetch_or (__m=std::memory_order_acq_rel, __i=1, this=0x101e9528) at /usr/tce/packages/gcc/gcc-7.3.1/rh/usr/include/c++/7/bits/atomic_base.h:544
#1  fetch_or_acqrel (to_or=1, this=0x101e9528) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/atomics.inl:258
#2  trylock (this=0x101e9528) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/mutex.inl:183
#3  Realm::GASNetEXPoller::do_work (this=0x101e94f0, work_until=...) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2809
#4  0x0000200005f3eb3c in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002cfceb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
#5  0x0000200005f3f430 in Realm::BackgroundWorkThread::main_loop (this=0x13e2b6f0) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
#6  0x000020000604e324 in Realm::KernelThread::pthread_entry (data=0x140bb7f0) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/threads.cc:781
#7  0x00002000000f8cd4 in start_thread (arg=0x20002cfcf8b0) at pthread_create.c:309
#8  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:10

https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/cannon_8node_NO_USE_FUTEX/bt_lassen572_156165.log#L266

Thread 15 (Thread 0x20002ef6f8b0 (LWP 156366)):
#0  0x0000200006654a00 in gasnetc_poll_rcv_hca () from /g/g92/wei8/debug_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
#1  0x00002000066549f8 in gasnetc_poll_rcv_hca () from /g/g92/wei8/debug_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
Backtrace stopped: frame did not save the PC

https://github.com/Anjiang-Wei/taco/blob/distal-pldi-2022/hanglog/cannon_8node_NO_USE_FUTEX/bt_lassen722_42006.log#L122

Thread 31 (Thread 0x20002d22f8b0 (LWP 42191)):
#0  0x0000200006653090 in gasnetc_snd_reap () from /g/g92/wei8/debug_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
#1  0x000020000663d320 in gasnetc_AMPoll () from /g/g92/wei8/debug_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
#2  0x00002000060cea84 in _gasneti_AMPoll () at /g/g92/wei8/debug_exp/taco/deps-install/legion-build/embed-gasnet/install/include/gasnet_help.h:1298
#3  _gasnet_AMPoll () at /g/g92/wei8/debug_exp/taco/deps-install/legion-build/embed-gasnet/install/include/gasnet_help.h:1511
#4  Realm::GASNetEXPoller::do_work (this=0x101e94f0, work_until=...) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2904
#5  0x0000200005f3eb3c in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002d22eb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
#6  0x0000200005f3f430 in Realm::BackgroundWorkThread::main_loop (this=0x13e2e890) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
#7  0x000020000604e324 in Realm::KernelThread::pthread_entry (data=0x13e2a6f0) at /g/g92/wei8/debug_exp/taco/legion/legion/runtime/realm/threads.cc:781
#8  0x00002000000f8cd4 in start_thread (arg=0x20002d22f8b0) at pthread_create.c:309
#9  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
streichler commented 1 year ago

Everything reported here would also be consistent with an app that is just running lots and lots of tasks that don't yield any visible progress. Can you run with -level task=2 and see if the logs are still growing when the app looks to be hanging?
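
A quick way to check that (a sketch; it assumes a -logfile tasklog_%.log style option as used in the reply below, and a shell with access to wherever the log files land):

# watch the per-rank log sizes while the run appears hung; if tasks are still being
# logged, the sizes keep growing (the 30-second interval is arbitrary)
while true; do
  ls -l tasklog_*.log
  sleep 30
done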

Anjiang-Wei commented 1 year ago

Running with -level task=2 -logfile tasklog_%.log, the program terminates normally without hanging.

lightsighter commented 1 year ago

How many times did you run it?

lightsighter commented 1 year ago

Can you also get it to hang and then confirm whether the backtraces are stable or not upon allowing the processes to continue and then preempting them again?
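
One way to do that with gdb in batch mode (a sketch of roughly what the dump scripts linked above already do; PID1/PID2 are placeholders for the application process IDs, and the 2-minute wait is arbitrary):

# first sample: attach, dump all threads, detach (the process keeps running afterwards)
for pid in PID1 PID2; do
  gdb -batch -p "$pid" -ex 'set pagination off' -ex 'thread apply all bt' > "bt1_${pid}.log"
done
sleep 120
# second sample of the same processes, then compare the two samples per process
for pid in PID1 PID2; do
  gdb -batch -p "$pid" -ex 'set pagination off' -ex 'thread apply all bt' > "bt2_${pid}.log"
  diff "bt1_${pid}.log" "bt2_${pid}.log"
done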

Anjiang-Wei commented 1 year ago

Thanks for the suggestion! I observed that the -level task=2 results are unstable. I ran several times with the -level task=2 flag on, and it either succeeds or hits an assertion failure:

cannonMM-cuda: /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/logging.cc:129: virtual void Realm::LoggerFileStream::write(const char*, size_t): Assertion `amt == len' failed.
*** Caught a fatal signal (proc 4): SIGABRT(6)
*** NOTICE (proc 4): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** NOTICE (proc 4): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
WARNING: ODP shutdown in signal context
ERROR:  One or more process (first noticed rank 4) terminated with signal 6

The full command line for the app is jsrun -b none -c ALL_CPUS -g ALL_GPUS -r 1 -n 8 ../bin/cannonMM-cuda -n 56568 -gx 4 -gy 2 -dm:exact_region -tm:untrack_valid_regions -ll:ocpu 1 -ll:othr 10 -ll:csize 150000 -ll:util 4 -dm:replicate 1 -ll:gpu 4 -ll:fsize 15000 -ll:bgwork 12 -ll:bgnumapin 1 -tm:enable_backpressure -tm:backpressure_max_in_flight 1 -ll:defalloc 0 -gex:obcount 4096 -ll:force_kthreads -level task=2 -logfile tasklog_%.log

The tasklog_*.log files are huge. I ran `for f in $(ls *.log); do echo $f; tail -2 $f; done` over the failure logs to capture the last two lines of each file:

tasklog_0.log
[0 - 20257f8df8b0]   67.292812 {2}{task}: task 0x203524014f70 started: func=4 proc=1c00000000000001 arglen=56 before=0 after=800100000ee00085
[0 - 20257f8df8b0]   67.292847 {2}{task}: task 0x203524014f70 completed: func=4 proc=1c00000000000001 arglen=56 before=0 after=800100000ee00085
tasklog_1.log
[1 - 2035108af8b0]   67.292815 {2}{task}: task 0x20350c003ce0 started: func=4 proc=1c00010001000001 arglen=56 before=0 after=80010000074000e2
[1 - 2035108af8b0]   67.292844 {2}{task}: task 0x20350c003ce0 completed: func=4 proc=1c00010001000001 arglen=56 before=0 after=80010000074000e2
tasklog_2.log
[2 - 2034f8adf8b0]   67.293276 {2}{task}: task 0x2035241315e0 started: func=4 proc=1c00020002000001 arglen=24 before=8001000003100093 after=800100000b9000af
[2 - 2034f8adf8b0]   67.293298 {2}{task}: task 0x2035241315e0 completed: func=4 proc=1c00020002000001 arglen=24 before=8001000003100093 after=800100000b9000af
tasklog_3.log
[3 - 20257ff6f8b0]  306.856894 {2}{task}: task 0x2035101556b0 started: func=4 proc=1c00030003000001 arglen=32 before=8001800006500095 after=80018000086000a2
[3 - 20257ff6f8b0]  306.856939 {2}{task}: task 0x2035101556b0 completed: func=4 proc=1c00030003000001 arglen=32 before=8001800006500095 after=80018000086000a2
tasklog_4.log
[4 - 20257f24f8b0]  322.768719 {2}{task}: task 0x20354c008930 created: func=4 proc=1c00040004000001 arglen=24 before=0 after=8002000009610ac1
[4 - 2034f8d0f8b0]  32
tasklog_5.log
[5 - 20350cd0f8b0]   67.291032 {2}{task}: task 0x20351c01ea60 started: func=4 proc=1c00050005000001 arglen=2032 before=0 after=800100000eb00082
[5 - 20350cd0f8b0]   67.291129 {2}{task}: task 0x20351c01ea60 completed: func=4 proc=1c00050005000001 arglen=2032 before=0 after=800100000eb00082
tasklog_6.log
[6 - 20002f2bf8b0]   67.292822 {2}{task}: task 0x200050048570 started: func=4 proc=1c00060006000001 arglen=56 before=0 after=800100000a900063
[6 - 20002f2bf8b0]   67.292859 {2}{task}: task 0x200050048570 completed: func=4 proc=1c00060006000001 arglen=56 before=0 after=800100000a900063
tasklog_7.log
[7 - 2034fd16f8b0]  306.856802 {2}{task}: task 0x2000580734d0 started: func=4 proc=1c00070007000001 arglen=192 before=800180000930008c after=800180000b8000b6
[7 - 2034fd16f8b0]  306.856827 {2}{task}: task 0x2000580734d0 completed: func=4 proc=1c00070007000001 arglen=192 before=800180000930008c after=800180000b8000b6

You can see that tasklog_4.log does not end properly, which confirms the assertion failure on proc 4. Another observation is that the ending time of each processor's log differs: some end around 306, one ends around 322, while most end around 67.

Anjiang-Wei commented 1 year ago

The backtraces do not seem to be stable. I collected two sets of traces: the first in hanglog1, and the second in hanglog2, collected after waiting another 2 minutes.

hanglog1/bt_lassen407_97298.log and hanglog2/bt_lassen407_97298.log are identical. Below I will show some examples of different traces:

diff hanglog1/bt_lassen407_97327.log hanglog2/bt_lassen407_97327.log produces

120,123c120,128
< #0  0x0000200006654b90 in gasnetc_poll_rcv_hca () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
< #1  0x000020000665565c in gasnetc_poll_rcv_all () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
< #2  0x000020000665573c in gasnetc_do_poll () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
< Backtrace stopped: frame did not save the PC
---
> #0  0x00002000060cebb4 in compare_exchange_strong (__m2=std::memory_order_acquire, __m1=std::memory_order_acq_rel, __i2=0, __i1=<synthetic pointer>, this=0x101e9528) at /usr/tce/packages/gcc/gcc-7.3.1/rh/usr/include/c++/7/bits/atomic_base.h:477
> #1  compare_exchange (newval=0, expected=<synthetic pointer>, this=0x101e9528) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/atomics.inl:154
> #2  unlock (this=0x101e9528) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:194
> #3  Realm::GASNetEXPoller::do_work (this=0x101e94f0, work_until=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2812
> #4  0x0000200005f3eafc in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002d22eb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
> #5  0x0000200005f3f3f0 in Realm::BackgroundWorkThread::main_loop (this=0x13e2ef50) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
> #6  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x13e2adb0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
> #7  0x00002000000f8cd4 in start_thread (arg=0x20002d22f8b0) at pthread_create.c:309
> #8  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

diff hanglog1/bt_lassen414_22794.log hanglog2/bt_lassen414_22794.log produces

264,272c264,269
< #0  0x00002000060cebb4 in compare_exchange_strong (__m2=std::memory_order_acquire, __m1=std::memory_order_acq_rel, __i2=0, __i1=<synthetic pointer>, this=0x101e9528) at /usr/tce/packages/gcc/gcc-7.3.1/rh/usr/include/c++/7/bits/atomic_base.h:477
< #1  compare_exchange (newval=0, expected=<synthetic pointer>, this=0x101e9528) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/atomics.inl:154
< #2  unlock (this=0x101e9528) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:194
< #3  Realm::GASNetEXPoller::do_work (this=0x101e94f0, work_until=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2812
< #4  0x0000200005f3eafc in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002f08eb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
< #5  0x0000200005f3f3f0 in Realm::BackgroundWorkThread::main_loop (this=0x140afac0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
< #6  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x140ab890) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
< #7  0x00002000000f8cd4 in start_thread (arg=0x20002f08f8b0) at pthread_create.c:309
< #8  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
---
> #0  Realm::GASNetEXPoller::do_work (this=0x101e94f0, work_until=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2798
> #1  0x0000200005f3eafc in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002f08eb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:621
> #2  0x0000200005f3f3f0 in Realm::BackgroundWorkThread::main_loop (this=0x140afac0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
> #3  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x140ab890) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
> #4  0x00002000000f8cd4 in start_thread (arg=0x20002f08f8b0) at pthread_create.c:309
> #5  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

diff hanglog1/bt_lassen596_117593.log hanglog2/bt_lassen596_117593.log produces

241,244c241,248
< #0  0x0000200006654ba0 in gasnetc_poll_rcv_hca () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
< #1  0x000020000665565c in gasnetc_poll_rcv_all () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
< #2  0x000020000665573c in gasnetc_do_poll () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
< Backtrace stopped: frame did not save the PC
---
> #0  0x0000200005cd0998 in 000002af.plt_call._ZN5Realm5Clock18native_time_slowerEv () from /g/g92/wei8/release_exp/taco/deps-install/cmake-install/lib64/librealm.so.1
> #1  0x0000200005f3eb1c in native_time () at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/timers.inl:84
> #2  current_time_in_nanoseconds (absolute=true) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/timers.inl:66
> #3  Realm::BackgroundWorkManager::Worker::do_work (this=0x20002ed0eb58, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:627
> #4  0x0000200005f3f3f0 in Realm::BackgroundWorkThread::main_loop (this=0x140a7d80) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/bgwork.cc:125
> #5  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x13fd1ca0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
> #6  0x00002000000f8cd4 in start_thread (arg=0x20002ed0f8b0) at pthread_create.c:309
> #7  0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
lightsighter commented 1 year ago

The polling in the GASNet progress thread is allowed to change while we still consider the job hung. Do the threads with ProcessorGroupImpl in their backtraces change?
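
A quick way to isolate just those threads across the two samples (a sketch, using the hanglog1/hanglog2 directories described above; the grep context size is arbitrary):

# report which per-process backtrace files have differing ProcessorGroupImpl frames
for f in hanglog1/bt_*.log; do
  g="hanglog2/$(basename "$f")"
  if ! diff -q <(grep -A20 'ProcessorGroupImpl' "$f") \
               <(grep -A20 'ProcessorGroupImpl' "$g") >/dev/null; then
    echo "ProcessorGroupImpl frames changed: $(basename "$f")"
  fi
done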

Anjiang-Wei commented 1 year ago

Thanks for the guidance in debugging! Yes, the threads with ProcessorGroupImpl change their backtraces.

Only one node has the target ProcessorGroupImpl keyword. In the first backtrace, two threads (Thread 11 and Thread 4) have ProcessorGroupImpl:

Thread 11 (Thread 0x20352467f8b0 (LWP 66037)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x203524680350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e126b0, task=0x2035440db730) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e12590, func_id=<optimized out>, args=0x20352467e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b16e40, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20352467e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14df0010) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e153f0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14df0010, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035440db5d0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13ef4640, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x2035440db450, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13ef0760) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13ef0760) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2035140e7dd0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20352467f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

and

Thread 4 (Thread 0x20352516f8b0 (LWP 66045)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x203525170350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e126b0, task=0x203538087550) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e12590, func_id=<optimized out>, args=0x20352516e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b2bc78, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20352516e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14df0010) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e15ae0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14df0010, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035380879b0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13f50db0, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x203538087830, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f13f40) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f13f40) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2034e00cab40) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20352516f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

In the second backtrace (after waiting for ~2 minutes), only Thread 4 has ProcessorGroupImpl:

Thread 4 (Thread 0x20352516f8b0 (LWP 66045)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x203525170350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006044988 in lock (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::enqueue_task (this=0x14e126b0, task=0x203538087830) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
#7  0x0000200005ff070c in Realm::ProcessorGroupImpl::enqueue_task (this=<error reading variable: value has been optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:701
#8  0x0000200005ff1f5c in Realm::ProcessorImpl::enqueue_or_defer_task (this=<optimized out>, task=<optimized out>, start_event=..., cache=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:513
#9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e12590, func_id=<optimized out>, args=0x20352516e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b307f0, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
#10 0x0000200005ff1694 in Realm::Processor::spawn (this=<optimized out>, func_id=<optimized out>, args=0x20352516e6b8, arglen=24, wait_on=..., priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:88
#11 0x00002000053a5a30 in issue_runtime_meta_task<Legion::Internal::ProcessorManager::SchedulerArgs> (target=..., precondition=..., priority=Legion::Internal::LG_LATENCY_WORK_PRIORITY, args=..., this=0x14df0010) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.h:4791
#12 Legion::Internal::ProcessorManager::perform_scheduling (this=0x14e15ae0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:7283
#13 0x00002000053a5b8c in Legion::Internal::Runtime::process_schedule_request (this=0x14df0010, proc=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:25686
#14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035380876d0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
#15 0x0000200005ff11d0 in Realm::LocalTaskProcessor::execute_task (this=0x13f50db0, func_id=<optimized out>, task_args=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:1135
#16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x203538087550, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
#17 0x0000200006047424 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1366
#18 0x000020000604a874 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13f13f40) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1105
#19 0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13f13f40) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#20 0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2034e00cab40) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#21 0x00002000000f8cd4 in start_thread (arg=0x20352516f8b0) at pthread_create.c:309
#22 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

And Thread 11 has changed to the following (ProcessorGroupImpl no longer appears on its stack):

Thread 11 (Thread 0x20352467f8b0 (LWP 66037)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00002000060fdbb0 in Realm::Doorbell::wait_slow (this=0x203524680350) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:316
#2  0x00002000060febd4 in wait (this=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:81
#3  Realm::FIFOMutex::lock_slow (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.cc:839
#4  0x0000200006042788 in lock (this=0x14e126b0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:220
#5  AutoLock (mutex=..., this=<synthetic pointer>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/mutex.inl:329
#6  Realm::TaskQueue::get_best_task (queues=std::vector of length 2, capacity 2 = {...}, task_priority=@0x20352467eb80: -2147483647) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:470
#7  0x000020000604aa18 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x13ef0760) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1068
#8  0x000020000604b08c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x13ef0760) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:1217
#9  0x000020000604e2e4 in Realm::KernelThread::pthread_entry (data=0x2035140e7dd0) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/threads.cc:781
#10 0x00002000000f8cd4 in start_thread (arg=0x20352467f8b0) at pthread_create.c:309
#11 0x00002000086e7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

Also, note that Thread 4 in the second trace differs slightly from Thread 4 in the first trace, as the following diff shows (e.g., task=0x203538087550 versus task=0x203538087830, and finish_event=0x15b2bc78 versus finish_event=0x15b307f0):

8c8
< #6  Realm::TaskQueue::enqueue_task (this=0x14e126b0, task=0x203538087550) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
---
> #6  Realm::TaskQueue::enqueue_task (this=0x14e126b0, task=0x203538087830) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:508
11c11
< #9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e12590, func_id=<optimized out>, args=0x20352516e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b2bc78, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
---
> #9  0x0000200005ff67bc in Realm::ProcessorGroupImpl::spawn_task (this=0x14e12590, func_id=<optimized out>, args=0x20352516e6b8, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x15b307f0, finish_gen=<optimized out>, priority=<optimized out>) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/proc_impl.cc:790
16c16
< #14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035380879b0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
---
> #14 0x00002000053fa30c in Legion::Internal::Runtime::legion_runtime_task (args=0x2035380876d0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/legion/runtime.cc:31443
18c18
< #16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x203538087830, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
---
> #16 0x0000200006047178 in Realm::Task::execute_on_processor (this=0x203538087550, p=...) at /g/g92/wei8/release_exp/taco/legion/legion/runtime/realm/tasks.cc:302
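For reference, a minimal sketch of how such a per-thread diff can be produced from two backtrace dumps; the file names backtrace_first.txt and backtrace_second.txt are hypothetical placeholders for the dumps captured at the two points in time:

# Extract the Thread 4 frames from each dump (from the "Thread 4" header
# up to the next blank line) and diff the two extracts.
awk '/^Thread 4 /{p=1} p{print} p&&/^$/{exit}' backtrace_first.txt  > t4_first.txt
awk '/^Thread 4 /{p=1} p{print} p&&/^$/{exit}' backtrace_second.txt > t4_second.txt
diff t4_first.txt t4_second.txt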
lightsighter commented 1 year ago

Rebuild Legion with -DLEGION_SEPARATE_META_TASKS defined, then rerun with -lg:registration -level runtime=2,task=2, let it run for a bit, kill it, and report the logs back here.
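For reference, a minimal sketch of one way to add that define, assuming an out-of-source CMake build of Legion; the build directory and source path are placeholders, and the remaining options should match whatever the existing build used:

# Reconfigure and rebuild Legion with the extra compile-time define (assumed CMake build).
cd legion/build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_CXX_FLAGS="-DLEGION_SEPARATE_META_TASKS" \
      /path/to/legion
make -j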

Anjiang-Wei commented 1 year ago

I ran it 5 times, and all of them failed with ERROR: One or more process (first noticed rank 7) terminated with signal 9. The logs are empty, and the failure does not produce a backtrace. I turned on backtraces and logging with LEGION_BACKTRACE=1 jsrun -b none -c ALL_CPUS -g ALL_GPUS -r 1 -n 8 ../bin/cannonMM-cuda -n 56568 -gx 4 -gy 2 -dm:exact_region -tm:untrack_valid_regions -ll:ocpu 1 -ll:othr 10 -ll:csize 150000 -ll:util 4 -dm:replicate 1 -ll:gpu 4 -ll:fsize 15000 -ll:bgwork 12 -ll:bgnumapin 1 -tm:enable_backpressure -tm:backpressure_max_in_flight 1 -ll:defalloc 0 -gex:obcount 4096 -ll:force_kthreads -lg:registration -level runtime=2,task=2 -logfile tasklog_%.log.

Should I reduce the matrix size (-n 56568) and memory requirement (-ll:csize 150000, -ll:fsize 15000) and rerun?

lightsighter commented 1 year ago

Signal 9 is SIGKILL, probably the OOM killer, but it could also be the job scheduler. Go back to how you were running it before.
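For what it's worth, a minimal sketch of checking for OOM-killer activity, assuming you can get a shell on the node whose rank was killed; the exact kernel-log wording varies across systems:

# Look for OOM-killer messages in the kernel log on the affected compute node.
dmesg -T | grep -i -E 'out of memory|killed process' | tail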

Anjiang-Wei commented 1 year ago

I ran it again with Legion built with -DLEGION_SEPARATE_META_TASKS, using the original command LEGION_BACKTRACE=1 jsrun -b none -c ALL_CPUS -g ALL_GPUS -r 1 -n 8 ../bin/cannonMM-cuda -n 56568 ... -ll:force_kthreads -lg:registration -level runtime=2,task=2 -logfile tasklog_%.txt.

Luckily, this time the program is no longer killed with signal 9 and finishes normally, reporting "On 8 nodes achieved GFLOPS per node: 1371.930636", which is much lower than expected. The produced log files are huge and imbalanced:

du -h tasklog*
309M    tasklog_0.txt
333M    tasklog_1.txt
253M    tasklog_2.txt
5.7G    tasklog_3.txt
212M    tasklog_4.txt
533M    tasklog_5.txt
607M    tasklog_6.txt
359M    tasklog_7.txt

I uploaded these logs to Sapling: http://sapling.stanford.edu/~anjiang/tasklog_0.txt http://sapling.stanford.edu/~anjiang/tasklog_1.txt ... http://sapling.stanford.edu/~anjiang/tasklog_7.txt

lightsighter commented 1 year ago

We're not going to learn anything without logs from an actual hang. Try building in release mode.

Anjiang-Wei commented 1 year ago

Thanks for the suggestion; I reproduced the hanging bug! The original logs are huge and imbalanced:

32G     tasklog_0.txt
48M     tasklog_1.txt
117M    tasklog_2.txt
168M    tasklog_3.txt
227M    tasklog_4.txt
5.7M    tasklog_5.txt
272M    tasklog_6.txt
113M    tasklog_7.txt

I truncated each of the eight per-node logs to the first 100000 lines (around 14 MB per file after truncation) and uploaded them to Sapling: http://sapling.stanford.edu/~anjiang/tasklog_0_trun.txt ... http://sapling.stanford.edu/~anjiang/tasklog_7_trun.txt

lightsighter commented 1 year ago

We need the first 1K lines and the last 100K lines, not just the first 100K. Also, you didn't rebuild with -DLEGION_SEPARATE_META_TASKS as I asked.
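For reference, a minimal sketch of extracting both pieces from each per-node log, assuming the tasklog_%.txt naming from the run command above:

# For each per-node log, keep the first 1000 lines followed by the last 100000 lines.
for i in 0 1 2 3 4 5 6 7; do
  head -n 1000   tasklog_${i}.txt  > tasklog_${i}_head_tail.txt
  tail -n 100000 tasklog_${i}.txt >> tasklog_${i}_head_tail.txt
done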