bandokihiro opened 1 year ago
The most recent failure from #1235 should have reported an error message (although the line numbers are no longer accurate, so I can't tell exactly what it was complaining about). Can you at least report the error message that it prints out now?
Can we also try to get updated line numbers for the backtraces? One thing you might try, to make sure no process exits early, is putting a very long sleep call before each process exits, so that none of them exit early and give the job scheduler permission to tear down the job.
There are no obvious error messages from what I can tell. In debug mode, the current backtrace is the following:
#9 0x000020000a737014 in __assert_fail () from /lib64/power9/libc.so.6
#10 0x00000000131c981c in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x201c3a29eab0, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
#11 0x00000000131cda4c in Legion::Internal::PhysicalManager::compute_copy_offsets (this=0x201c3a373790, copy_mask=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:1133
#12 0x000000001210fa84 in Legion::Internal::IndividualView::copy_from (this=0x201c3a373be0, src_view=0x200f8a643f20, precondition=..., predicate_guard=..., reduction_op_id=0, copy_expression=0x200f8a66af30, op=0x201c3a31abc0, index=0, collective_match_space=20515, copy_mask=..., src_point=0x200f8a1890b0, trace_info=..., recorded_events=...,
applied_events=..., across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_views.cc:2422
#13 0x0000000012f4f464 in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c46194120, target=0x201c3a373be0, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5493
#14 0x0000000012f4d8a4 in Legion::Internal::CopyFillAggregator::perform_updates (this=0x201c46194120, updates=..., trace_info=..., precondition=..., recorded_events=..., redop_index=-1, manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5224
#15 0x0000000012f4ceb8 in Legion::Internal::CopyFillAggregator::issue_updates (this=0x201c46194120, trace_info=..., precondition=..., restricted_output=false, manage_dst_events=true, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5067
#16 0x0000000012f4fdc8 in Legion::Internal::CopyFillAggregator::handle_aggregation (args=0x201c3a372650) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5571
#17 0x0000000012386380 in Legion::Internal::Runtime::legion_runtime_task (args=0x201c3a372650, arglen=108, userdata=0x53150d50, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:32011
#18 0x000000001394e170 in Realm::LocalTaskProcessor::execute_task (this=0x54b74b90, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#19 0x00000000139ee274 in Realm::Task::execute_on_processor (this=0x200075647fe0, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#20 0x00000000139f46b0 in Realm::UserThreadTaskScheduler::execute_task (this=0x54886240, task=0x200075647fe0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#21 0x00000000139f1d5c in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x54886240) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#22 0x00000000139ff0dc in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x54886240) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#23 0x0000000013a15bdc in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1355
I am going to try adding back the assertion and making it hang with your tip.
The sleep trick did not work. It hit the inserted assertion at runtime/legion/legion_instances.cc:244.
Did you flush all streams (or at least stdout/stderr) before the sleep? If this is a multi-node job, you might also try running with stdbuf -o0 -e0 to be extra sure there's no additional level of buffering.
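For what it's worth, here is a minimal sketch of the flush-then-sleep idea; the helper name park_process is hypothetical, not something from Legion or the application:

```cpp
// Hypothetical helper: flush everything, then park the process.
#include <cstdio>    // fflush
#include <iostream>  // std::cout, std::cerr
#include <unistd.h>  // sleep

static void park_process()
{
  std::cout.flush();   // flush the C++ stream layer
  std::cerr.flush();
  fflush(NULL);        // flush every open C stdio stream, stdout/stderr included
  sleep(24 * 60 * 60); // hold the process for a day so nothing exits early
}
```

Calling something like this right before the process would otherwise exit keeps buffered assertion output from being lost when the scheduler eventually kills the job.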
I think @bandokihiro has a different problem, which is that his job is being spiked by the job scheduler before he can go debug it.
> The sleep trick did not work. It hit the inserted assertion at runtime/legion/legion_instances.cc:244.
Are you sure you're on the right version of control_replication? I don't see an assert on that line. There is one on line 252 of the same file. I do see the assert on line 363 of legion_instances.cc that you reported earlier.
The one on line 244 was inserted by me following your patch above.
I might have to run sleep via jsrun. Right now, I can't access my files in my project dir for some reason.
Ok, you misinterpreted where to put the sleep then. I was recommending you put it in your main function before you exit the process (e.g. after calling Runtime::wait_for_shutdown), which will prevent the processes that don't hit the error from exiting and giving the job scheduler permission to kill your job (most job schedulers will start spiking processes soon after they see at least one process of the job exit, since they know the other processes shouldn't continue running). There's no guarantee that is what is happening; it's just the most likely cause I know of for the behavior you are observing.
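To make the suggested placement concrete, here is a minimal sketch, assuming the runtime was started in background mode so that main regains control, and assuming the mainline Runtime::start / Runtime::wait_for_shutdown entry points (task registration elided):

```cpp
#include <unistd.h>
#include "legion.h"

using namespace Legion;

int main(int argc, char **argv)
{
  // ... task registration elided ...
  Runtime::start(argc, argv, true /*background*/);
  int ret = Runtime::wait_for_shutdown();
  // Every process parks here after shutdown, so the scheduler never sees a
  // clean exit, and the process stuck on the assertion stays alive for gdb.
  sleep(24 * 60 * 60);
  return ret;
}
```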
Oh I see, that makes sense; I'll try it. However, I think the processes that don't hit the error "shouldn't" be able to reach the end of the program anyway, since they depend on data coming from the processes that hang. I also did not see the time it takes for the job to get killed vary with the input number of iterations.
Putting a sleep at the end of main did not help either.
I played with the collective branch on sapling. Since the mode of failure has changed since the last time I tried (it was a mapper error then), and for future debugging if needed, I'll report the errors I am triggering here.
The backtrace associated with the current mode of failure is the following:
The command line was the following: