bandokihiro opened 1 year ago
The most recent failure from #1235 should have reported an error message (although the line numbers are no longer accurate, so I can't tell exactly what it was complaining about). Can you at least report the error message that it prints out now?
Can we also try to get updated line numbers for the backtraces? One thing you might try, to make sure no process exits early, is putting a very long sleep call before each process exits, so that none of them exit early and give the job scheduler permission to tear down the job.
There are no obvious error messages from what I can tell. In debug mode, the current backtrace is the following:
#9 0x000020000a737014 in __assert_fail () from /lib64/power9/libc.so.6
#10 0x00000000131c981c in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x201c3a29eab0, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
#11 0x00000000131cda4c in Legion::Internal::PhysicalManager::compute_copy_offsets (this=0x201c3a373790, copy_mask=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:1133
#12 0x000000001210fa84 in Legion::Internal::IndividualView::copy_from (this=0x201c3a373be0, src_view=0x200f8a643f20, precondition=..., predicate_guard=..., reduction_op_id=0, copy_expression=0x200f8a66af30, op=0x201c3a31abc0, index=0, collective_match_space=20515, copy_mask=..., src_point=0x200f8a1890b0, trace_info=..., recorded_events=...,
applied_events=..., across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_views.cc:2422
#13 0x0000000012f4f464 in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c46194120, target=0x201c3a373be0, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5493
#14 0x0000000012f4d8a4 in Legion::Internal::CopyFillAggregator::perform_updates (this=0x201c46194120, updates=..., trace_info=..., precondition=..., recorded_events=..., redop_index=-1, manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5224
#15 0x0000000012f4ceb8 in Legion::Internal::CopyFillAggregator::issue_updates (this=0x201c46194120, trace_info=..., precondition=..., restricted_output=false, manage_dst_events=true, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5067
#16 0x0000000012f4fdc8 in Legion::Internal::CopyFillAggregator::handle_aggregation (args=0x201c3a372650) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5571
#17 0x0000000012386380 in Legion::Internal::Runtime::legion_runtime_task (args=0x201c3a372650, arglen=108, userdata=0x53150d50, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:32011
#18 0x000000001394e170 in Realm::LocalTaskProcessor::execute_task (this=0x54b74b90, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#19 0x00000000139ee274 in Realm::Task::execute_on_processor (this=0x200075647fe0, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#20 0x00000000139f46b0 in Realm::UserThreadTaskScheduler::execute_task (this=0x54886240, task=0x200075647fe0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#21 0x00000000139f1d5c in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x54886240) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#22 0x00000000139ff0dc in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x54886240) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#23 0x0000000013a15bdc in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1355
I am going to try adding back the assertion and making it hang with your tip.
The sleep trick did not work. It hit the inserted assertion at runtime/legion/legion_instances.cc:244.
Did you flush all streams (or at least stdout/stderr) before the sleep? If this is a multi-node job, you might also try running with stdbuf -o0 -e0 to be extra sure there's no additional level of buffering.
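For what it's worth, here is a minimal sketch of the flush-then-sleep idea; the helper name park_process is hypothetical, not something from Legion or the application:

```cpp
// Hypothetical helper: flush everything, then park the process.
#include <cstdio>    // fflush
#include <iostream>  // std::cout, std::cerr
#include <unistd.h>  // sleep

static void park_process()
{
  std::cout.flush();   // flush the C++ stream layer
  std::cerr.flush();
  fflush(NULL);        // flush every open C stdio stream, stdout/stderr included
  sleep(24 * 60 * 60); // hold the process for a day so nothing exits early
}
```

Calling something like this right before the process would otherwise exit keeps buffered assertion output from being lost when the scheduler eventually kills the job.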
I think @bandokihiro has a different problem, which is that his job is being spiked by the job scheduler before he can go debug it.
> The sleep trick did not work. It hit the inserted assertion at runtime/legion/legion_instances.cc:244.
Are you sure you're on the right version of control_replication? I don't see an assert on that line. There is one on line 252 of the same file. I do see the assert on line 363 of legion_instances.cc that you reported earlier.
The one on line 244 was inserted by me following your patch above.
I might have to run sleep via jsrun. Right now, I can't access my files in my project dir for some reason.
Ok, you misinterpreted where to put the sleep then. I was recommending you put it in your main function before you exit the process (e.g. after calling Runtime::wait_for_shutdown), which will prevent the processes that don't hit the error from exiting and giving the job scheduler permission to kill your job (most job schedulers will start spiking processes soon after they see at least one process of the job exit, since they know the other processes shouldn't continue running). There's no guarantee that is what is happening; it's just the most likely cause I know of for the behavior you are observing.
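To make the suggested placement concrete, here is a minimal sketch, assuming the runtime was started in background mode so that main regains control, and assuming the mainline Runtime::start / Runtime::wait_for_shutdown entry points (task registration elided):

```cpp
#include <unistd.h>
#include "legion.h"

using namespace Legion;

int main(int argc, char **argv)
{
  // ... task registration elided ...
  Runtime::start(argc, argv, true /*background*/);
  int ret = Runtime::wait_for_shutdown();
  // Every process parks here after shutdown, so the scheduler never sees a
  // clean exit, and the process stuck on the assertion stays alive for gdb.
  sleep(24 * 60 * 60);
  return ret;
}
```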
Oh I see, that makes sense; I'll try it. However, I think the processes that don't hit the error "shouldn't" be able to reach the end of the program anyway, since they depend on data coming from the processes that hang. I also did not see the time it takes for the job to get killed vary with the input number of iterations.
Putting a sleep at the end of main did not help either.
I played with the collective branch on sapling. Since the mode of failure has changed since the last time I tried (it was a mapper error then), and for future debugging if needed, I'll report the errors I am triggering here.
The backtrace associated with the current mode of failure is the following:
The command line was the following: