bandokihiro opened this issue 1 year ago
Which wrapper.sh? You have 387 of them in your scratch directory on sapling.
Please do
cd /scratch2/bandokihiro/Issue1344/run
sbatch runslurm.sh
Pull and try again.
This error is not triggered anymore. Without the safe control replication checks, the app runs to completion. With level 1, I run into the following, which does not happen on the control_replication branch:
[0 - 7ff8e1cd6000] 3.732217 {5}{runtime}: Detected control replication violation when invoking destroy_index_space in task top_level_task (UID 12) on shard 0 [Provenance: unknown]. The hash summary for the function does not align with the hash summaries from other call sites. We'll run the hash algorithm again to try to recognize what value differs between the shards, hang tight...
LEGION ERROR: Specific control replication violation occurred from member handle (from file /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:13342)
That seems like a real error. You need to delete index spaces in the same order across your shards.
I am unable to locate the offending call. I commented out all destroy_index_space calls in the app and it still triggers, though after completion of most of the program. Does this happen when the runtime tears things down? I moved the run folder to /home/bandokihiro/scratch/Issue1344/run/1Node.
Pull and try again.
Same error with the following backtrace:
#0 0x00007f9d9cccb23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f9d6b0a5fe0, rem=rem@entry=0x7f9d6b0a5fe0)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1 0x00007f9d9ccd0ec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7f9d6b0a5fe0, remaining=remaining@entry=0x7f9d6b0a5fe0) at nanosleep.c:27
#2 0x00007f9d9ccd0dfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3 0x000055d5a7fec6b5 in Realm::realm_freeze (signal=6) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/runtime_impl.cc:179
#4 <signal handler called>
#5 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6 0x00007f9d9cc10859 in __GI_abort () at abort.c:79
#7 0x000055d5a6ff78f1 in Legion::Internal::Runtime::report_error_message (id=607,
file_name=0x55d5a8ebbfc0 "/home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc", line=13340,
message=0x7f9d6b0a6be0 "Specific control replication violation occurred from member handle") at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/runtime.cc:31733
#8 0x000055d5a79d3e9c in Legion::Internal::ReplicateContext::verify_hash (this=0x7f8f3c073c00, hash=0x7f9d6b0a7c50, description=0x55d5a8ec5f46 "handle", provenance=0x0,
verify_every_call=true) at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:13340
#9 0x000055d5a6ffb421 in Legion::Internal::Murmur3Hasher::verify (this=0x7f9d6b0a7d70, description=0x55d5a8ec5f46 "handle", every_call=true)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_utilities.h:2006
#10 0x000055d5a7a25a49 in Legion::Internal::Murmur3Hasher::hash<Legion::IndexSpace> (this=0x7f9d6b0a7d70, value=..., description=0x55d5a8ec5f46 "handle")
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_utilities.h:1863
#11 0x000055d5a79d92ee in Legion::Internal::ReplicateContext::destroy_index_space (this=0x7f8f3c073c00, handle=..., unordered=false, recurse=true, provenance=0x0)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:14162
#12 0x000055d5a79ca8ef in Legion::Internal::InnerContext::end_task (this=0x7f8f3c073c00, res=0x0, res_size=0, owned=false, deferred_result_instance=..., callback_functor=0x0,
resource=0x0, freefunc=0x0, metadataptr=0x0, metadatasize=0) at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:11423
#13 0x000055d5a79faa8f in Legion::Internal::ReplicateContext::end_task (this=0x7f8f3c073c00, res=0x0, res_size=0, owned=false, deferred_result_instance=..., callback_functor=0x0,
resource=0x0, freefunc=0x0, metadataptr=0x0, metadatasize=0) at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:19677
#14 0x000055d5a6b25d3b in Legion::Runtime::legion_task_postamble (ctx=0x7f8f3c073c00, retvalptr=0x0, retvalsize=0, owned=false, inst=..., metadataptr=0x0, metadatasize=0)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion.cc:8042
#15 0x000055d5a66b6da5 in Legion::LegionTaskWrapper::legion_task_wrapper<&(top_level_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*))> (args=0x7f9d4c01b670, arglen=8, userdata=0x0, userlen=0, p=...)
at /scratch2/bandokihiro/Issue1344/legion/install/include/legion/legion.inl:20346
#16 0x000055d5a7fd14ba in Realm::LocalTaskProcessor::execute_task (this=0x55da8ddc0e30, func_id=12, task_args=...)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/proc_impl.cc:1135
#17 0x000055d5a80386af in Realm::Task::execute_on_processor (this=0x7f9d4c01b4f0, p=...) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:302
#18 0x000055d5a803c90a in Realm::KernelThreadTaskScheduler::execute_task (this=0x55da8ddc1140, task=0x7f9d4c01b4f0)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:1366
#19 0x000055d5a803b64e in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55da8ddc1140) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:1105
#20 0x000055d5a803bc9f in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55da8ddc1140) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:1217
#21 0x000055d5a8043e34 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55da8ddc1140)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/threads.inl:97
#22 0x000055d5a805110e in Realm::KernelThread::pthread_entry (data=0x55da8ddee280) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/threads.cc:774
#23 0x00007f9da7da3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#24 0x00007f9d9cd0d133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Pull and try again.
It's fixed, thanks. I know it might be too early to try this, but I have been running a couple of jobs on Summit with tracing enabled. The 1-, 2-, and 4-node cases worked. The 8-node case failed, and I reran it in debug mode. The following assertion was triggered:
#8 0x0000000012346240 in Legion::Internal::Runtime::find_messenger (this=0x209b60f0, sid=865271264) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:20770
20770 assert(sid < LEGION_MAX_NUM_NODES);
(gdb) p sid
$1 = 865271264
I know there is still unimplemented functionality for tracing in the collective branch, and I'm sure there are bugs because I haven't tested it much yet. Still, please get a backtrace for the assertion, since it will need to be fixed eventually.
I have found two failure modes. I think the app has not yet reached the explicitly traced section.
Failure 1:
Thread 12 (Thread 0x200052acf890 (LWP 864596)):
#0 0x000020000a7ca114 in nanosleep () from /lib64/power9/libc.so.6
#1 0x000020000a7c9f44 in sleep () from /lib64/power9/libc.so.6
#2 0x000000001396336c in Realm::realm_freeze (signal=6) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/runtime_impl.cc:179
#3 <signal handler called>
#4 0x000020000a723618 in raise () from /lib64/power9/libc.so.6
#5 0x000020000a703a2c in abort () from /lib64/power9/libc.so.6
#6 0x000020000a716f70 in __assert_fail_base () from /lib64/power9/libc.so.6
#7 0x000020000a717014 in __assert_fail () from /lib64/power9/libc.so.6
#8 0x0000000012f69214 in Legion::Internal::EquivalenceSet::process_replication_response (this=0x200079023460, owner=19) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:12090
#9 0x0000000012f8d1e8 in Legion::Internal::EquivalenceSet::handle_replication_response (derez=..., runtime=0x293d34d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:17383
#10 0x00000000123566a4 in Legion::Internal::Runtime::handle_equivalence_set_replication_response (this=0x293d34d0, derez=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:24951
#11 0x000000001231a850 in Legion::Internal::VirtualChannel::handle_messages (this=0x200078427a90, num_messages=1, runtime=0x293d34d0, remote_address_space=0, args=0x20005c18d500 "", arglen=28) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:12794
#12 0x00000000123189bc in Legion::Internal::VirtualChannel::process_message (this=0x200078427a90, args=0x20005c18d4e4, arglen=48, runtime=0x293d34d0, remote_address_space=0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:11712
#13 0x000000001231b7dc in Legion::Internal::MessageManager::receive_message (this=0x20007821d6a0, args=0x20005c18d4e0, arglen=56) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:13344
#14 0x000000001235b1dc in Legion::Internal::Runtime::process_message_task (this=0x293d34d0, args=0x20005c18d4dc, arglen=60) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:26035
#15 0x00000000123790c0 in Legion::Internal::Runtime::legion_runtime_task (args=0x20005c18d4d0, arglen=64, userdata=0x293d0190, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:31932
#16 0x000000001393af20 in Realm::LocalTaskProcessor::execute_task (this=0x292de7c0, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#17 0x00000000139d4e68 in Realm::Task::execute_on_processor (this=0x20005c9a2b90, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#18 0x00000000139db2a4 in Realm::UserThreadTaskScheduler::execute_task (this=0x2881dbc0, task=0x20005c9a2b90) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#19 0x00000000139d8950 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x2881dbc0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#20 0x00000000139e5cd0 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x2881dbc0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#21 0x00000000139fc7d0 in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1337
#22 0x000020000a737ffc in makecontext () from /lib64/power9/libc.so.6
#23 0x0000200c08000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Failure 2:
Thread 11 (Thread 0x200052a6f890 (LWP 428008)):
#0 0x000020000a7ca114 in nanosleep () from /lib64/power9/libc.so.6
#1 0x000020000a7c9f44 in sleep () from /lib64/power9/libc.so.6
#2 0x000000001396336c in Realm::realm_freeze (signal=6) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/runtime_impl.cc:179
#3 <signal handler called>
#4 0x000020000a723618 in raise () from /lib64/power9/libc.so.6
#5 0x000020000a703a2c in abort () from /lib64/power9/libc.so.6
#6 0x000020000a716f70 in __assert_fail_base () from /lib64/power9/libc.so.6
#7 0x000020000a717014 in __assert_fail () from /lib64/power9/libc.so.6
#8 0x0000000012346278 in Legion::Internal::Runtime::find_messenger (this=0x29de6100, sid=0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:20771
#9 0x000000001234e298 in Legion::Internal::Runtime::send_equivalence_set_replication_response (this=0x29de6100, target=0, rez=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:22692
#10 0x0000000012f69330 in Legion::Internal::EquivalenceSet::process_replication_response (this=0x2022341d9e90, owner=33) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:12103
#11 0x0000000012f8d1e8 in Legion::Internal::EquivalenceSet::handle_replication_response (derez=..., runtime=0x29de6100) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:17383
#12 0x00000000123566a4 in Legion::Internal::Runtime::handle_equivalence_set_replication_response (this=0x29de6100, derez=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:24951
#13 0x000000001231a850 in Legion::Internal::VirtualChannel::handle_messages (this=0x20007c245950, num_messages=1, runtime=0x29de6100, remote_address_space=33, args=0x200074eca6a0 "", arglen=28) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:12794
#14 0x00000000123189bc in Legion::Internal::VirtualChannel::process_message (this=0x20007c245950, args=0x200074eca684, arglen=48, runtime=0x29de6100, remote_address_space=33) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:11712
#15 0x000000001231b7dc in Legion::Internal::MessageManager::receive_message (this=0x20007c22de90, args=0x200074eca680, arglen=56) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:13344
#16 0x000000001235b1dc in Legion::Internal::Runtime::process_message_task (this=0x29de6100, args=0x200074eca67c, arglen=60) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:26035
#17 0x00000000123790c0 in Legion::Internal::Runtime::legion_runtime_task (args=0x200074eca670, arglen=64, userdata=0x2998ba30, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:31932
#18 0x000000001393af20 in Realm::LocalTaskProcessor::execute_task (this=0x29d6c740, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#19 0x00000000139d4e68 in Realm::Task::execute_on_processor (this=0x200074e672d0, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#20 0x00000000139db2a4 in Realm::UserThreadTaskScheduler::execute_task (this=0x293aa840, task=0x200074e672d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#21 0x00000000139d8950 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x293aa840) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#22 0x00000000139e5cd0 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x293aa840) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#23 0x00000000139fc7d0 in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1337
#24 0x000020000a737ffc in makecontext () from /lib64/power9/libc.so.6
#25 0x0000000000000000 in ?? ()
gdb2_1.txt
Another process on the same node was stuck at the previous line, assert(sid < LEGION_MAX_NUM_NODES);, which is what I described yesterday.
Pull and try again to see if the particular crashes above are gone. I think I pushed a fix for them, but I still have not done the general work for tracing that needs to be done to fully support it.
The above are fixed and I was able to run on 8 and 16 nodes. The 32-node case failed in release mode with behavior similar to the issue I described in #1235. I ran in debug mode and here is the backtrace:
Thread 12 (Thread 0x20005793f890 (LWP 266729)):
#0 0x000020000a7ca114 in nanosleep () from /lib64/power9/libc.so.6
#1 0x000020000a7c9f44 in sleep () from /lib64/power9/libc.so.6
#2 0x0000000011839bc0 in gasneti_freezeForDebuggerNow () at /autofs/nccs-svm1_sw/summit/gcc/9.3.0-2/include/c++/9.3.0/ext/new_allocator.h:89
#3 0x000000001484ba0c in gasneti_freezeForDebuggerErr ()
#4 0x0000000011839fdc in gasneti_defaultSignalHandler () at /autofs/nccs-svm1_sw/summit/gcc/9.3.0-2/include/c++/9.3.0/ext/new_allocator.h:89
#5 <signal handler called>
#6 0x000020000a723618 in raise () from /lib64/power9/libc.so.6
#7 0x000020000a703a2c in abort () from /lib64/power9/libc.so.6
#8 0x000020000a716f70 in __assert_fail_base () from /lib64/power9/libc.so.6
#9 0x000020000a717014 in __assert_fail () from /lib64/power9/libc.so.6
#10 0x00000000131b8b38 in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x200f88d7b3c0, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
#11 0x00000000131bcdbc in Legion::Internal::PhysicalManager::compute_copy_offsets (this=0x200f88d93da0, copy_mask=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:1141
#12 0x00000000120f9310 in Legion::Internal::IndividualView::copy_from (this=0x200f88d94240, src_view=0x201c5a8f6c20, precondition=..., predicate_guard=..., reduction_op_id=0, copy_expression=0x200f8bbefdc0, op=0x200f88d929c0, index=0, copy_mask=..., src_point=0x201c5a8f67f0, trace_info=..., recorded_events=..., applied_events=..., across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_views.cc:2391
#13 0x0000000012f3fb2c in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c5a926710, target=0x200f88d94240, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5485
#14 0x0000000012f3e900 in Legion::Internal::CopyFillAggregator::perform_updates (this=0x201c5a926710, updates=..., trace_info=..., precondition=..., recorded_events=..., redop_index=-1, manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5311
#15 0x0000000012f3defc in Legion::Internal::CopyFillAggregator::issue_updates (this=0x201c5a926710, trace_info=..., precondition=..., restricted_output=false, manage_dst_events=true, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5153
#16 0x0000000012f4c3c0 in Legion::Internal::UpdateAnalysis::perform_updates (this=0x200f88d94d90, perform_precondition=..., applied_events=..., already_deferred=true) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:7432
#17 0x0000000012f43c24 in Legion::Internal::PhysicalAnalysis::handle_deferred_update (args=0x200f88d95900) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:6163
#18 0x0000000012379b70 in Legion::Internal::Runtime::legion_runtime_task (args=0x200f88d95900, arglen=20, userdata=0x4c2306d0, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:32377
#19 0x000000001393b1a0 in Realm::LocalTaskProcessor::execute_task (this=0x4c170df0, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#20 0x00000000139d50e8 in Realm::Task::execute_on_processor (this=0x200f88d95780, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#21 0x00000000139db524 in Realm::UserThreadTaskScheduler::execute_task (this=0x4ad6e8d0, task=0x200f88d95780) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#22 0x00000000139d8bd0 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4ad6e8d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#23 0x00000000139e5f50 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x4ad6e8d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#24 0x00000000139fca50 in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1337
#25 0x000020000a737ffc in makecontext () from /lib64/power9/libc.so.6
#26 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Attach a debugger and print out the following values in frame 10:
p/x copy_mask
p/x compressed
p/x allocated_fields
p found_in_cache
#8 0x00000000131b8b38 in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x201c4482bc80, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
363 assert(pop_count == FieldMask::pop_count(copy_mask));
(gdb) p/x copy_mask
$2 = (const Legion::Internal::FieldMask &) @0x200076238140: {<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>}, static ELEMENT_SIZE = 0x40, static BIT_ELMTS = 0x8, static PPC_ELMTS = 0x4, static MAXSIZE = <optimized out>, bits = {static ELEMENT_SIZE = 0x40, bit_vector = {0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, static SHIFT = <optimized out>, static MASK = <optimized out>}, sum_mask = 0x80}
(gdb) p/x compressed
$3 = {<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>}, static ELEMENT_SIZE = 0x40, static BIT_ELMTS = 0x8, static PPC_ELMTS = 0x4, static MAXSIZE = <optimized out>, bits = {static ELEMENT_SIZE = 0x40, bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, static SHIFT = <optimized out>, static MASK = <optimized out>}, sum_mask = 0x0}
(gdb) p/x allocated_fields
$4 = {<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>}, static ELEMENT_SIZE = 0x40, static BIT_ELMTS = 0x8, static PPC_ELMTS = 0x4, static MAXSIZE = <optimized out>, bits = {static ELEMENT_SIZE = 0x40, bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, static SHIFT = <optimized out>, static MASK = <optimized out>}, sum_mask = 0x0}
(gdb) p found_in_cache
$5 = false
This also looks like the mapper is doing something weird and creating an instance without any fields in it. I can make Legion support that case, but in general instances without any fields seem kind of useless. Do you have a reason for making an instance without any fields?
There are maybe a few tasks during start-up for which I do that. This was an easy way to get a handle to the logical sub-regions. I remember in the past that I sometimes got errors when I didn't provide any fields. But I don't think I have such tasks when I actually start to iterate which is when it seems to fail. Is there a way for me to understand which region requirement of which task is triggering this?
If you go to frame 13 of the backtrace above, you should be able to tell which task/operation and region requirement caused this particular copy. p this->op will give you a pointer to the operation, and p this->op->get_operation_kind() will tell you its kind. Presumably it is a task, so you can then do something like p ((TaskOp*)this->op)->get_task_name() to get the name of the task. If you set the provenance, then p this->op->provenance->human should show you the provenance string. You can get the index of the region requirement using p this->index.
I'm still a little bit unsure how we even managed to be using a copy from an instance without any fields. Really that should never be happening. Maybe you'll get some semantic insight into how we ended up with a source instance that has empty fields. This only happens with 32 nodes and not at a smaller scale?
This only happens with 32 nodes and not at a smaller scale?
Yes.
I have a way to dump the re-ordered mesh so that I can skip this step on the next run. When I did that, the run succeeded.
I am trying to follow your instructions, but I am having a hard time making it freeze. I'll let you know when I have more insight.
I went to frame 13 and it said that there is no member or method named op.
Did you try p this->op? It is definitely there:
Were you able to find the operation using the debugger? If it doesn't work, try loading my gdb init file that can be found under my home directory on sapling. Run this in your debugger:
source .gdbinit
It only freezes once out of 20 times and I haven't been able to make it freeze again since last time. I'll try your gdb init file next time it freezes.
I tried your init file and was still unable to find this->op.
What does p this show as the type with my gdb plugin?
(gdb) p this
$2 = (Legion::Internal::CopyFillAggregator * const) 0x201c52499620
Ok, I forgot I refactored this. Try:
p this->analysis->op
I got this
#20 0x00000000131b8b38 in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x201c45171280, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
(gdb) p/x copy_mask
$5 = (const Legion::Internal::FieldMask &) @0x201c4c4e1210: {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
},
sum_mask = 0x80
}
(gdb) p/x compressed
$6 = {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
},
sum_mask = 0x0
}
(gdb) p/x allocated_fields
$7 = {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
},
sum_mask = 0x0
}
(gdb) p found_in_cache
$8 = false
#23 0x0000000012f3fb2c in Legion::Internal::CopyFillAggregator::issue_copies (this=0x200cdbc43930, target=0x201c5442f6b0, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5485
(gdb) p this->analysis->op
$2 = (Legion::Internal::RemoteTaskOp * const) 0x201c54c733b0
(gdb) p ( (RemoteTaskOp*) this->analysis->op )->get_task_name()
$3 = 0x20007c21e2f0 "pack_for_explicit_ghosts_task"
(gdb) p this->analysis->index
$4 = 0
Region requirement 0 of pack_for_explicit_ghosts_task asks for a second-level sub-region (with some fields) in a 2D index task launch with a custom projection functor: point task (i,j) gets the j-th sub-region of the i-th first-level sub-region. The associated sub-domain can be empty, but the fact that it works when I start from a reordered mesh file makes me think this is not the problem. Point task (i,j) gets mapped to rank j.
What is the result of this (try to use my gdb plugin if you can)?
p this->analysis->analysis_expr->is_empty()
p this->analysis->analysis_expr->realm_index_space
p this->analysis->usage
It's pretty surprising to me that we can make it here with an empty copy_mask. The only way I think that can happen is if we also have an empty index space expression for the analysis (e.g. the index space for the region for this task is empty). The ordering of parallel analyses might also matter. Is this a read-only region requirement?
I got this
#11 0x0000000012f3fb2c in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c545e7db0, target=0x201c49790cd0, copies=..., recorded_events=..., precondition=...,
copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5485
warning: Source file is more recent than executable.
5485 //--------------------------------------------------------------------------
(gdb) p this->analysis->analysis_expr->is_empty()
$1 = false
(gdb) p this->analysis->analysis_expr->realm_index_space
$2 = {
bounds = {
lo = {
x = 858842
},
hi = {
x = 858945
}
},
sparsity = {
id = 0
}
}
(gdb) p this->analysis->usage
$3 = {
privilege = LEGION_READ_PRIV,
prop = LEGION_EXCLUSIVE,
redop = 0
}
Yes, it's a read-only region requirement.
Have you tried running this with the safe mapper checks (-lg:safe_mapper)? I'm trying to understand how we were able to record an instance without any fields as a valid copy of the data for this non-empty region (we'd allow it if the region itself were empty).
I added the following flags: -lg:warn -lg:partcheck -lg:safe_ctrlrepl 1 -lg:safe_mapper. Nothing was caught.
Can you capture the detailed Legion Spy logs from a failing run? Also get the following from the crashed thread in the same run:
p this->analysis->op->unique_op_id (frame 13)
p this->analysis->index (frame 13)
p/x this->instance.id (frame 11)
p copy_mask (frame 12)
(gdb) p Realm::Network::my_node_id
$1 = 60
(gdb) up 9
#9 0x000000001319dfc8 in Legion::Internal::PhysicalManager::compute_copy_offsets (this=0x201c3ae2fc50, copy_mask=..., fields=...)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:1132
1132 layout->compute_copy_offsets(copy_mask, instance, fields);
(gdb) p/x this->instance.id
$2 = 0x4010001000800022
(gdb) up
#10 0x00000000120d81dc in Legion::Internal::IndividualView::copy_from (this=0x201c3ae300d0, src_view=0x201c47825be0, precondition=..., predicate_guard=..., reduction_op_id=0,
copy_expression=0x201c47811910, op=0x201c3ae2f090, index=0, copy_mask=..., src_point=0x201c478257a0, trace_info=..., recorded_events=..., applied_events=...,
across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_views.cc:2403
2403 manager->compute_copy_offsets(copy_mask, dst_fields);
(gdb) p copy_mask
$3 = (const Legion::Internal::FieldMask &) @0x201c52eb7cc0: {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {128, 0, 0, 0, 0, 0, 0, 0}
},
sum_mask = 128
}
(gdb) up
#11 0x0000000012f2a250 in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c3af4db20, target=0x201c3ae300d0, copies=..., recorded_events=..., precondition=...,
copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5416
5416 track_events);
(gdb) p this->analysis->index
$4 = 0
(gdb) p this->analysis->op->unique_op_id
$5 = 128128
The associated logs can be found at /scratch2/bandokihiro/Issue1344/logs_20221202.
Ok, it looks like this instance is not actually empty. Somehow the layout description for the instance on this remote node is not picking up the fact that there are valid fields in the instance, and that is causing the problem. In frame 10 of the above stack trace, let's see what the outputs of the following show:
p this->allocated_fields
p *(this->constraints)
If you can use my gdb plugin for STL data structures, then let's also see what the output of this is:
pvector this->constraints->field_constraint.field_set
If possible do this with Legion Spy logging and get new logs as well as the above data when it crashes. You can overwrite the old log files since they are a lot of data.
Please find the logs at /scratch2/bandokihiro/Issue1344/logs_20221214. Since gdb's output was rather long, I also left gdb.txt there with the outputs of the commands you wanted me to try.
Can you dump Legion Spy logs of just the part of the code that does the initialization when you're NOT restarting from a previous checkpoint? Do not run any of the simulation looping; we just want the initialization part. Somehow an instance with no fields is being recorded as a valid copy of the data for a field. It seems like it has to happen during initialization, since you can't reproduce the error when you restart from an existing checkpoint.
I put the logs of the initialization phase at /scratch2/bandokihiro/Issue1344/logs_20221217.
I pushed a fix for a related bug to the collective branch. Can you pull and see if it fixes your issue as well?
I pulled and it did not fix the issue.
Is there gdb output for the files above?
The last set of logs do not contain the iterative part of the solve, so it exited cleanly.
Ok, not particularly helpful then, since it's hard to track down these instances. Let's try a different tack: apply this patch and see where we're making these layouts with no fields, as that should never be happening. Check whether you hit either of these assertions at smaller node counts before going up to larger node counts:
diff --git a/runtime/legion/legion_instances.cc b/runtime/legion/legion_instances.cc
index 8189ef92d..a096a10d5 100644
--- a/runtime/legion/legion_instances.cc
+++ b/runtime/legion/legion_instances.cc
@@ -241,6 +241,7 @@ namespace Legion {
: allocated_fields(mask), constraints(con), owner(own), total_dims(dims)
//--------------------------------------------------------------------------
{
+ assert(!!allocated_fields);
constraints->add_base_gc_ref(LAYOUT_DESC_REF);
field_infos.resize(field_sizes.size());
// Switch data structures from layout by field order to order
@@ -271,6 +272,7 @@ namespace Legion {
: allocated_fields(mask), constraints(con), owner(NULL), total_dims(0)
//--------------------------------------------------------------------------
{
+ assert(!!allocated_fields);
constraints->add_base_gc_ref(LAYOUT_DESC_REF);
}
I have been trying for a while and I haven't been able to make it hang. Similarly, with GASNET_BACKTRACE=1, the job gets killed before the program has time to invoke gdb.
As in it crashes and then doesn't freeze? If REALM_FREEZE_ON_ERROR=1 isn't working, then you can also try GASNET_FREEZE_ON_ERROR=1. They are fundamentally doing the same thing though, so I wouldn't expect very different behavior. Usually the reason it would die like that with freeze-on-error is that one of the processes actually exits cleanly, and then the runner (e.g. mpirun) comes through and kills the other processes. Is there any evidence of that happening (mpirun will usually tell you when it does)? Other than that, the only other thing that might be happening is an OOM error where the OS is killing a process, but I would be surprised if that were the case here.
Is the original failure mode still reproducible?
The original failure mode was already difficult to make hang; the assertion we added has been even harder (I wasn't able to make it hang once). I see the "going to freeze" message for a subset of processes, then the job gets killed a few seconds later. I tried both the Realm and the GASNet environment variables (in my experience, GASNet's is a bit more reliable). I don't think one of the processes can exit cleanly, since data needs to move around at each iteration and I set up the problem for 20 iterations.
Are you still observing this bug?
Yes. I ran in release mode and observed the same failure mode as documented in #1235.
/gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/transfer/transfer.cc:4015: void Realm::TransferDesc::perform_analysis(): Assertion `srcs.size() == dsts.size()' failed.
My debugging attempt stalled when I wasn't able to make the program hang with the patch you gave me above.
I played with the collective branch on sapling. Since the mode of failure has changed since the last time I tried (it was a mapper error then), and for future debugging if needed, I'll report the errors I am triggering here.
The backtrace associated with the current mode of failure is the following:
The command line was the following