StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: Assertion failure in modified Circuit #1618

Closed rupanshusoi closed 4 months ago

rupanshusoi commented 5 months ago

I'm running a modified version of the Regent Circuit app on the latest control_replication branch (98f6f2) on Sapling. 1-node runs work fine, but on 2 nodes I get:

circuit_ON_750_0: /home/rupanshu/legion/runtime/legion/legion_analysis.cc:4640: void Legion::Internal::LogicalState::sanity_check() const: Assertion `!(finder->second & it->second)' failed. 
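For context, the failing check tests that two field masks are disjoint: Legion's FieldMask supports `operator&` for intersection and `operator!` for an emptiness test, so `assert(!(finder->second & it->second))` fires when two masks that the logical state expects to be disjoint actually overlap. A simplified stand-in (not Legion's actual FieldMask type, just an illustration of the invariant):

```cpp
#include <bitset>

// Toy stand-in for Legion's FieldMask: a fixed-width bitset where each
// bit corresponds to one field. The real sanity_check() asserts that
// field sets recorded for interfering states never overlap.
using FieldMaskSketch = std::bitset<64>;

// Mirrors the spirit of `!(finder->second & it->second)`:
// an empty intersection means the two masks are disjoint.
bool masks_disjoint(const FieldMaskSketch &a, const FieldMaskSketch &b) {
  return (a & b).none();
}
```

Under this reading, the assertion failure indicates the runtime recorded overlapping field sets where its logical analysis expected disjoint ones.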

This version of Circuit is modified so that the main inner loop is outlined into a new wrapper task launched by the top-level task. The wrapper task is not control-replicated in this run; the rest of the code is essentially unchanged.

Full backtrace:

circuit_ON_750_0: /home/rupanshu/legion/runtime/legion/legion_analysis.cc:4640: void Legion::Internal::LogicalState::sanity_check() const: Assertion `!(finder->second & it->second)' failed.
*** Caught a fatal signal (proc 0): SIGABRT(6)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_IeN9Hy '/home/rupanshu/restart/regent-circuit/./build/circuit_ON_750_0.dir/circuit_ON_750_0' 916195
[0] [New LWP 916201]
[0] [New LWP 916202]
[0] [New LWP 916212]
[0] [New LWP 916213]
[0] [New LWP 916214]
[0] [New LWP 916215]
[0] [New LWP 916216]
[0] [New LWP 916217]
[0] [New LWP 916218]
[0] [New LWP 916219]
[0] [New LWP 916220]
[0] [New LWP 916221]
[0] [New LWP 916222]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[0] syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   Id   Target Id                                            Frame 
[0] * 1    Thread 0x7f0007e35c80 (LWP 916195) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   2    Thread 0x7f0007e0b700 (LWP 916201) "circuit_ON_750_" 0x00007f000c5d4bbf in __GI___poll (fds=0x7f0000000b60, nfds=2, timeout=3599920) at ../sysdeps/unix/sysv/linux/poll.c:29
[0]   3    Thread 0x7f00073ea700 (LWP 916202) "circuit_ON_750_" 0x00007f000c5e168e in epoll_wait (epfd=11, events=0x55fd25522280, maxevents=32, timeout=119952) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
[0]   4    Thread 0x7effff57b700 (LWP 916212) "cuda-EvtHandlr"  0x00007f000c5d4bbf in __GI___poll (fds=0x55fd258fbe70, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
[0]   5    Thread 0x7efffdda0700 (LWP 916213) "cuda-EvtHandlr"  0x00007f000c5d4bbf in __GI___poll (fds=0x7effd4000c70, nfds=13, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
[0]   6    Thread 0x7efffd0fdc80 (LWP 916214) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   7    Thread 0x7efffcff9c80 (LWP 916215) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   8    Thread 0x7efffcef5c80 (LWP 916216) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   9    Thread 0x7efffcdf1c80 (LWP 916217) "circuit_ON_750_" Realm::IntrusiveList<Realm::XmitSrcDestPair, &Realm::XmitSrcDestPair::xpair_list_link, Realm::DummyLock>::empty (this=0x55fd25490ed8) at /home/rupanshu/legion/runtime/realm/lists.inl:142
[0]   10   Thread 0x7f0006048c80 (LWP 916218) "circuit_ON_750_" 0x00007f000c5a4c7f in __GI___wait4 (pid=916223, stat_loc=stat_loc@entry=0x7ef492912858, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
[0]   11   Thread 0x7f0005f87c80 (LWP 916219) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   12   Thread 0x7efffccedc80 (LWP 916220) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0]   13   Thread 0x7f0004080c80 (LWP 916221) "circuit_ON_750_" 0x00007f0011dd80a4 in std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_M_allocate(unsigned long)@plt () from /home/rupanshu/legion/language/build/lib/libregent.so.1
[0]   14   Thread 0x7efffcae9c80 (LWP 916222) "circuit_ON_750_" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] 
[0] Thread 14 (Thread 0x7efffcae9c80 (LWP 916222)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7efffcae9800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7efffcae9800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000d1a28d3 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x55fd26b03df0, old_counter=3) at /home/rupanshu/legion/runtime/realm/tasks.cc:719
[0] #4  0x00007f000d1a4de6 in Realm::ThreadedTaskScheduler::wait_for_work (this=0x55fd26b03c30, old_work_counter=3) at /home/rupanshu/legion/runtime/realm/tasks.cc:1294
[0] #5  0x00007f000d1a5e28 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x55fd26b03c30, old_work_counter=3) at /home/rupanshu/legion/runtime/realm/tasks.cc:1528
[0] #6  0x00007f000d1a4beb in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd26b03c30) at /home/rupanshu/legion/runtime/realm/tasks.cc:1260
[0] #7  0x00007f000d1a4cdf in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55fd26b03c30) at /home/rupanshu/legion/runtime/realm/tasks.cc:1272
[0] #8  0x00007f000d1acde6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55fd26b03c30) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #9  0x00007f000d1ba60e in Realm::KernelThread::pthread_entry (data=0x55fd2a2bc1f0) at /home/rupanshu/legion/runtime/realm/threads.cc:831
[0] #10 0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #11 0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 13 (Thread 0x7f0004080c80 (LWP 916221)):
[0] #0  0x00007f0011dd80a4 in std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_M_allocate(unsigned long)@plt () from /home/rupanshu/legion/language/build/lib/libregent.so.1
[0] #1  0x00007f0011dec1ad in std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_M_create_storage (this=0x7ef4a2944350, __n=7) at /usr/include/c++/9/bits/stl_vector.h:358
[0] #2  0x00007f0011de7ad3 in std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_Vector_base (this=0x7ef4a2944350, __n=7, __a=...) at /usr/include/c++/9/bits/stl_vector.h:302
[0] #3  0x00007f000fd24429 in std::vector<unsigned int, std::allocator<unsigned int> >::vector (this=0x7ef4a2944350, __x=std::vector of length 7, capacity 8 = {...}) at /usr/include/c++/9/bits/stl_vector.h:552
[0] #4  0x00007f001013130f in Legion::RegionRequirement::RegionRequirement (this=0x7ef4a29442f8, rhs=...) at /home/rupanshu/legion/runtime/legion/legion.cc:1067
[0] #5  0x00007f000fdaa942 in std::_Construct<Legion::RegionRequirement, Legion::RegionRequirement const&> (__p=0x7ef4a29442f8) at /usr/include/c++/9/bits/stl_construct.h:75
[0] #6  0x00007f000ffde813 in std::__uninitialized_copy<false>::__uninit_copy<Legion::RegionRequirement const*, Legion::RegionRequirement*> (__first=0x7ef4a2925708, __last=0x7ef4a29257c0, __result=0x7ef4a29440d0) at /usr/include/c++/9/bits/stl_uninitialized.h:83
[0] #7  0x00007f000ffd7b74 in std::uninitialized_copy<Legion::RegionRequirement const*, Legion::RegionRequirement*> (__first=0x7ef4a29254e0, __last=0x7ef4a29257c0, __result=0x7ef4a29440d0) at /usr/include/c++/9/bits/stl_uninitialized.h:140
[0] #8  0x00007f000ffce40e in std::__uninitialized_copy_a<Legion::RegionRequirement const*, Legion::RegionRequirement*, Legion::RegionRequirement> (__first=0x7ef4a29254e0, __last=0x7ef4a29257c0, __result=0x7ef4a29440d0) at /usr/include/c++/9/bits/stl_uninitialized.h:307
[0] #9  0x00007f000ffc0925 in std::__uninitialized_move_if_noexcept_a<Legion::RegionRequirement*, Legion::RegionRequirement*, std::allocator<Legion::RegionRequirement> > (__first=0x7ef4a29254e0, __last=0x7ef4a29257c0, __result=0x7ef4a29440d0, __alloc=...) at /usr/include/c++/9/bits/stl_uninitialized.h:329
[0] #10 0x00007f000ffb48f1 in std::vector<Legion::RegionRequirement, std::allocator<Legion::RegionRequirement> >::_M_realloc_insert<Legion::RegionRequirement const&> (this=0x7ef4a2909b18, __position={region = {static NO_REGION = {static NO_REGION = <same as static member of an already seen type>, tree_id = 0, index_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0, tid = 0, type_tag = 0}, id = 0, tid = 0, type_tag = 0}, field_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0}, id = 0}}, tree_id = 0, index_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0, tid = 0, type_tag = 0}, id = 0, tid = 389, type_tag = 0}, field_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0}, id = 2727498592}}, partition = {static NO_PART = {static NO_PART = <same as static member of an already seen type>, tree_id = 0, index_pa[0] rtition = {static NO_PART = {static NO_PART = <same as static member of an already seen type>, id = 0, tid = 0, type_tag = 0}, id = 0, tid = 0, type_tag = 0}, field_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0}, id = 0}}, tree_id = 32500, index_partition = {static NO_PART = {static NO_PART = <same as static member of an already seen type>, id = 0, tid = 0, type_tag = 0}, id = 2684356816, tid = 32500, type_tag = 0}, field_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0}, id = 2}}, privilege_fields = std::set with 139589168031184 elements<error reading variable: Cannot access memory at address 0x18>, instance_fields = std::vector of length -34897292007795, capacity 56 = {1, 32500, 2730911136, 32500, 0, 0, 0, 0, 101, 0, 53, 0, 1, 32500, 2727499024, 32500, 0, 0, 0, 0, 104, 0, 69, 0, 0, 0, 2684356816, 32500, 0, 0, 2730913600, 32500, 103, 0, 0, 0, 64, 0, 
37, 0, 2728228000, 32500, 0, 0, 2416134672, 325[0] 00, 37, 0, 104, 32500, 0, 0, 0, 0, 37, 0, 104, 32500, 0, 0, 32, 0, 37, 0, 101, 102, 0, 0, 0, 0, 53, 0, 0, 32500, 2730878736, 32500, 0, 0, 0, 0, 104, 0, 53, 0, 1, 32500, 2730917072, 32500, 0, 0, 2730911568, 32500, 101, 2147483648, 53, 0, 0, 32500, 2730911520, 32500, 0, 0, 0, 0, 102, 0, 53, 0, 1, 32500, 2730917256, 32500, 0, 0, 2730911664, 32500, 103, 0, 53, 0, 0, 32500, 2730911616, 32500, 0, 0, 0, 0, 104, 0, 37, 0, 103, 104, 2684354688, 32500, 2730911712, 32500, 53, 0, 1, 32500, 2730917440, 32500, 2730911888, 32500, 2730911792, 32500, 102, 0, 53, 0, 1, 32500, 2730911744, 32500, 0, 0, 2730911840, 32500, 103, 469762048, 53, 0, 0, 32500, 2730911792, 32500, 0, 0, 0, 0, 104, 32500, 53, 0, 1, 32500, 2730911744, 32500, 0, 0, 0, 0, 101, 0, 37, 0, 101, 102, 103, 104, 2730911944, 32500, 37, 0, 1, 32500, 2730911008, 32500...}, privilege = -1564055884, prop = 32500, parent = {static NO_REGION = {static NO_REGION = <same as static member of an already seen type>, tree_id = 0, index_space = {static NO_SPACE = {static N[0] O_SPACE = <same as static member of an already seen type>, id = 0, tid = 0, type_tag = 0}, id = 0, tid = 0, type_tag = 0}, field_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0}, id = 0}}, tree_id = 2730911412, index_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0, tid = 0, type_tag = 0}, id = 32500, tid = 1, type_tag = 0}, field_space = {static NO_SPACE = {static NO_SPACE = <same as static member of an already seen type>, id = 0}, id = 2}}, redop = 7, tag = 1108101562373, flags = LEGION_NO_ACCESS_FLAG, handle_type = LEGION_SINGULAR_PROJECTION, projection = 0, projection_args = 0x100000000, projection_args_size = 0}) at /usr/include/c++/9/bits/vector.tcc:474
[0] #11 0x00007f000ffaac75 in std::vector<Legion::RegionRequirement, std::allocator<Legion::RegionRequirement> >::push_back (this=0x7ef4a2909b18, __x=...) at /usr/include/c++/9/bits/stl_vector.h:1195
[0] #12 0x00007f000ffa2ed1 in Legion::IndexTaskLauncher::add_region_requirement (this=0x7ef4a2909aa0, req=...) at /home/rupanshu/legion/runtime/legion/legion.inl:18260
[0] #13 0x00007f000ff8e21d in legion_index_launcher_add_region_requirement_logical_partition (launcher_=..., handle_=..., proj=0, priv=LEGION_READ_WRITE, prop=LEGION_EXCLUSIVE, parent_=..., tag=0, verified=false) at /home/rupanshu/legion/runtime/legion/legion_c.cc:4074
[0] #14 0x000055fd24a90e30 in $<wrapper> ()
[0] #15 0x000055fd24a8c224 in $__regent_task_wrapper_primary ()
[0] #16 0x00007f000d11d110 in Realm::LocalTaskProcessor::execute_task (this=0x55fd26bccc90, func_id=50, task_args=...) at /home/rupanshu/legion/runtime/realm/proc_impl.cc:1176
[0] #17 0x00007f000d1a154e in Realm::Task::execute_on_processor (this=0x7ef4a2928ca0, p=...) at /home/rupanshu/legion/runtime/realm/tasks.cc:326
[0] #18 0x00007f000d1a68f2 in Realm::UserThreadTaskScheduler::execute_task (this=0x55fd26bcd080, task=0x7ef4a2928ca0) at /home/rupanshu/legion/runtime/realm/tasks.cc:1687
[0] #19 0x00007f000d1a468e in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd26bcd080) at /home/rupanshu/legion/runtime/realm/tasks.cc:1160
[0] #20 0x00007f000d1ad12a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x55fd26bcd080) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #21 0x00007f000d1bc4aa in Realm::UserThread::uthread_entry () at /home/rupanshu/legion/runtime/realm/threads.cc:1405
[0] #22 0x00007f000c51d4e0 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6
[0] #23 0x0000000000000000 in ?? ()
[0] 
[0] Thread 12 (Thread 0x7efffccedc80 (LWP 916220)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7efffcced800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7efffcced800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000d1a28d3 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x55fd2559df80, old_counter=1) at /home/rupanshu/legion/runtime/realm/tasks.cc:719
[0] #4  0x00007f000d1a4de6 in Realm::ThreadedTaskScheduler::wait_for_work (this=0x55fd2559ddc0, old_work_counter=1) at /home/rupanshu/legion/runtime/realm/tasks.cc:1294
[0] #5  0x00007f000d1a5e28 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x55fd2559ddc0, old_work_counter=1) at /home/rupanshu/legion/runtime/realm/tasks.cc:1528
[0] #6  0x00007f000d1a4beb in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd2559ddc0) at /home/rupanshu/legion/runtime/realm/tasks.cc:1260
[0] #7  0x00007f000d1a4cdf in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55fd2559ddc0) at /home/rupanshu/legion/runtime/realm/tasks.cc:1272
[0] #8  0x00007f000d1acde6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55fd2559ddc0) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #9  0x00007f000d1ba60e in Realm::KernelThread::pthread_entry (data=0x55fd2a1f97c0) at /home/rupanshu/legion/runtime/realm/threads.cc:831
[0] #10 0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #11 0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 11 (Thread 0x7f0005f87c80 (LWP 916219)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7f0005f87800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7f0005f87800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000d1a28d3 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0x55fd26bcc640, old_counter=383) at /home/rupanshu/legion/runtime/realm/tasks.cc:719
[0] #4  0x00007f000d1a4de6 in Realm::ThreadedTaskScheduler::wait_for_work (this=0x55fd26bcc480, old_work_counter=383) at /home/rupanshu/legion/runtime/realm/tasks.cc:1294
[0] #5  0x00007f000d1a6c32 in Realm::UserThreadTaskScheduler::wait_for_work (this=0x55fd26bcc480, old_work_counter=383) at /home/rupanshu/legion/runtime/realm/tasks.cc:1795
[0] #6  0x00007f000d1a4beb in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd26bcc480) at /home/rupanshu/legion/runtime/realm/tasks.cc:1260
[0] #7  0x00007f000d1ad12a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x55fd26bcc480) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #8  0x00007f000d1bc4aa in Realm::UserThread::uthread_entry () at /home/rupanshu/legion/runtime/realm/threads.cc:1405
[0] #9  0x00007f000c51d4e0 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6
[0] #10 0x0000000000000000 in ?? ()
[0] 
[0] Thread 10 (Thread 0x7f0006048c80 (LWP 916218)):
[0] #0  0x00007f000c5a4c7f in __GI___wait4 (pid=916223, stat_loc=stat_loc@entry=0x7ef492912858, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
[0] #1  0x00007f000c5a4bfb in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7ef492912858, options=options@entry=0) at waitpid.c:38
[0] #2  0x00007f000c513f67 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:172
[0] #3  0x00007f000d889063 in gasneti_bt_gdb () from /home/rupanshu/legion/language/build/lib/librealm.so.1
[0] #4  0x00007f000d88ce8e in gasneti_print_backtrace () from /home/rupanshu/legion/language/build/lib/librealm.so.1
[0] #5  0x00007f000cccbb33 in gasneti_defaultSignalHandler () from /home/rupanshu/legion/language/build/lib/librealm.so.1
[0] #6  <signal handler called>
[0] #7  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
[0] #8  0x00007f000c4e4859 in __GI_abort () at abort.c:79
[0] #9  0x00007f000c4e4729 in __assert_fail_base (fmt=0x7f000c67a588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f0010e08418 "!(finder->second & it->second)", file=0x7f0010e05f60 "/home/rupanshu/legion/runtime/legion/legion_analysis.cc", line=4640, function=<optimized out>) at assert.c:92
[0] #10 0x00007f000c4f5fd6 in __GI___assert_fail (assertion=0x7f0010e08418 "!(finder->second & it->second)", file=0x7f0010e05f60 "/home/rupanshu/legion/runtime/legion/legion_analysis.cc", line=4640, function=0x7f0010e083d8 "void Legion::Internal::LogicalState::sanity_check() const") at assert.c:101
[0] #11 0x00007f000fe1ee9d in Legion::Internal::LogicalState::sanity_check (this=0x7ef4900361c0) at /home/rupanshu/legion/runtime/legion/legion_analysis.cc:4640
[0] #12 0x00007f00105750c0 in Legion::Internal::RegionTreeNode::merge_new_field_states (this=0x7ef4a2921540, state=..., new_states=std::deque with 1 element = {...}) at /home/rupanshu/legion/runtime/legion/region_tree.cc:16639
[0] #13 0x00007f0010573cc2 in Legion::Internal::RegionTreeNode::siphon_interfering_children (this=0x7ef4a2921540, state=..., analysis=..., closing_mask=..., user=..., privilege_root=..., next_child=0x7ef4a2921ac0, open_below=...) at /home/rupanshu/legion/runtime/legion/region_tree.cc:16424
[0] #14 0x00007f0010571fbe in Legion::Internal::RegionTreeNode::register_logical_user (this=0x7ef4a2921540, privilege_root=..., user=..., path=..., trace_info=..., proj_info=..., user_mask=..., unopened_field_mask=..., refinement_mask=..., logical_analysis=..., refinements=..., root_node=false) at /home/rupanshu/legion/runtime/legion/region_tree.cc:16000
[0] #15 0x00007f00105724e4 in Legion::Internal::RegionTreeNode::register_logical_user (this=0x7ef4a291d690, privilege_root=..., user=..., path=..., trace_info=..., proj_info=..., user_mask=..., unopened_field_mask=..., refinement_mask=..., logical_analysis=..., refinements=..., root_node=false) at /home/rupanshu/legion/runtime/legion/region_tree.cc:16088
[0] #16 0x00007f00105724e4 in Legion::Internal::RegionTreeNode::register_logical_user (this=0x7ef4a290d1d0, privilege_root=..., user=..., path=..., trace_info=..., proj_info=..., user_mask=..., unopened_field_mask=..., refinement_mask=..., logical_analysis=..., refinements=..., root_node=true) at /home/rupanshu/legion/runtime/legion/region_tree.cc:16088
[0] #17 0x00007f001052faff in Legion::Internal::RegionTreeForest::perform_dependence_analysis (this=0x55fd27cd83a0, op=0x7ef4a2927720, idx=1, req=..., proj_info=..., logical_analysis=...) at /home/rupanshu/legion/runtime/legion/region_tree.cc:1649
[0] #18 0x00007f001018c3e1 in Legion::Internal::Operation::analyze_region_requirements (this=0x7ef4a2927720, launch_space=0x0, func=0x0, shard_space=...) at /home/rupanshu/legion/runtime/legion/legion_ops.cc:1298
[0] #19 0x00007f00103b9175 in Legion::Internal::IndividualTask::trigger_dependence_analysis (this=0x7ef4a2927570) at /home/rupanshu/legion/runtime/legion/legion_tasks.cc:6057
[0] #20 0x00007f00107aa9b6 in Legion::Internal::Predicated<Legion::Internal::IndividualTask>::trigger_dependence_analysis (this=0x7ef4a2927570) at /home/rupanshu/legion/runtime/legion/legion_ops.inl:170
[0] #21 0x00007f001018d641 in Legion::Internal::Operation::execute_dependence_analysis (this=0x7ef4a2927720) at /home/rupanshu/legion/runtime/legion/legion_ops.cc:1631
[0] #22 0x00007f0010036b51 in Legion::Internal::InnerContext::process_dependence_stage (this=0x7ef490039460) at /home/rupanshu/legion/runtime/legion/legion_context.cc:8448
[0] #23 0x00007f00100472d6 in Legion::Internal::InnerContext::handle_dependence_stage (args=0x7ef4a2945500) at /home/rupanshu/legion/runtime/legion/legion_context.cc:12305
[0] #24 0x00007f001065fc6d in Legion::Internal::Runtime::legion_runtime_task (args=0x7ef4a2945500, arglen=12, userdata=0x55fd2a1f3c60, userlen=8, p=...) at /home/rupanshu/legion/runtime/legion/runtime.cc:32194
[0] #25 0x00007f000d11d110 in Realm::LocalTaskProcessor::execute_task (this=0x55fd26bcbc20, func_id=4, task_args=...) at /home/rupanshu/legion/runtime/realm/proc_impl.cc:1176
[0] #26 0x00007f000d1a154e in Realm::Task::execute_on_processor (this=0x7ef4a2945380, p=...) at /home/rupanshu/legion/runtime/realm/tasks.cc:326
[0] #27 0x00007f000d1a68f2 in Realm::UserThreadTaskScheduler::execute_task (this=0x55fd254d7aa0, task=0x7ef4a2945380) at /home/rupanshu/legion/runtime/realm/tasks.cc:1687
[0] #28 0x00007f000d1a468e in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd254d7aa0) at /home/rupanshu/legion/runtime/realm/tasks.cc:1160
[0] #29 0x00007f000d1ad12a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x55fd254d7aa0) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #30 0x00007f000d1bc4aa in Realm::UserThread::uthread_entry () at /home/rupanshu/legion/runtime/realm/threads.cc:1405
[0] #31 0x00007f000c51d4e0 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6
[0] #32 0x0000000000000000 in ?? ()
[0] 
[0] Thread 9 (Thread 0x7efffcdf1c80 (LWP 916217)):
[0] #0  Realm::IntrusiveList<Realm::XmitSrcDestPair, &Realm::XmitSrcDestPair::xpair_list_link, Realm::DummyLock>::empty (this=0x55fd25490ed8) at /home/rupanshu/legion/runtime/realm/lists.inl:142
[0] #1  0x00007f000d25a236 in Realm::GASNetEXPoller::do_work (this=0x55fd25490df0, work_until=...) at /home/rupanshu/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2810
[0] #2  0x00007f000cffa44c in Realm::BackgroundWorkManager::Worker::do_work (this=0x7efffcdf09c0, max_time_in_ns=-1, interrupt_flag=0x0) at /home/rupanshu/legion/runtime/realm/bgwork.cc:599
[0] #3  0x00007f000cff7d86 in Realm::BackgroundWorkThread::main_loop (this=0x55fd26bcb930) at /home/rupanshu/legion/runtime/realm/bgwork.cc:103
[0] #4  0x00007f000cffbaf4 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x55fd26bcb930) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #5  0x00007f000d1ba60e in Realm::KernelThread::pthread_entry (data=0x55fd257d0960) at /home/rupanshu/legion/runtime/realm/threads.cc:831
[0] #6  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #7  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 8 (Thread 0x7efffcef5c80 (LWP 916216)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7efffcef5800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7efffcef5800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000cff7f21 in Realm::BackgroundWorkThread::main_loop (this=0x55fd26bcb760) at /home/rupanshu/legion/runtime/realm/bgwork.cc:144
[0] #4  0x00007f000cffbaf4 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x55fd26bcb760) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #5  0x00007f000d1ba60e in Realm::KernelThread::pthread_entry (data=0x55fd25916e30) at /home/rupanshu/legion/runtime/realm/threads.cc:831
[0] #6  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #7  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 7 (Thread 0x7efffcff9c80 (LWP 916215)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7efffcff9800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7efffcff9800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000cff7f21 in Realm::BackgroundWorkThread::main_loop (this=0x55fd26b8d8c0) at /home/rupanshu/legion/runtime/realm/bgwork.cc:144
[0] #4  0x00007f000cffbaf4 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x55fd26b8d8c0) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #5  0x00007f000d1ba60e in Realm::KernelThread::pthread_entry (data=0x55fd25916a50) at /home/rupanshu/legion/runtime/realm/threads.cc:831
[0] #6  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #7  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 6 (Thread 0x7efffd0fdc80 (LWP 916214)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7efffd0fd800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7efffd0fd800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000cff7f21 in Realm::BackgroundWorkThread::main_loop (this=0x55fd26bcb670) at /home/rupanshu/legion/runtime/realm/bgwork.cc:144
[0] #4  0x00007f000cffbaf4 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x55fd26bcb670) at /home/rupanshu/legion/runtime/realm/threads.inl:97
[0] #5  0x00007f000d1ba60e in Realm::KernelThread::pthread_entry (data=0x55fd25916c40) at /home/rupanshu/legion/runtime/realm/threads.cc:831
[0] #6  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #7  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 5 (Thread 0x7efffdda0700 (LWP 916213)):
[0] #0  0x00007f000c5d4bbf in __GI___poll (fds=0x7effd4000c70, nfds=13, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
[0] #1  0x00007f000a936d09 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
[0] #2  0x00007f000a9f2ebb in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
[0] #3  0x00007f000a9301a8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
[0] #4  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #5  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 4 (Thread 0x7effff57b700 (LWP 916212)):
[0] #0  0x00007f000c5d4bbf in __GI___poll (fds=0x55fd258fbe70, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
[0] #1  0x00007f000a936d09 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
[0] #2  0x00007f000a9f2ebb in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
[0] #3  0x00007f000a9301a8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
[0] #4  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #5  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 3 (Thread 0x7f00073ea700 (LWP 916202)):
[0] #0  0x00007f000c5e168e in epoll_wait (epfd=11, events=0x55fd25522280, maxevents=32, timeout=119952) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
[0] #1  0x00007f0009d31469 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
[0] #2  0x00007f0009d274a5 in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
[0] #3  0x00007f00074a57c6 in progress_engine () from /usr/local/pmix-4.1.1/lib/libpmix.so.2
[0] #4  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #5  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 2 (Thread 0x7f0007e0b700 (LWP 916201)):
[0] #0  0x00007f000c5d4bbf in __GI___poll (fds=0x7f0000000b60, nfds=2, timeout=3599920) at ../sysdeps/unix/sysv/linux/poll.c:29
[0] #1  0x00007f0009d30801 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
[0] #2  0x00007f0009d274a5 in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
[0] #3  0x00007f0009dda766 in progress_engine () from /usr/local/openmpi-4.1.5/lib/libopen-pal.so.40
[0] #4  0x00007f000a673609 in start_thread (arg=<optimized out>) at pthread_create.c:477
[0] #5  0x00007f000c5e1353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0] 
[0] Thread 1 (Thread 0x7f0007e35c80 (LWP 916195)):
[0] #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
[0] #1  0x00007f000d28ece3 in Realm::Doorbell::wait_slow (this=0x7f0007e35800) at /home/rupanshu/legion/runtime/realm/mutex.cc:264
[0] #2  0x00007f000cffa8aa in Realm::Doorbell::wait (this=0x7f0007e35800) at /home/rupanshu/legion/runtime/realm/mutex.inl:81
[0] #3  0x00007f000d28ffb6 in Realm::UnfairCondVar::wait (this=0x55fd2548fa58) at /home/rupanshu/legion/runtime/realm/mutex.cc:949
[0] #4  0x00007f000d153cfd in Realm::RuntimeImpl::wait_for_shutdown (this=0x55fd2548f8c0) at /home/rupanshu/legion/runtime/realm/runtime_impl.cc:2610
[0] #5  0x00007f000d149fe4 in Realm::Runtime::wait_for_shutdown (this=0x7fff8ffced08) at /home/rupanshu/legion/runtime/realm/runtime_impl.cc:690
[0] #6  0x00007f0010655ad8 in Legion::Internal::Runtime::start (argc=14, argv=0x7fff8ffcf928, background=false, supply_default_mapper=true, filter=false) at /home/rupanshu/legion/runtime/legion/runtime.cc:30126
[0] #7  0x00007f001014e0bc in Legion::Runtime::start (argc=14, argv=0x7fff8ffcf928, background=false, default_mapper=true, filter=false) at /home/rupanshu/legion/runtime/legion/legion.cc:7690
[0] #8  0x00007f000ff9c58f in legion_runtime_start (argc=14, argv=0x7fff8ffcf928, background=false) at /home/rupanshu/legion/runtime/legion/legion_c.cc:7762
[0] #9  0x000055fd24a722b6 in main ()
[0] [Inferior 1 (process 916195) detached]
[0 - 7f0007e35c80]    0.013764 {4}{threads}: reservation ('GPU proc 1d00000000000004') cannot be satisfied
circuit settings: loops=1440 prune=30 pieces=20 (pieces/superpiece=10) nodes/piece=5000 (nodes/piece=50) wires/piece=20000 pct_in_piece=80 seed=12345
Circuit memory usage:
  Nodes :     100000 *   16 bytes =      1600000 bytes
  Wires :     400000 *  120 bytes =     48000000 bytes
  Total                                 49600000 bytes
WARNING: ODP shutdown in signal context
skipping build...
./build/circuit_ON_750_0.dir/circuit_ON_750_0 -npp 5000 -wpp 20000 -l 1440 -p 20 -pps 10 -prune 30 -hl:sched 1024 1 -ll:gpu 1 -ll:io 1 -ll:util 2 -ll:bgwork 4 -ll:csize 15000 -ll:fsize 15000 -ll:zsize 15000 -ll:rsize 0 -ll:gsize 0 -lg:eager_alloc_percentage 10 -lg:no_tracing -level runtime=5
failed: out_2_0_ON_750_0
lightsighter commented 5 months ago

I will need a reproducer that I can run and rebuild myself (next time, please provide one from the beginning). How did you modify the circuit simulation?

lightsighter commented 5 months ago

@rupanshusoi Any update on this? Fixing this is a requirement for merging control replication into master, so I need a reproducer soon.

rupanshusoi commented 5 months ago

It's available here: http://sapling2.stanford.edu/~rupanshu/bug1618/

You can do `sbatch --nodes 2 sbatch_circuit.sh` to see the error. You might have to manually change the location of `regent.py` depending on your setup.

You can see the modifications in this version by grepping for `wrapper`. This task creates copies of every region that exists in the top-level task, then executes a fraction of the main inner loop once or twice (depending on a compile-time parameter). If it executes the loop twice, it will restore every region from its copy beforehand. Let me know if you need more information.

lightsighter commented 5 months ago

Pull the latest control replication branch and try again.

rupanshusoi commented 5 months ago

Fixed, thanks!

rupanshusoi commented 5 months ago

I am reopening this issue because I'm hitting another assertion failure with this application:

circuit_ON_10_0: /global/u2/r/rsoi/legion/runtime/legion/legion_context.cc:22900: virtual Legion::Internal::RtEvent Legion::Internal::RemoteContext::compute_equivalence_sets(unsigned int, const std::vector<Legion::Internal::EqSetTracker*>&, const std::vector<unsigned  int>&, Legion::AddressSpaceID, Legion::Internal::IndexSpaceExpression*, const Legion::Internal::FieldMask&): Assertion `targets.size() == 1' failed.

The only change is that the wrapper task is now control replicated as well. I am unable to reproduce this error on Sapling.

Full backtrace. This backtrace is weird because it does not show the failing assertion; I'm not sure why that's the case or how to fix it.

lightsighter commented 5 months ago

You captured the backtraces on the wrong process. Please capture them from the right process where the assertion actually occurred. Make sure you get line numbers too.

lightsighter commented 5 months ago

I'll also note that if you're hitting this assertion while using control replication, you're doing a very bad job at mapping: it means you're relying on remote mapping when you should be picking better sharding functors so that you never need remote mapping at all.

lightsighter commented 5 months ago

Get a proper backtrace with line numbers, report it here, and then pull the most recent control replication and confirm whether it is fixed or not.

rupanshusoi commented 5 months ago

New backtrace.

With the latest control replication, the application seems to hang toward the end.

lightsighter commented 5 months ago

That backtrace makes sense for the error from yesterday.

I need a reproducer for the hang as soon as possible.

rupanshusoi commented 5 months ago

The error reproduces on Perlmutter, but not on Sapling. I can make you a reproducer on Perlmutter, but I see you don't currently have an account in our group allocation. What would you suggest?

lightsighter commented 5 months ago

A reproducer on Perlmutter is not going to work even if I had an account: without sudo access, I can't attach gdb to processes that you create on that machine.

Run with `-ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 1` and get backtraces of all threads from every process. Make sure they are not changing. If the hang does not reproduce with `-lg:inorder` then you can remove it, but you must keep `-ll:force_kthreads` and `-lg:safe_ctrlrepl 1`.
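For concreteness, a minimal sketch of capturing all-thread backtraces with gdb. The PID and command construction here are placeholders, not from the thread; you would attach to each Legion process on each node:

```shell
# Sketch: dump every thread's backtrace from a running process via gdb.
# PID is a placeholder; find the real one with e.g. `pgrep circuit`.
# Repeat on every node, and capture again a few minutes later to check
# that the stacks are not changing.
PID=12345
CMD="gdb -p $PID -batch -ex 'set pagination off' -ex 'thread apply all bt'"
echo "$CMD"   # dry run: prints the command instead of attaching
```

Removing the `echo` (or piping `$CMD` through `eval`) attaches for real; redirect the output to a per-node file so the captures can be compared later.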

rupanshusoi commented 5 months ago

The hang does reproduce with `-lg:inorder`. I captured two sets of backtraces a few minutes apart. They are mostly the same, so you will have to judge whether this is actually a hang or something else.

First set: bt1.txt, bt2.txt Second set: bt1-new.txt, bt2-new.txt
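One way to judge whether the stacks are actually frozen is to diff only the frame lines across the two captures. A hedged sketch, using synthetic sample files standing in for the attached `bt*.txt`:

```shell
# Create two tiny sample captures (stand-ins for the real bt1.txt /
# bt1-new.txt attachments; the frame format is simplified).
printf '#0 foo ()\n#1 bar ()\n' > bt_a.txt
printf '#0 foo ()\n#1 bar ()\n' > bt_b.txt
# Keep only the stack-frame lines so PIDs/addresses in the headers
# don't cause spurious diffs.
grep '^#' bt_a.txt > frames_a.txt
grep '^#' bt_b.txt > frames_b.txt
# An empty diff between captures taken minutes apart suggests a real hang.
if diff -q frames_a.txt frames_b.txt >/dev/null; then
  echo "stacks identical: likely hung"
else
  echo "stacks changed: still making progress"
fi
```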

lightsighter commented 5 months ago

Pull and try again. If it continues to hang please get new backtraces with the same options as before.

rupanshusoi commented 5 months ago

Now it hits an assertion instead of hanging. Full backtrace. This run was with all three flags you mentioned last time.

lightsighter commented 5 months ago

Pull and try again.

rupanshusoi commented 5 months ago

Another assertion failure.

lightsighter commented 5 months ago

Pull and try again.

rupanshusoi commented 5 months ago

Now it's hanging again: bt2.txt, bt1.txt

lightsighter commented 5 months ago

At this point I have to have a reproducer that I can look at. I promise you this will happen on Sapling if you inject enough noise into the execution. If you don't know how to do that, then just make the reproducer on Sapling and tell me how to run it.

rupanshusoi commented 5 months ago

It's available here: http://sapling2.stanford.edu/~rupanshu/bug1618/

You can do `sbatch --nodes 2 sbatch_circuit.sh` to see the error. You might have to manually change the location of `regent.py` depending on your setup.

You can see the modifications in this version by grepping for `wrapper`. This task creates copies of every region that exists in the top-level task, then executes a fraction of the main inner loop once or twice (depending on a compile-time parameter). If it executes the loop twice, it will restore every region from its copy beforehand. Let me know if you need more information.

I've updated this directory with the new code.

lightsighter commented 5 months ago

Which version of GASNetEX are you using on both sapling and perlmutter?

rupanshusoi commented 5 months ago

Actually, I think it is the same on both: GASNet-2023.3.0. My `legion/language/gasnet` has a sub-directory called `GASNet-2023.3.0` in both installations. Does this confirm the GASNet version, or is there another way to check?

elliottslaughter commented 5 months ago

Keep in mind that even with the same GASNet version we're still talking about different networks, so timing differences are always possible.

lightsighter commented 5 months ago

We already figured this one out. It's not GASNet.

lightsighter commented 5 months ago

@rupanshusoi Please pull the most recent control replication branch and confirm that the hang is fixed (test it without the temporary fix that I gave you). If it works then you can close the issue.

rupanshusoi commented 5 months ago

I'm still seeing the same hang with the latest control replication. I ran with all three flags like last time.

lightsighter commented 5 months ago

Pull the latest control replication and try again.

rupanshusoi commented 4 months ago

Fixed.