StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent/libraries: Choosing multiple util threads with ll:util results in crashes #811

Closed LonelyCat124 closed 4 years ago

LonelyCat124 commented 4 years ago

I have a small test code whose performance is currently limited by the runtime unless -ll:util is given enough threads to keep the CPU threads supplied with work. However, providing high values for -ll:util (8 or 16) often leads to crashes during execution (single-node runs).

Some examples on a larger testcase: legion/language/regent.py algorithm.rg -ll:cpu 16 -ll:util 8

...
[0x2b614c000708]
/PATH/TO//legion/bindings/regent/libregent.so(_ZN6Legion8Internal10RegionNode10visit_nodeEPNS0_13NodeTraverserE+0x29c) [0x2b60f887ce7c]
/PATH/TO//legion/bindings/regent/libregent.so(_ZN6Legion8Internal14RegionTreeNode25register_logical_deletionEjRKNS0_11LogicalUserERK12AVXTLBitMaskILj512EERNS0_14RegionTreePathERKNS0_16LogicalTraceInfoERS6_+0x3c5) [0x2b60f8891b95]
/PATH/TO//legion/bindings/regent/libregent.so(_ZN6Legion8Internal16RegionTreeForest25perform_deletion_analysisEPNS0_10DeletionOpEjRNS_17RegionRequirementERNS0_14RegionTreePathE+0x140) [0x2b60f8852350]
/PATH/TO//legion/bindings/regent/libregent.so(_ZN6Legion8Internal10DeletionOp27trigger_dependence_analysisEv+0x23d) [0x2b60f85e126d]
/PATH/TO//legion/bindings/regent/libregent.so(_ZTVN6Legion8Internal10DeletionOpE+0x140) [0x2b60fa24d1a0]
/PATH/TO//legion/bindings/regent/libregent.so(_ZN6Legion8Internal10DeletionOpD1Ev+0) [0x2b60f85e0c50]
[0xba01490b11358d48]

legion/language/regent.py algorithm.rg -ll:cpu 32 -ll:util 16

....
terra: /PATH/TO/legion/runtime/realm/rsrv_impl.cc:437: Realm::Event Realm::ReservationImpl::acquire(unsigned int, bool, Realm::ReservationImpl::AcquireType, Realm::Event): Assertion `(NodeID(ID(me).rsrv_creator_node()) != Network::my_node_id) || in_use' failed.

On the standard-size testcase: legion/language/regent.py algorithm.rg -ll:cpu 2 -ll:util 8

terra: /PATH/TO/legion/runtime/realm/rsrv_impl.cc:1511: void Realm::FastReservation::unlock_slow(): Assertion `reader_count > 0' failed

Legion branch: master. Installed with ./install.py --openmp. Git commit: 04062d54930aebf9b01a720a4172939a67916984

The code I'm running is available at https://github.com/stfc/PSycloneBench (branch regent_version, commit d7b19022d03de1cf388228b578712689279b1e87). Running on a 32-core Broadwell chip with hyperthreading enabled.

lightsighter commented 4 years ago

Can you get a backtrace with line numbers from a debugger when running with the runtime compiled in debug mode, or at least with -O2 -g? It's also worth noting that beyond about 4 utility processors per node you're unlikely to see speedups in runtime analysis work. If you want to improve runtime overhead, your best bet will be to coarsen your tasks or to employ tracing.

LonelyCat124 commented 4 years ago

I rebuilt regent with ./install.py --openmp --debug --clean, and was only able to produce the error:

terra: /PATH/TO/legion/runtime/realm/rsrv_impl.cc:1511: void Realm::FastReservation::unlock_slow(): Assertion `reader_count > 0' failed

The core files produced have the following trace at the point they are dumped:

#0  0x00002b9567e79207 in raise () from /lib64/libc.so.6
#1  0x00002b9567e7a8f8 in abort () from /lib64/libc.so.6
#2  0x00002b9567e72026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002b9567e720d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00002b956aa70045 in Realm::FastReservation::unlock_slow (this=0x2b9c3de0eb14)
    at /PATH/TO/legion/runtime/realm/rsrv_impl.cc:1511
#5  0x00002b9569e82126 in Realm::FastReservation::unlock (this=0x2b9c3de0eb14)
    at /PATH/TO/legion/runtime/realm/reservation.inl:177
#6  0x00002b9569e82532 in Legion::Internal::LocalLock::unlock (this=0x2b9c3de0eb14)
    at /PATH/TO/legion/runtime/legion/legion_types.h:2136
#7  0x00002b9569e826ca in Legion::Internal::AutoLock::~AutoLock (this=0x2b9c3e716500, __in_chrg=<optimized out>)
    at /PATH/TO/legion/runtime/legion/legion_types.h:2206
#8  0x00002b956a1a9767 in Legion::Internal::DynamicTable<Legion::Internal::DynamicTableAllocator<Legion::Internal::LogicalState, 10ul, 8ul> >::lookup_entry<Legion::Internal::RegionTreeNode*, unsigned int> (this=0x2b9c2eb76dd0, index=2,
    arg1=@0x2b9c3e716578: 0x2b9c2eb76c40, arg2=@0x2b9c3e716564: 2)
    at /PATH/TO/legion/runtime/legion/legion_utilities.h:1695
#9  0x00002b956a1a8c8c in Legion::Internal::RegionTreeNode::get_logical_state (this=0x2b9c2eb76c40, ctx=2)
    at /PATH/TO/legion/runtime/legion/region_tree.h:3060
#10 0x00002b956a19da77 in Legion::Internal::RegionTreeNode::register_logical_dependences (this=0x2b9c2eb76c40, ctx=2,
    op=0x2b961c4349b0, field_mask=..., dominate=false)
    at /PATH/TO/legion/runtime/legion/region_tree.cc:13822
#11 0x00002b956a089dd5 in Legion::Internal::LogicalRegistrar::visit_region (this=0x2b9c3e716ae0, node=0x2b9c2eb76c40)
    at /PATH/TO/legion/runtime/legion/legion_analysis.cc:1839
#12 0x00002b956a1a14b3 in Legion::Internal::RegionNode::visit_node (this=0x2b9c2eb76c40, traverser=0x2b9c3e716ae0)
    at /PATH/TO/legion/runtime/legion/region_tree.cc:15052
#13 0x00002b956a1a5932 in Legion::Internal::PartitionNode::visit_node (this=0x2b95e00c8e20, traverser=0x2b9c3e716ae0)
    at /PATH/TO/legion/runtime/legion/region_tree.cc:16302
#14 0x00002b956a1a1754 in Legion::Internal::RegionNode::visit_node (this=0x2b95c400ccd0, traverser=0x2b9c3e716ae0)
    at /PATH/TO/legion/runtime/legion/region_tree.cc:15089
#15 0x00002b956a19e0f5 in Legion::Internal::RegionTreeNode::register_logical_deletion (this=0x2b95c400ccd0, ctx=2,
    user=..., check_mask=..., path=..., trace_info=..., already_closed_mask=...)
    at /PATH/TO/legion/runtime/legion/region_tree.cc:13970
#16 0x00002b956a161edc in Legion::Internal::RegionTreeForest::perform_deletion_analysis (this=0x7276650,
    op=0x2b961c4349b0, idx=0, req=..., path=...)
    at /PATH/TO/legion/runtime/legion/region_tree.cc:1327
#17 0x00002b9569e562c8 in Legion::Internal::DeletionOp::trigger_dependence_analysis (this=0x2b961c4349b0)
    at /PATH/TO/legion/runtime/legion/legion_ops.cc:7905
#18 0x00002b9569e34254 in Legion::Internal::Operation::execute_dependence_analysis (this=0x2b961c4349b0)
    at /PATH/TO/legion/runtime/legion/legion_ops.cc:377
#19 0x00002b9569f56910 in Legion::Internal::InnerContext::process_dependence_stage (this=0x2b9c2f48b770)
    at /PATH/TO/legion/runtime/legion/legion_context.cc:6440
#20 0x00002b9569f5ff02 in Legion::Internal::InnerContext::handle_dependence_stage (args=0x2b95e807de40)
    at /PATH/TO/legion/runtime/legion/legion_context.cc:8593
#21 0x00002b956a251a3f in Legion::Internal::Runtime::legion_runtime_task (args=0x2b95e807de40, arglen=12,
    userdata=0x64e20d0, userlen=8, p=...)
    at /PATH/TO/legion/runtime/legion/runtime.cc:22359
#22 0x00002b956aa7a748 in Realm::LocalTaskProcessor::execute_task (this=0x5f8f220, func_id=4, task_args=...)
    at /PATH/TO/legion/runtime/realm/proc_impl.cc:981
#23 0x00002b956a8df148 in Realm::Task::execute_on_processor (this=0x2b95e807dca0, p=...)
    at /PATH/TO/legion/runtime/realm/tasks.cc:199
#24 0x00002b956a8e39c4 in Realm::UserThreadTaskScheduler::execute_task (this=0x6f6a6f0, task=0x2b95e807dca0)
    at /PATH/TO/legion/runtime/realm/tasks.cc:1431
#25 0x00002b956a8e1abf in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x6f6a6f0)
    at /PATH/TO/legion/runtime/realm/tasks.cc:935
#26 0x00002b956a8e8254 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x6f6a6f0)
    at /PATH/TO/legion/runtime/realm/threads.inl:97
#27 0x00002b956a8bd3a4 in Realm::UserThread::uthread_entry ()
    at /PATH/TO/legion/runtime/realm/threads.cc:1036
#28 0x00002b9567e8b010 in ?? () from /lib64/libc.so.6
#29 0x0000000000000000 in ?? ()

Regarding task coarsening, what would be an appropriate runtime for a task? From profiling, it looks like the runtime overheads are ~2000us (2ms) per task, but for the OpenMP version of the standard-size benchmark a full iteration takes ~1.13ms (the best I've achieved with Regent so far is around 3.5ms/step, using 8 utility threads and 16 worker threads with __demand(__index_launch, __trace)). I'm assuming that __trace works with partitions based on the same region/index space regardless of whether the partition is recreated?

lightsighter commented 4 years ago

Does this error happen only near the end of the run or somewhere in the middle?

Regarding task coarsening, what would be an appropriate runtime for a task?

Currently we recommend that your tasks average about 1ms of work in the general case. This doesn't mean they all have to be at least 1ms; you can tolerate a few smaller tasks if, for example, you also have a 10ms task. If you start to use tracing, you can push overheads down to about 250us per task at the moment.
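The averaging rule above can be checked with quick back-of-the-envelope arithmetic (the task durations here are hypothetical, chosen only to illustrate the guideline):

```python
# Hypothetical per-task runtimes in microseconds: a few small tasks are
# tolerable as long as the average stays near the ~1ms (1000us) guideline.
task_us = [250, 250, 250, 250, 10000]  # four 250us tasks plus one 10ms task

avg_us = sum(task_us) / len(task_us)   # (1000 + 10000) / 5 = 2200.0
assert avg_us >= 1000  # average work per task meets the 1ms guideline
```

Even though four of the five tasks are well under 1ms, the single 10ms task pulls the average to 2.2ms, which is why a mixed workload can still amortize runtime overhead.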

I'm assuming that using __trace works with partitions based on the same region/index space regardless of if the partition is recreated?

Tracing will have to recapture a trace every time you make a new partition. If you make a new partition every iteration then we won't ever be able to replay a trace.
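A minimal sketch of that recapture behavior (an illustrative model only, not Regent's actual tracing implementation): a trace can only be replayed when the captured operation sequence matches exactly, so a fresh partition each iteration forces a recapture every time.

```python
# Illustrative model of trace capture/replay: a trace replays only if the
# operations (including which partition they launch over) match the capture.
captured = None
recaptures = 0

def run_iteration(partition_id):
    """Run one traced iteration over the given (hypothetical) partition."""
    global captured, recaptures
    ops = ("index_launch", partition_id)
    if captured != ops:
        captured = ops      # sequence changed: recapture the trace
        recaptures += 1
    # otherwise: replay the existing trace at low overhead

run_iteration("p0"); run_iteration("p0"); run_iteration("p0")
assert recaptures == 1      # same partition reused: captured once, then replayed

run_iteration("p1"); run_iteration("p2")
assert recaptures == 3      # new partition each iteration: recapture every time
```

This is why hoisting partition creation out of the iteration loop matters when combining it with tracing.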

LonelyCat124 commented 4 years ago

I think it's pretty early/middle; the larger testcase with the code causing the crashes was taking ~1000s, and I'd guess it's a minute or two into the run.

Thanks for the info; I'd misunderstood how tracing works, then.

streichler commented 4 years ago

Using this command line:

legion/language/regent.py algorithm.rg -ll:cpu 2 -ll:util 8

I can reproduce this failure:

terra: /PATH/TO/legion/runtime/realm/rsrv_impl.cc:1511: void Realm::FastReservation::unlock_slow(): Assertion `reader_count > 0' failed

Realm is complaining because unlock is being called on a FastReservation that is not locked. The calling code is using scoped locks, so they should be balanced, suggesting the problem may be memory corruption.
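The balanced-lock reasoning can be sketched with a minimal scope-bound lock (this mirrors the general pattern of Legion's AutoLock, not its actual implementation): release happens exactly once on scope exit, so lock/unlock calls cannot become unbalanced through normal control flow.

```python
import threading

class CountedLock:
    """Minimal model of a scoped lock: __exit__ always releases exactly
    what __enter__ acquired, so acquire/release are balanced by construction."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holders = 0  # instrumentation for the example only

    def __enter__(self):
        self._lock.acquire()
        self.holders += 1
        return self

    def __exit__(self, *exc):
        assert self.holders > 0  # the analogue of `reader_count > 0`
        self.holders -= 1
        self._lock.release()

lock = CountedLock()
with lock:                 # acquire on scope entry
    assert lock.holders == 1
assert lock.holders == 0   # release on scope exit, always balanced
```

If every unlock is paired with a lock by construction, hitting `reader_count > 0` at unlock time points at the lock's own state being clobbered, i.e. memory corruption rather than a locking-discipline bug.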

streichler commented 4 years ago

Enabling existing reservation debug code and/or running with kernel threads seems to make the failure stop happening. I think I've managed to confirm that the reservation object being unlocked is still allocated (i.e. it's not a use-after-free of the reservation itself), but that's all I have so far.

streichler commented 4 years ago

Ok, here's a backtrace where unlock is called on a FastReservation that's been deleted.

#3  0x00007ffff6492c82 in __GI___assert_fail (
    assertion=assertion@entry=0x7fffdca27080 "frs.magic == FastRsrvState::MAGIC_VALID", 
    file=file@entry=0x7fffdca26fd0 "/local/home/sean/legion/legion2/runtime/realm/rsrv_impl.cc", line=line@entry=1493, 
    function=function@entry=0x7fffdca27b60 <Realm::FastReservation::unlock_slow()::__PRETTY_FUNCTION__> "void Realm::FastReservation::unlock_slow()") at assert.c:101
#4  0x00007fffdc470009 in Realm::FastReservation::unlock_slow (this=<optimized out>)
    at /local/home/sean/legion/legion2/runtime/realm/rsrv_impl.cc:1493
#5  0x00007fffdbeb4e64 in Legion::Internal::LocalLock::unlock (this=<optimized out>)
    at /local/home/sean/legion/legion2/runtime/legion/legion_types.h:2136
#6  Legion::Internal::AutoLock::~AutoLock (this=0x7fff8ce4b110, __in_chrg=<optimized out>)
    at /local/home/sean/legion/legion2/runtime/legion/legion_types.h:2206
#7  Legion::Internal::DynamicTable<Legion::Internal::DynamicTableAllocator<Legion::Internal::LogicalState, 10ul, 8ul> >::lookup_leaf (this=0x7fff94dace50, index=7) at /local/home/sean/legion/legion2/runtime/legion/legion_utilities.h:1723
#8  0x00007fffdbeb4f4a in Legion::Internal::DynamicTable<Legion::Internal::DynamicTableAllocator<Legion::Internal::LogicalState, 10ul, 8ul> >::lookup_entry<Legion::Internal::RegionTreeNode*, unsigned int> (this=<optimized out>, 
    index=<optimized out>, arg1=@0x7fff8ce4b1e0: 0x7fff94daccc0, arg2=@0x7fff8ce4b1dc: 7, 
    arg2@entry=@0x7fff8ce4b1dc: <optimized out>) at /local/home/sean/legion/legion2/runtime/legion/legion_utilities.h:1683
#9  0x00007fffdbe6ca08 in Legion::Internal::RegionTreeNode::get_logical_state (ctx=7, this=<optimized out>)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.h:3060
#10 Legion::Internal::RegionTreeNode::register_logical_dependences (this=<optimized out>, ctx=<optimized out>, 
    op=0x7fff78011e20, field_mask=..., dominate=false)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.cc:13822
#11 0x00007fffdbde0a85 in Legion::Internal::LogicalRegistrar::visit_region (this=<optimized out>, node=<optimized out>)
    at /local/home/sean/legion/legion2/runtime/legion/legion_analysis.cc:1839
#12 0x00007fffdbea2850 in Legion::Internal::RegionNode::visit_node (this=0x7fff94daccc0, traverser=0x7fff8ce4b620)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.cc:15052
#13 0x00007fffdbeab491 in Legion::Internal::PartitionNode::visit_node (this=<optimized out>, traverser=0x7fff8ce4b620)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.cc:16302
#14 0x00007fffdbea2958 in Legion::Internal::RegionNode::visit_node (this=<optimized out>, traverser=0x7fff8ce4b620)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.cc:15089
#15 0x00007fffdbe9df7e in Legion::Internal::RegionTreeNode::register_logical_deletion (this=this@entry=0x7fff80012880, 
    ctx=ctx@entry=7, user=..., check_mask=..., path=..., trace_info=..., already_closed_mask=...)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.cc:13970
#16 0x00007fffdbea480f in Legion::Internal::RegionTreeForest::perform_deletion_analysis (this=<optimized out>, 
    op=op@entry=0x7fff78011e20, idx=idx@entry=0, req=..., path=...)
    at /local/home/sean/legion/legion2/runtime/legion/region_tree.cc:1327
#17 0x00007fffdbcb35b1 in Legion::Internal::DeletionOp::trigger_dependence_analysis (this=0x7fff78011e20)
    at /local/home/sean/legion/legion2/runtime/legion/legion_ops.cc:7905
#18 0x00007fffdbcb91c4 in Legion::Internal::Operation::execute_dependence_analysis (this=0x7fff78011e20)
    at /local/home/sean/legion/legion2/runtime/legion/legion_ops.cc:377
#19 0x00007fffdbd44a0c in Legion::Internal::InnerContext::process_dependence_stage (this=0x7fff89029b00)
    at /local/home/sean/legion/legion2/runtime/legion/legion_context.cc:6440
#20 0x00007fffdbd44c39 in Legion::Internal::InnerContext::handle_dependence_stage (args=args@entry=0x7fff8428a250)
    at /local/home/sean/legion/legion2/runtime/legion/legion_context.cc:8593
#21 0x00007fffdbf23608 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fff8428a250, arglen=12, 
    userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /local/home/sean/legion/legion2/runtime/legion/runtime.cc:22359

@lightsighter , does this suggest anything to you?

lightsighter commented 4 years ago

I've still been unable to reproduce this, but looking at the backtrace, I think this is a variation of the issue that prompted #812. It looks like a duplicated deletion of the region tree, probably via a transitive deletion that @LonelyCat124 was not expecting. Sometimes the runtime is able to detect the duplicate deletion and prevent it, but it's literally impossible for it to get every single case right and the result is undefined behavior like here where the region tree has been deleted while someone else is still trying to traverse it to do their own deletion. I'm inclined to say that we should resolve issue #812 first, update the deletions in the code to match whatever the new semantics are, and then see if the bug reproduces.
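A hypothetical sketch of the hazard being described (illustrative only, not Legion's actual mechanism): two independent deletion paths race to tear down the same tree node, and a detection scheme only helps in the cases it manages to see.

```python
# Two deletion requests target the same (hypothetical) region tree node: one
# explicit, one arriving transitively via a parent's deletion. If detection
# misses, the second path traverses freed state -- undefined behavior.
live_nodes = {"region_tree_root"}

def delete_node(name):
    """Delete a node, flagging duplicates when they can be detected."""
    if name not in live_nodes:
        return "duplicate-deletion-detected"  # the lucky, detectable case
    live_nodes.remove(name)
    return "deleted"

assert delete_node("region_tree_root") == "deleted"
# The duplicated (e.g. transitive) deletion arrives after the node is gone:
assert delete_node("region_tree_root") == "duplicate-deletion-detected"
```

When the duplicate slips through instead of being detected, the second traversal runs over a deleted tree, which is consistent with the corrupted-lock crashes above.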

lightsighter commented 4 years ago

#812 now has a fix in the master branch. @LonelyCat124 please pull the master branch and try your code again to see if this issue has been resolved.

LonelyCat124 commented 4 years ago

This seems to be fixed by that.