StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: Code hang #929

Open LonelyCat124 opened 4 years ago

LonelyCat124 commented 4 years ago

I have a timestep-based code in Regent, which on my current testcase of interest hangs after 5 timesteps, but appears not to do so on a smaller testcase.

I've tried:

  1. Compiling with debug mode - no difference/errors
  2. Running the smaller testcase with -fbounds-check (no errors)
  3. Running with freeze on error
  4. Running with in order execution

I can't remember if I ran with -lg:partcheck yet so I've set that now.

I pulled two stack traces from gdb at the point the code seemed to be frozen (around 12 hours apart). Most threads appear to be in Realm::CondVar::wait, with what seems to be a couple of threads acquiring an AutoLock and another thread in Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry.

I had one run crash immediately with:

terra: /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:16345: void Legion::Internal::RegionNode::add_child(Legion::Internal::PartitionNode*): Assertion `color_map.find(child->row_source->color) == color_map.end()' failed

but have not been able to replicate that error so have no idea of the cause.

Is there anything else I should try to work out what could be causing this? The smaller testcase doesn't necessarily produce that many fewer tasks, but could large numbers of tasks potentially cause hangs (it sounds unlikely)?

LonelyCat124 commented 4 years ago

-lg:partcheck hasn't found any issues up to the point the code is freezing.

lightsighter commented 4 years ago

Please provide backtraces for the threads NOT in Realm::CondVar::wait.

LonelyCat124 commented 4 years ago

Sure, here they are:

Thread 33 (Thread 0x7f49ad93b780 (LWP 36417)):
#0  0x00007f49af682524 in std::operator& (__m=32569, __mod=2965800160) at /netfs/smain01/scafellpike/local/apps/gcc9/9.3.0/include/c++/9.3.0/bits/atomic_base.h:100
#1  0x00007f49af68a3c8 in compare_exchange_strong (__m2=std::memory_order_acquire, __m1=std::memory_order_acq_rel, __i2=134217728, __i1=@0x7f39b0c673f8: 268435456,
    this=0xdb9b0b0) at /netfs/smain01/scafellpike/local/apps/gcc9/9.3.0/include/c++/9.3.0/bits/atomic_base.h:496
#2  compare_exchange_strong (__m=std::memory_order_acq_rel, __i2=134217728, __i1=@0x7f39b0c673f8: 268435456, this=0xdb9b0b0)
    at /netfs/smain01/scafellpike/local/apps/gcc9/9.3.0/include/c++/9.3.0/bits/atomic_base.h:527
#3  Realm::atomic<unsigned int>::compare_exchange (this=0xdb9b0b0, expected=@0x7f39b0c673f8: 268435456, newval=134217728)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/atomics.inl:154
#4  0x00007f49b02b78b4 in Realm::FastReservation::wrlock_slow (this=0xdb9b0b0, mode=Realm::FastReservation::SPIN)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/rsrv_impl.cc:960
#5  0x00007f49af6825fa in Realm::FastReservation::wrlock (this=0xdb9b0b0, mode=Realm::FastReservation::SPIN)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/reservation.inl:70
#6  0x00007f49af682b1b in Legion::Internal::LocalLock::wrlock (this=0xdb9b0b0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_types.h:2172
#7  0x00007f49af682c3b in Legion::Internal::AutoLock::AutoLock (this=0x7f39b0c677b0, r=..., mode=0, excl=true)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_types.h:2208
#8  0x00007f49af9a7735 in Legion::Internal::RegionTreeForest::subtract_index_spaces (this=0xdb9afa0, lhs=0x7f3a8684b280, rhs=0x7f3a8684b280, creator=0x0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5715
#9  0x00007f49af8d82a3 in Legion::Internal::EquivalenceSet::ray_trace_equivalence_sets (this=0x7f4474ecf520, target=0x7f3ae6e268c0, expr=0x7f3a8684b280, ray_mask=...,
    handle=..., source=0, trace_done=..., deferral_event=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:9140
#10 0x00007f49af8d918b in Legion::Internal::EquivalenceSet::ray_trace_equivalence_sets (this=0x7f4584cad660, target=0x7f3ae6e268c0, expr=0x7f3aa9d35b80, ray_mask=...,
    handle=..., source=0, trace_done=..., deferral_event=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:9336
#11 0x00007f49af8f247f in Legion::Internal::EquivalenceSet::handle_ray_trace (args=0x7f3b7de60460, runtime=0xdb946c0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:13552
#12 0x00007f49afa8bed9 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f3b7de60460, arglen=100, userdata=0xdb6f220, userlen=8, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/runtime.cc:24686
#13 0x00007f49b02c2591 in Realm::LocalTaskProcessor::execute_task (this=0xd9ca9b0, func_id=4, task_args=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/proc_impl.cc:1099
#14 0x00007f49b0106fd2 in Realm::Task::execute_on_processor (this=0x7f3a1d4bf7f0, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:306
#15 0x00007f49b010bc24 in Realm::UserThreadTaskScheduler::execute_task (this=0xd9cab60, task=0x7f3a1d4bf7f0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1648
#16 0x00007f49b0109dda in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xd9cab60)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1129
#17 0x00007f49b01110ba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xd9cab60)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.inl:97
#18 0x00007f49b00e4275 in Realm::UserThread::uthread_entry ()
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.cc:1158
#19 0x00007f49b1435010 in ?? () from /lib64/libc.so.6
#20 0x0000000000000000 in ?? ()

Thread 32 (Thread 0x7f49ad904780 (LWP 36418)):
#0  Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry (this=0xdb869e0, index=16777216, owner=0,
    free_list=0xd9c9960)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:202
#1  0x00007f49afe9badc in Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::alloc_entry (this=0xd9c9960)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:333
#2  0x00007f49afe8f1b8 in Realm::RuntimeImpl::get_available_sparsity_impl (this=0xdabcd50, target_node=0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/runtime_impl.cc:2406
#3  0x00007f49b02417e6 in Realm::IntersectionOperation<1, long long>::add_intersection (this=0x7f3a6988eaa0, ops=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/deppart/setops.cc:1682
#4  0x00007f49b024413b in Realm::IndexSpace<1, long long>::compute_intersection (subspaces=..., result=..., reqs=..., wait_on=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/deppart/setops.cc:516
#5  0x00007f49afb769b5 in Legion::Internal::IndexSpaceIntersection<1, long long>::IndexSpaceIntersection (this=0x7f3a6988de60, to_inter=..., ctx=0xdb9afa0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.inl:1739
#6  0x00007f49afa2ba1d in Legion::Internal::IntersectionOpCreator::demux<Realm::DynamicTemplates::Int<1>, long long> (creator=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.h:1459
#7  0x00007f49afa29f04 in Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >::demux<long long, Legion::Internal::IntersectionOpCreator*> (arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:266
#8  0x00007f49afa27681 in Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm>::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, 2>::demux<Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:37
#9  0x00007f49afa25264 in Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> >::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, 1>::demux<Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:39
#10 0x00007f49afa20a96 in Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > >::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, 0>::demux<Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:39
#11 0x00007f49afa1b616 in Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > >::demux<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:79
#12 0x00007f49afa0f5d9 in Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1<Legion::Internal::IntersectionOpCreator>::demux<Realm::DynamicTemplates::Int<1>, Legion::Internal::IntersectionOpCreator*> (tag=258, arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:291
#13 0x00007f49af9fadd7 in Realm::DynamicTemplates::IntList<1, 3>::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1<Legion::Internal::IntersectionOpCreator>, 1>::demux<unsigned int, Legion::Internal::IntersectionOpCreator*> (index=1, arg1=258, arg2=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:158
#14 0x00007f49af9ed775 in Realm::DynamicTemplates::IntList<1, 3>::demux<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1<Legion::Internal::IntersectionOpCreator>, unsigned int, Legion::Internal::IntersectionOpCreator*> (index=1, arg1=258, arg2=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:232
#15 0x00007f49af9ddc9a in Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::demux<Legion::Internal::IntersectionOpCreator, Legion::Internal::IntersectionOpCreator*> (tag=258, arg1=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:317
#16 0x00007f49af9db911 in Legion::Internal::IntersectionOpCreator::create_operation (this=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.h:1464
#17 0x00007f49af9aa564 in Legion::Internal::OperationCreator::consume (this=0x7f39cd3fdd20)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:6454
#18 0x00007f49af9ab57c in Legion::Internal::ExpressionTrieNode::find_or_create_operation (this=0x7f3a698909a0, expressions=..., creator=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:6687
#19 0x00007f49af9a6fb5 in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0xdb9afa0, expressions=..., creator=0x0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5616
#20 0x00007f49af9a69fa in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0xdb9afa0, lhs=0x7f39ca4a1b00, rhs=0x7f459c01b940)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5537
#21 0x00007f49af8d7d82 in Legion::Internal::EquivalenceSet::ray_trace_equivalence_sets (this=0x7f44eb615700, target=0x7f3ae6e28980, expr=0x7f39ca4a1b00, ray_mask=...,
    handle=..., source=0, trace_done=..., deferral_event=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:9073
#22 0x00007f49af8f247f in Legion::Internal::EquivalenceSet::handle_ray_trace (args=0x7f3e4a3d7e50, runtime=0xdb946c0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:13552
#23 0x00007f49afa8bed9 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f3e4a3d7e50, arglen=100, userdata=0xda71540, userlen=8, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/runtime.cc:24686
#24 0x00007f49b02c2591 in Realm::LocalTaskProcessor::execute_task (this=0xd9caf70, func_id=4, task_args=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/proc_impl.cc:1099
#25 0x00007f49b0106fd2 in Realm::Task::execute_on_processor (this=0x7f3a637e7350, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:306
#26 0x00007f49b010bc24 in Realm::UserThreadTaskScheduler::execute_task (this=0xd9cb160, task=0x7f3a637e7350)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1648
#27 0x00007f49b0109dda in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xd9cb160)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1129
#28 0x00007f49b01110ba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xd9cb160)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.inl:97
#29 0x00007f49b00e4275 in Realm::UserThread::uthread_entry ()
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.cc:1158
#30 0x00007f49b1435010 in ?? () from /lib64/libc.so.6
#31 0x0000000000000000 in ?? ()

Thread 31 (Thread 0x7f49a83a9780 (LWP 36419)):
#0  load (__m=std::memory_order_acquire, this=0xdb9b0b0) at /netfs/smain01/scafellpike/local/apps/gcc9/9.3.0/include/c++/9.3.0/bits/atomic_base.h:419
#1  Realm::atomic<unsigned int>::load_acquire (this=0xdb9b0b0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/atomics.inl:73
#2  0x00007f49b02b7f2b in Realm::FastReservation::rdlock_slow (this=0xdb9b0b0, mode=Realm::FastReservation::SPIN)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/rsrv_impl.cc:1176
#3  0x00007f49af68268e in Realm::FastReservation::rdlock (this=0xdb9b0b0, mode=Realm::FastReservation::SPIN)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/reservation.inl:114
#4  0x00007f49af682b59 in Legion::Internal::LocalLock::rdlock (this=0xdb9b0b0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_types.h:2173
#5  0x00007f49af682c66 in Legion::Internal::AutoLock::AutoLock (this=0x7f39c558c860, r=..., mode=1, excl=false)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_types.h:2213
#6  0x00007f49af9a6e8d in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0xdb9afa0, expressions=..., creator=0x0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5603
#7  0x00007f49af9a69fa in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0xdb9afa0, lhs=0x7f39fe754320, rhs=0x7f459c0aa240)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5537
#8  0x00007f49af8d7d82 in Legion::Internal::EquivalenceSet::ray_trace_equivalence_sets (this=0x7f44eb744a60, target=0x7f3aac43f460, expr=0x7f39fe754320, ray_mask=...,
    handle=..., source=0, trace_done=..., deferral_event=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:9073
#9  0x00007f49af8f247f in Legion::Internal::EquivalenceSet::handle_ray_trace (args=0x7f407eda2c50, runtime=0xdb946c0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:13552
#10 0x00007f49afa8bed9 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f407eda2c50, arglen=100, userdata=0xdb6f460, userlen=8, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/runtime.cc:24686
#11 0x00007f49b02c2591 in Realm::LocalTaskProcessor::execute_task (this=0xda99dc0, func_id=4, task_args=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/proc_impl.cc:1099
#12 0x00007f49b0106fd2 in Realm::Task::execute_on_processor (this=0x7f39c82b5490, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:306
#13 0x00007f49b010bc24 in Realm::UserThreadTaskScheduler::execute_task (this=0xda99fb0, task=0x7f39c82b5490)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1648
#14 0x00007f49b0109dda in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xda99fb0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1129
#15 0x00007f49b01110ba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xda99fb0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.inl:97
#16 0x00007f49b00e4275 in Realm::UserThread::uthread_entry ()
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.cc:1158
#17 0x00007f49b1435010 in ?? () from /lib64/libc.so.6
#18 0x0000000000000000 in ?? ()

Thread 30 (Thread 0x7f45b0611780 (LWP 36420)):
#0  Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry (this=0xdb869e0, index=16777232, owner=0,
    free_list=0xd9c9960)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:202
#1  0x00007f49afe9badc in Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::alloc_entry (this=0xd9c9960)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:333
#2  0x00007f49afe8f1b8 in Realm::RuntimeImpl::get_available_sparsity_impl (this=0xdabcd50, target_node=0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/runtime_impl.cc:2406
#3  0x00007f49b02417e6 in Realm::IntersectionOperation<1, long long>::add_intersection (this=0x7f39b8d800b0, ops=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/deppart/setops.cc:1682
#4  0x00007f49b024413b in Realm::IndexSpace<1, long long>::compute_intersection (subspaces=..., result=..., reqs=..., wait_on=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/deppart/setops.cc:516
#5  0x00007f49afb769b5 in Legion::Internal::IndexSpaceIntersection<1, long long>::IndexSpaceIntersection (this=0x7f39b8d8a900, to_inter=..., ctx=0xdb9afa0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.inl:1739
#6  0x00007f49afa2ba1d in Legion::Internal::IntersectionOpCreator::demux<Realm::DynamicTemplates::Int<1>, long long> (creator=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.h:1459
#7  0x00007f49afa29f04 in Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >::demux<long long, Legion::Internal::IntersectionOpCreator*> (arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:266
#8  0x00007f49afa27681 in Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm>::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, 2>::demux<Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:37
#9  0x00007f49afa25264 in Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> >::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, 1>::demux<Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:39
#10 0x00007f49afa20a96 in Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > >::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, 0>::demux<Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:39
#11 0x00007f49afa1b616 in Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > >::demux<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper2<Legion::Internal::IntersectionOpCreator, Realm::DynamicTemplates::Int<1> >, Legion::Internal::IntersectionOpCreator*> (index=2, arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:79
#12 0x00007f49afa0f5d9 in Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1<Legion::Internal::IntersectionOpCreator>::demux<Realm::DynamicTemplates::Int<1>, Legion::Internal::IntersectionOpCreator*> (tag=258, arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:291
#13 0x00007f49af9fadd7 in Realm::DynamicTemplates::IntList<1, 3>::DemuxHelper<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1<Legion::Internal::IntersectionOpCreator>, 1>::demux<unsigned int, Legion::Internal::IntersectionOpCreator*> (index=1, arg1=258, arg2=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:158
#14 0x00007f49af9ed775 in Realm::DynamicTemplates::IntList<1, 3>::demux<Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1<Legion::Internal::IntersectionOpCreator>, unsigned int, Legion::Internal::IntersectionOpCreator*> (index=1, arg1=258, arg2=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:232
#15 0x00007f49af9ddc9a in Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 3>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long, Realm::DynamicTemplates::TypeListTerm> > > >::demux<Legion::Internal::IntersectionOpCreator, Legion::Internal::IntersectionOpCreator*> (tag=258, arg1=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_templates.inl:317
#16 0x00007f49af9db911 in Legion::Internal::IntersectionOpCreator::create_operation (this=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.h:1464
#17 0x00007f49af9aa564 in Legion::Internal::OperationCreator::consume (this=0x7f39d1ce5ea0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:6454
#18 0x00007f49af9ab57c in Legion::Internal::ExpressionTrieNode::find_or_create_operation (this=0x7f39b8d89d30, expressions=..., creator=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:6687
#19 0x00007f49af9a6fb5 in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0xdb9afa0, expressions=..., creator=0x0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5616
#20 0x00007f49af9a69fa in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0xdb9afa0, lhs=0x7f39ca4abda0, rhs=0x7f459c1493b0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:5537
#21 0x00007f49af8d7d82 in Legion::Internal::EquivalenceSet::ray_trace_equivalence_sets (this=0x7f44eb626720, target=0x7f3ae6e28980, expr=0x7f39ca4abda0, ray_mask=...,
    handle=..., source=0, trace_done=..., deferral_event=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:9073
#22 0x00007f49af8f247f in Legion::Internal::EquivalenceSet::handle_ray_trace (args=0x7f3dd52d1060, runtime=0xdb946c0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/legion_analysis.cc:13552
#23 0x00007f49afa8bed9 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f3dd52d1060, arglen=100, userdata=0xdb6f6a0, userlen=8, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/runtime.cc:24686
#24 0x00007f49b02c2591 in Realm::LocalTaskProcessor::execute_task (this=0xda9a3e0, func_id=4, task_args=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/proc_impl.cc:1099
#25 0x00007f49b0106fd2 in Realm::Task::execute_on_processor (this=0x7f3a637e79f0, p=...)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:306
#26 0x00007f49b010bc24 in Realm::UserThreadTaskScheduler::execute_task (this=0xda9a5d0, task=0x7f3a637e79f0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1648
#27 0x00007f49b0109dda in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xda9a5d0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1129
#28 0x00007f49b01110ba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xda9a5d0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.inl:97
#29 0x00007f49b00e4275 in Realm::UserThread::uthread_entry ()
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.cc:1158
#30 0x00007f49b1435010 in ?? () from /lib64/libc.so.6
#31 0x0000000000000000 in ?? ()

Threads 29 down to 2 all have the following trace. I won't paste them all since they're all just waiting; I didn't check every line, but they all appear to be in Realm::CondVar::timedwait:

Thread 29 (Thread 0x7f45b0605780 (LWP 36421)):
#0  0x00007f49b241bd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f49b00dd1f0 in Realm::CondVar::timedwait (this=0xda9ae20, max_nsec=1000000000)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/mutex.cc:193
#2  0x00007f49b01082d0 in Realm::ThreadedTaskScheduler::WorkCounter::wait_for_work (this=0xda9adc8, old_counter=1357932)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:700
#3  0x00007f49b010a394 in Realm::ThreadedTaskScheduler::wait_for_work (this=0xda9abf0, old_work_counter=1357932)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1255
#4  0x00007f49b010bf18 in Realm::UserThreadTaskScheduler::wait_for_work (this=0xda9abf0, old_work_counter=1357932)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1756
#5  0x00007f49b010a262 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xda9abf0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/tasks.cc:1221
#6  0x00007f49b01110ba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xda9abf0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.inl:97
#7  0x00007f49b00e4275 in Realm::UserThread::uthread_entry ()
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/threads.cc:1158
#8  0x00007f49b1435010 in ?? () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()
LonelyCat124 commented 4 years ago

I ran other testcases, and the code seems to always reach this hang, though the time taken to reach it depends on input size. The prior run had ~550k particles, resulting in ~22k tasks per top-level launch and 5-6 top-level launches per step. A middle-sized run (~250k particles, probably ~10k tasks per top-level launch) hangs after ~150 or so timesteps.

Finally, I ran a 25k particle case, with only <=729 tasks per top-level launch, which ran for 10976 timesteps (15 hours) before hanging. The thread traces (this run had debugging off) show the same/similar problem as the large case:

 33   Thread 0x2ada4ff54780 (LWP 44853) "terra" 0x00002ada50a55ee0 in Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry(int, int, Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >*) ()
   from /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/bindings/regent/libregent.so
  32   Thread 0x2ada4ff60780 (LWP 44854) "terra" 0x00002ada50d87f51 in Realm::FastReservation::rdlock_slow(Realm::FastReservation::WaitMode) ()
   from /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/bindings/regent/libregent.so
  31   Thread 0x2ada51f7d780 (LWP 44855) "terra" 0x00002ada50d87f51 in Realm::FastReservation::rdlock_slow(Realm::FastReservation::WaitMode) ()
   from /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/bindings/regent/libregent.so
  30   Thread 0x2ada51fb4780 (LWP 44856) "terra" 0x00002ada50d87f51 in Realm::FastReservation::rdlock_slow(Realm::FastReservation::WaitMode) ()
   from /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/bindings/regent/libregent.so
  29   Thread 0x2ada52245780 (LWP 44857) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  28   Thread 0x2ada52251780 (LWP 44858) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  27   Thread 0x2ada53646780 (LWP 44859) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  26   Thread 0x2ada53652780 (LWP 44860) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  25   Thread 0x2ada5365e780 (LWP 44861) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  24   Thread 0x2ada5366a780 (LWP 44862) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  23   Thread 0x2ada53676780 (LWP 44863) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  22   Thread 0x2ada5a4ba780 (LWP 44864) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  21   Thread 0x2ada5a4c6780 (LWP 44865) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  20   Thread 0x2ada5a4d2780 (LWP 44866) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  19   Thread 0x2ada5a4de780 (LWP 44867) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  18   Thread 0x2ade4ed11780 (LWP 44868) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  17   Thread 0x2ade4ed1d780 (LWP 44869) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  16   Thread 0x2ade4ed29780 (LWP 44870) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  15   Thread 0x2ade4ed35780 (LWP 44871) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  14   Thread 0x2ade4ed41780 (LWP 44872) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  13   Thread 0x2ade4ed4d780 (LWP 44873) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  12   Thread 0x2ade4ed59780 (LWP 44874) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  11   Thread 0x2ade4ed65780 (LWP 44875) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  10   Thread 0x2ade4ed71780 (LWP 44876) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  9    Thread 0x2ade4ed7d780 (LWP 44877) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  8    Thread 0x2ade4ed89780 (LWP 44878) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  7    Thread 0x2ade4ed95780 (LWP 44879) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  6    Thread 0x2ade4eda1780 (LWP 44880) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  5    Thread 0x2ade4edad780 (LWP 44881) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  4    Thread 0x2ade4edb9780 (LWP 44882) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  3    Thread 0x2ade4edc5780 (LWP 44883) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  2    Thread 0x2ade4edd1780 (LWP 44884) "terra" 0x00002ada4eac1d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  1    Thread 0x2ada4e06f480 (LWP 44696) "terra" 0x00002ada4eac1965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0

Is it possible I'm doing something wrong in my code and leaking Legion objects in the DynamicTable, meaning they don't get reused and the tree becomes far too big to be addressable? The only problem with that idea is that the hang is somehow in

    while(index >= elems_addressable) {
      level_needed++;
      elems_addressable <<= ALLOCATOR::INNER_BITS;
    }

And the index value shown in the original trace is, in binary, 0000 0001 0000 0000 0000 0000 0001 0000, which means that as long as ALLOCATOR::INNER_BITS is < 8 it would be safely addressable (otherwise I think you could go from 0x800000/8388608, left-shift by 9 or more, and overflow back to 0; does that make sense?). Unfortunately I couldn't track down where INNER_BITS is declared, so short of rerunning a debug run on Monday I can't do more now.
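
To illustrate the failure mode I suspect, here is a minimal standalone C++ sketch of that loop (not Realm code), using LEAF_BITS = 4 and INNER_BITS = 10 from the template arguments in the trace and the index value gdb reported; a termination guard is added so the demo halts, whereas the real loop has none:

    // Minimal sketch (not Realm code) of the suspected 32-bit overflow.
    #include <cstdio>

    int main() {
      const int LEAF_BITS = 4, INNER_BITS = 10; // from the template args in the trace
      const int index = 16777216;               // 1 << 24, the value gdb reported
      int level_needed = 0;
      int elems_addressable = 1 << LEAF_BITS;   // 16
      // 16 -> 16384 -> 16777216 -> 1 << 34, which overflows a 32-bit int
      // (signed overflow is undefined behaviour; in practice it wraps to 0).
      // Once elems_addressable is 0, index >= 0 holds forever: an infinite loop.
      while (index >= elems_addressable && level_needed < 6) { // guard added for demo
        level_needed++;
        elems_addressable <<= INNER_BITS;
        printf("level %d: elems_addressable = %d\n", level_needed, elems_addressable);
      }
      return 0;
    }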

streichler commented 4 years ago

@LonelyCat124 can you confirm you're actually hung in that loop? With the debugger, can you see the values of index, level_needed and elems_addressable?

LonelyCat124 commented 4 years ago

Yeah, I'm pretty sure it's hung in that loop:

(gdb) p level_needed
$3 = -1833681769
(gdb) p elems_addressable
$4 = 0
(gdb) p index
$5 = 16777216
streichler commented 4 years ago

Yep, that's bad. Please try the following patch:

diff --git a/runtime/realm/dynamic_table.inl b/runtime/realm/dynamic_table.inl
index 8c6c441..9e3f406 100644
--- a/runtime/realm/dynamic_table.inl
+++ b/runtime/realm/dynamic_table.inl
@@ -138,8 +138,8 @@ namespace Realm {
   {
     // first, figure out how many levels the tree must have to find our index
     int level_needed = 0;
-    int elems_addressable = 1 << ALLOCATOR::LEAF_BITS;
-    while(index >= elems_addressable) {
+    size_t elems_addressable = 1 << ALLOCATOR::LEAF_BITS;
+    while(size_t(index) >= elems_addressable) {
       level_needed++;
       elems_addressable <<= ALLOCATOR::INNER_BITS;
     }

Although the hang is a Realm bug, we should probably also be looking at why the application has created 16M sparsity maps. @LonelyCat124 is your application recomputing partitions after the initial timestep?

LonelyCat124 commented 4 years ago

Will give that a go tomorrow morning.

Yes, the current code creates a lot of partitions. I naively use partitions to manage which particles are in which cells as they move, and thus repartition twice per timestep. I assumed that since I call __delete(partition) once a partition is finished with, this shouldn't cause problems. If regularly creating partitions is bad, I'll look into another way to implement this.
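
For concreteness, here is a rough sketch of that pattern written against the C++ Legion runtime API rather than Regent (the cell_id field, color space, and step count are hypothetical stand-ins; my actual code uses Regent's partition and __delete):

    // Hypothetical sketch of the repartition-per-timestep pattern (C++ Legion API).
    #include "legion.h"
    using namespace Legion;

    enum { CELL_ID_FID = 0 }; // assumed field holding each particle's cell id

    void timestep_loop(Context ctx, Runtime *runtime,
                       LogicalRegion particles, IndexSpace cell_colors) {
      const int num_steps = 10; // stand-in for the real step count
      for (int step = 0; step < num_steps; step++) {
        // Re-bin particles into cells based on their current cell_id values.
        IndexPartition ip = runtime->create_partition_by_field(
            ctx, particles, particles, CELL_ID_FID, cell_colors);
        LogicalPartition lp = runtime->get_logical_partition(ctx, particles, ip);
        (void)lp; // ... launch per-cell tasks over lp here ...
        // Delete the partition once this step's tasks are done with it,
        // analogous to Regent's __delete(partition).
        runtime->destroy_index_partition(ctx, ip);
      }
    }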

LonelyCat124 commented 4 years ago

It still got stuck with that patch - I'm rerunning with debugging now and will check what's happening again, and update this post.

UPDATE: Ok, so I don't quite understand what is happening to make this occur. Top of stack:

#0  Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry (this=0xe4ecb00, index=16777216, owner=0,
    free_list=0xe30f4d0)
    at /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:202
202           level_needed++;

From my understanding this means that INNER_BITS == 10 and LEAF_BITS == 4. Again:

(gdb) p index
$13 = 16777216
(gdb) p elems_addressable
$14 = 0
(gdb) p level_needed
$15 = 469354815

The code starts with 1 << 4 and repeatedly shifts left by 10 until it should succeed:

(gdb) p 1 << 4
$16 = 16
(gdb) p 1 << 14
$17 = 16384
(gdb) p 1 << 24
$18 = 16777216
(gdb) p index
$19 = 16777216
(gdb) p (size_t)(1) << 34
$20 = 17179869184
(gdb) p (size_t)index >= (size_t)(1) << 34
$21 = false

So this should definitely work, but for some reason it isn't. I'm rebuilding again, this time with a fresh terra install as well. I'm building Regent with: ./install.py --clean -j 4 --hdf5 --debug. As far as I can tell that should clear out any old built files, right?

Edit: I realised it's the same "bug" occurring in another place. I applied the same change at runtime/realm/dynamic_table.inl line 202 and am retesting.

LonelyCat124 commented 4 years ago

Full patch I used (tentatively, it is working):

--- a/runtime/realm/dynamic_table.inl
+++ b/runtime/realm/dynamic_table.inl
@@ -138,8 +138,8 @@ namespace Realm {
   {
     // first, figure out how many levels the tree must have to find our index
     int level_needed = 0;
-    int elems_addressable = 1 << ALLOCATOR::LEAF_BITS;
-    while(index >= elems_addressable) {
+    size_t elems_addressable = 1 << ALLOCATOR::LEAF_BITS;
+    while(size_t(index) >= elems_addressable) {
       level_needed++;
       elems_addressable <<= ALLOCATOR::INNER_BITS;
     }
@@ -197,8 +197,8 @@ namespace Realm {
   {
     // first, figure out how many levels the tree must have to find our index
     int level_needed = 0;
-    int elems_addressable = 1 << ALLOCATOR::LEAF_BITS;
-    while(index >= elems_addressable) {
+    size_t elems_addressable = 1 << ALLOCATOR::LEAF_BITS;
+    while(size_t(index) >= elems_addressable) {
       level_needed++;
       elems_addressable <<= ALLOCATOR::INNER_BITS;
     }

Also, please let me know if I shouldn't be regularly creating partitions, and what a better way to structure this in Legion/Regent would be.

LonelyCat124 commented 4 years ago

The code is now silently failing after 200 steps (~2.5 hours without debug enabled). With debugging enabled this would take ~100 hours, so before I consider that (I think my wallclock limit is 48 hours without an extra request; as a side note, can I build Regent with -g without disabling optimizations, just so stack traces are viewable?) I'm rerunning with the core file size set to unlimited to see if that yields anything. I'm also doing another run with LEGION_FREEZE_ON_ERROR=1.
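
(One thing I may try for the -g question, assuming install.py forwards the CC_FLAGS environment variable to the Legion build; untested:)

    # Untested sketch: add debug symbols without --debug (which disables optimization),
    # assuming install.py honours the CC_FLAGS environment variable.
    CC_FLAGS='-g' ./install.py --clean -j 4 --hdf5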

My only concern right now is that I don't appear to get anything in standard error or standard out about any kind of error, so I think these may not necessarily yield anything.

streichler commented 4 years ago

How many steps did this one run before hanging previously? Is there any indication at all of why the job failed? In particular, can you check whether it ran out of memory?

LonelyCat124 commented 4 years ago

Before the change it hung after 5 steps.

I just logged into the running node, and your guess about memory seems to be correct: this run is around that 200 step mark and is now using 96.4% of memory (I believe before, at the 5-step hang, it was ~30%).

I'll run a short number of steps (20?) with Legion GC, I guess? Do I need to do anything special for Regent to use that, or just add -level legion_gc=2 -logfile gc_%.log to the command line? Also, does it require debug mode to be enabled?
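
(For reference, the workflow I'm assuming is to generate per-node logs with those flags and then post-process them with the legion_gc.py script from the Legion tools directory; I haven't checked its exact options:)

    # Assumed Legion GC workflow; legion_gc.py's options may differ (see its --help).
    regent my_app.rg -level legion_gc=2 -logfile gc_%.log
    python legion/tools/legion_gc.py -l gc_*.log   # -l: leak analysis (assumed option)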

LonelyCat124 commented 4 years ago

Ok - I built Regent with CC_FLAGS=-DLEGION_GC ./install.py --clean -j 32 --with-terra terralang --hdf5 --debug

and tried to run with Legion GC. Unfortunately, I think the patch above perhaps breaks some of the debug assertions:

terra: /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:251: Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, typename ALLOCATOR::FreeList*) [with ALLOCATOR = Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10, 4>; Realm::DynamicTable<ALLOCATOR>::ET = Realm::SparsityMapImplWrapper; Realm::DynamicTable<ALLOCATOR>::IT = int; typename ALLOCATOR::FreeList = Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10, 4> >]: Assertion `(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)' failed.
terra: /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:251: Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, typename ALLOCATOR::FreeList*) [with ALLOCATOR = Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10, 4>; Realm::DynamicTable<ALLOCATOR>::ET = Realm::SparsityMapImplWrapper; Realm::DynamicTable<ALLOCATOR>::IT = int; typename ALLOCATOR::FreeList = Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10, 4> >]: Assertion `(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)' failed.
terra: /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/realm/dynamic_table.inl:251: Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, typename ALLOCATOR::FreeList*) [with ALLOCATOR = Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10, 4>; Realm::DynamicTable<ALLOCATOR>::ET = Realm::SparsityMapImplWrapper; Realm::DynamicTable<ALLOCATOR>::IT = int; typename ALLOCATOR::FreeList = Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10, 4> >]: Assertion `(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)' failed.

I don't have the binary, so I don't think inspecting the core file will work. @streichler is this related to the patch? I'm modifying the assert slightly to see if it will print me some values (since it's around 2.5 hours to the assert failure, at the same point as the prior hang).

Edit: My first look here didn't find anything; I was only printing level_needed and n->level, and these looked to satisfy level_needed <= n->level. I've uncommented the DEBUG_REALM requirement for this section, and am rerunning with a full cerr output of all the values used in the assert to see if I can track it down.

streichler commented 4 years ago

Yes, there's an overflow issue there as well. You can comment out that entire assert (there are two occurrences of it) for now.

For the configuration that used to hang after 5 steps, do you know how many partitions you had computed and how many subregions each of those partitions had? @lightsighter and I need to sit down and figure out what's needed to start reclaiming sparsity maps that are no longer needed, but I'm concerned that if the application needs 16M sparsity maps for only 5 time steps, the memory and computation cost for those sparsity maps is going to be dominating everything.

LonelyCat124 commented 4 years ago

Ok, I've removed those and am rerunning with Legion GC. The values from those asserts are overflowing in the way you probably expect:

Failure, level needed:3, n->level: 3
index: 16777232, first_index: 0
last_index: -1

That configuration would have constructed at most 11 partitions for the timestepping loop (1 initial, then 2 per timestep). The partitions would have been:

- 3x 14x7x7 (686 elements)
- 2x 11x11x5 (605 elements)
- 6x 12x6x6 (432 elements)

I also just realised it creates another partition per timestep, of size 4, which I'm leaking... I'll fix that before the GC run. Perhaps the leaked partition was causing the hang? If that seems plausible I can revert the patch and retry.

The partition sizes are chosen to spatially group particles while ensuring >= 200 particles per cell on average, so that tasks have enough work to make parallelism worthwhile (that's the idea, at least - in reality I think other things limit parallelism anyway). For now I've just implemented a basic cell-list/link-cell algorithm to test other functionality.

LonelyCat124 commented 4 years ago

Legion GC output with the previously-leaked partition now deleted seems fine - I'm going to attempt to rerun the full simulation, but here is the summary:

LEAK SUMMARY
  LEAKED FUTURES: 156
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 1
  LEAKED MANAGERS: 1
  Pinned Managers: 0
  Leaked Views: 0
  Leaked Equivalence Sets: 0
  LEAKED INDEX SPACES: 1
  Leaked Index Partitions: 0
  LEAKED FIELD SPACES: 1
  Leaked Regions: 0
  Leaked Partitions: 0

elliottslaughter commented 4 years ago

So what would help most is a comparison between two short runs, say N iterations and 2N (for some relatively short N). What we're interested in is mainly what's growing, so fixed sources of leaks are not a concern.

The other thing that would help would be a valgrind run. For this it would help to dump an executable and run that, since we really don't want to run the JIT through valgrind. But again, the same idea: run N and 2N iterations, so we can see what's growing and what stays the same.

streichler commented 4 years ago

I think this is probably good enough to rule out egregious leaks in the application itself. I know we're currently leaking sparsity maps in Realm, and the main open question in my mind is whether fixing that will help, or whether our real problem is an enormous number of sparsity maps being created (e.g. due to equivalence class issues).

LonelyCat124 commented 4 years ago

I reran the real simulation, and the code is still crashing after ~20 steps (I'm assuming my maths was out by a factor of 10 before when I said 200 steps; I only added the actual iteration counter last night). Based on how the performance is behaving, I'm reasonably sure this is a leak of some description, since performance degrades massively over the final steps, in line with the OS dumping memory to the page file (which also shows massive usage at that point).

I will run for 3 and 6 iterations with legion_gc, and then valgrind. I will probably also look into changing the algorithm to reduce/remove the need to repartition (and improve performance as a side effect) when I have the opportunity, but that's a medium-term fix on my end.

LonelyCat124 commented 4 years ago

3-step GC output:

LEAK SUMMARY
  LEAKED FUTURES: 90
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 1
  LEAKED MANAGERS: 1
  Pinned Managers: 0
  Leaked Views: 0
  Leaked Equivalence Sets: 0
  LEAKED INDEX SPACES: 1
  Leaked Index Partitions: 0
  LEAKED FIELD SPACES: 1
  Leaked Regions: 0
  Leaked Partitions: 0

6-step GC output:

LEAK SUMMARY
  LEAKED FUTURES: 156
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 1
  LEAKED MANAGERS: 1
  Pinned Managers: 0
  Leaked Views: 0
  Leaked Equivalence Sets: 0
  LEAKED INDEX SPACES: 1
  Leaked Index Partitions: 0
  LEAKED FIELD SPACES: 1
  Leaked Regions: 0
  Leaked Partitions: 0

Some futures are being leaked, but I don't think they're growing quickly enough to be what's crashing everything.

I'm not sure valgrind runs are going to be feasible at the moment, even for 3/6 steps: I had some running for 6 hours and they hadn't started the first step yet (they were busy in the "0th" half-step, which initialises a bunch of stuff). I'll set some running for 48 hours and see where they get to, though.

streichler commented 4 years ago

@LonelyCat124 I'd like to get @lightsighter 's take on this, but my suspicion is that in the short term, you'll need to use partitions with many fewer subregions (e.g. clumping the particles much more), and that once we have the new equivalence class stuff, we'll want to reassess things to figure out if recovery of sparsity map IDs and memory in Realm is the only remaining issue or whether the overhead of recomputing partitions per timestep is a dealbreaker on its own.

lightsighter commented 4 years ago

I would like to see a Legion Prof profile of this before we make any conclusions.

LonelyCat124 commented 4 years ago

I'll set a profile running for a small number of steps - I expect I'll have to share through google drive or similar.

Next week I'll look to start thinking about/implementing an alternate algorithm that doesn't require repartitioning often (or, hopefully, ever), so I can use things like tracing etc., which I expect I'll need for performance either way.

LonelyCat124 commented 4 years ago

Ok, so that profile (the .gz) is 9GB. Is there a good way to share that, or shall I just run a single timestep and try that instead? I expect the number of task launches is way too high right now for a feasible profile.

LonelyCat124 commented 3 years ago

When trying to decode the 9GB profile, the profiler process was killed; I assume this was due to an out-of-memory error (while running on a 128GB RAM node), but I can't be 100% sure. I'm testing a shorter run to see if I can get that profile (which is 2.1GB). If not, I'll use a smaller example for a few more steps.

The valgrind runs all failed to complete in 48 hours - I could run valgrind on a smaller example case, though the smallest case takes hundreds (maybe thousands) of timesteps to crash.

I was looking/thinking a bit more about alternative algorithms for (pairwise) particle methods with Legion/Regent to reduce the need for repeated partitioning but I think some partitioning is going to be unavoidable. I looked at a couple of the codes with particles listed in https://legion.stanford.edu/resources/ and I think the only one similar is the Barnes-Hut implementation, which does have to partition every timestep. All of the solutions I can think of would suffer the same issues as shown here for insane cases (some of which are used for numeric tests, e.g. adding the same large velocity to all particles in a steady state to check periodicity works correctly).

I will work on trying some alternative implementations; a couple of Regent questions I have:

  1. Is there any way to do:

    var a_partition = partition(...)
    ...
    a_partition = partition(...)

    I think previously @elliottslaughter said this was not possible. If not, it becomes difficult to use an if condition to determine whether a new partition is needed (which is how other particle-method implementations decide whether to regrid/rebuild neighbour lists etc.). Does it change if the partitions use the same index space (but, because they partition by value, would result in different partitions)?

  2. Does __demand(__trace) create a trace and then reuse it until the trace is no longer valid (i.e. a new partition is made), and then record a new trace to use until that one is invalidated in turn? (The pattern I have in mind is sketched below.)
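
For question 2, the kind of loop I have in mind is roughly the following runnable sketch (placeholder task and partition, not my actual code):

import "regent"

task step(r : region(ispace(int1d), int))
where reads writes(r) do
end

task main()
  var r = region(ispace(int1d, 100), int)
  fill(r, 0)
  var p = partition(equal, r, ispace(int1d, 4))
  -- the question: if p were replaced mid-run, does the trace get
  -- invalidated and re-recorded automatically?
  __demand(__trace)
  for t = 0, 10 do
    for c in p.colors do
      step(p[c])
    end
  end
end
regentlib.start(main)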

streichler commented 3 years ago

@LonelyCat124 Are you able to share the code and build/run instructions?

Also, are you re-grouping the particles after each timestep or does the code keep the particles in place and operate on ever-more-jumbled subsets of those particles in each task? If the former, we could maybe avoid the partitioning by using explicit gather copies, but we'd need progress on #704 (@elliottslaughter ?) for that to be possible in a Regent application.

LonelyCat124 commented 3 years ago

Yes, the repo is public with a dockerfile to set it up (https://github.com/stfc/RegentParticleDSL), but the example I'm working with likely needs a few tweaks to work outside of my own environments (and involves a second repo containing the data). I can sort that tomorrow morning and update here with instructions on building/running - if the dockerfile is sufficient to build, let me know; otherwise I can write up explicit instructions. It's also possible that some of the more recent changes slightly alter the behaviour discussed here, but I doubt it's in any meaningful way (the number/size of partitions hasn't changed, but the amount of parallelism could have improved).

At the moment I believe it's re-grouping the particles after each step: the particles move and have their cell_id field updated based upon their position / cell_size. The repartition is then computed with partition(particles.neighbour_part_space.cell_id, ispace(int3d, ...)).
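
For concreteness, here's a stripped-down runnable sketch of that step (everything except the cell_id field and the partition call is a placeholder, not my actual fspace):

import "regent"

fspace nbr {
  cell_id : int3d
}

fspace particle {
  pos : double[3],
  neighbour_part_space : nbr
}

-- recompute each particle's cell from its position
task update_cell_ids(particles : region(ispace(int1d), particle),
                     cell_size : double)
where reads(particles.pos),
      writes(particles.neighbour_part_space.cell_id) do
  for p in particles do
    p.neighbour_part_space.cell_id = int3d {
      x = int(p.pos[0] / cell_size),
      y = int(p.pos[1] / cell_size),
      z = int(p.pos[2] / cell_size)
    }
  end
end

task main()
  var particles = region(ispace(int1d, 1000), particle)
  fill(particles.pos, array(0.0, 0.0, 0.0))
  update_cell_ids(particles, 0.25)
  -- the repartition itself, as described above
  var cells = partition(particles.neighbour_part_space.cell_id,
                        ispace(int3d, {4, 4, 4}))
end
regentlib.start(main)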

elliottslaughter commented 3 years ago

@LonelyCat124: You can __import_partition multiple times; it's not pretty, but it's as close as we have to assignment at the moment without dedicated support (which is probably not coming soon).

The other option would seem to be trade queues, which is less idiomatic but should be efficient. That's what Soleil-X implements, and we can find a code sample if you're interested.

streichler commented 3 years ago

@LonelyCat124 a dockerfile is great - please let me know once you've got one with the necessary changes

LonelyCat124 commented 3 years ago

Ok - this dockerfile https://github.com/stfc/RegentParticleDSL/blob/master/runs/Dockerfile sets up everything needed to run this example. The image itself just launches an interactive session with docker run ... and then to run the example you need to do:

cd /RegentParticleDSL && regent src/interactions/MinimalSPH/program.rg

If you need a smaller example, you'll need to do:

python3.6 -m pip install -U pip
python3.6 -m pip install swiftsimio
cd /swiftsim/examples/HydroTests/PerturbedBox_3D
python3 makeIC.py 20
cd /RegentParticleDSL/
export SODSHOCK_INPUT=/swiftsim/examples/HydroTests/PerturbedBox_3D/perturbedBox.hdf5
##Run the program with new smaller input (20^3 parts)
regent src/interactions/MinimalSPH/program.rg

For the smaller example, to get it to crash you would probably need to change line 74 of src/interactions/MinimalSPH/program.rg to var endtime : double = 10000.0

I'll take a look at __import_partition for now and see if it's sufficient for what I need. What are the trade queues? I'd happily have a read about them and go from there.

manopapad commented 3 years ago

"Tradequeues" refers to how we implemented support for particles in Regent for Soleil-X (I don't think the term is canonical).

The setup was similar to yours: particles move, their cell pointer is updated based on their new position, and now some of them are on the wrong subregion. But because we didn't expect particles to move significantly from timestep to timestep (at most half a cell per timestep), we could have each sub-region collect all exiting particles into 26 out-buffers, each of which gets sent to the appropriate neighboring sub-region. Each sub-region then removes the particles that left and adds the particles coming in.
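
In very reduced 1D form, the push/pull steps might look something like the runnable sketch below (made-up names, a single shared queue, no overflow handling; the real code linked at the end of this comment uses 26 directional queues per sub-region and index-space launches):

import "regent"

fspace pt {
  cell    : int1d,  -- which sub-region this particle should live in
  pos     : double,
  __valid : bool    -- true if this slot actually holds a particle
}

-- collect exiting particles into the out-queue and free their slots
task push(src : region(ispace(int1d), pt),
          queue : region(ispace(int1d), pt),
          my_cell : int1d)
where reads writes(src, queue) do
  var qi = queue.bounds.lo  -- assumes the queue is large enough
  for p in src do
    if p.__valid and p.cell ~= my_cell then
      queue[qi].cell = p.cell
      queue[qi].pos = p.pos
      queue[qi].__valid = true
      qi += 1
      p.__valid = false
    end
  end
end

-- copy incoming particles into free slots on the destination
task pull(dst : region(ispace(int1d), pt),
          queue : region(ispace(int1d), pt),
          my_cell : int1d)
where reads writes(dst, queue) do
  for q in queue do
    if q.__valid and q.cell == my_cell then
      for p in dst do
        if not p.__valid then
          p.cell = q.cell
          p.pos = q.pos
          p.__valid = true
          break
        end
      end
      q.__valid = false
    end
  end
end

task main()
  var parts = region(ispace(int1d, 8), pt)
  var queue = region(ispace(int1d, 8), pt)
  fill(parts.__valid, false)
  fill(queue.__valid, false)
  push(parts, queue, 0)
  pull(parts, queue, 0)
end
regentlib.start(main)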

There are some technical requirements to achieve this in Regent.

Now that I think of it, we could probably achieve the same effect much more cleanly by using explicit scatter/gather copies instead of buffers and trading (contingent on #704, as Sean mentioned).

Here's a couple of slides on this: soleil-particles.pdf, and here's most of the relevant code: https://github.com/stanfordhpccenter/soleil-x/blob/b85d6e55b57e985b4a4fb1d4926fa39d51cd7e76/src/soleil.rg#L3378-L3526.

LonelyCat124 commented 3 years ago

Ah, that makes sense - it sounds similar to how my algorithm would be implemented in a classic C+MPI approach.

I think I'll try that approach when I reimplement my neighbour search (next on my list, once I've built some tests to catch bugs...)

LonelyCat124 commented 3 years ago

I've started looking at TradeQueues as an option to remove/reduce repartitioning, but I don't think they can totally remove repartitioning in my case, as I have less control over the stability of the potential input systems over time. For example, a relatively common MD testcase would have reasonably homogeneous particle density and particle motion between cells over time (so TradeQueues completely remove the need for repartitioning), whilst an SPH dam break has a heavily inhomogeneous initial condition, and the dense regions of particles move entirely over time (until the system reaches rest), so repartitioning is still needed, though much less frequently than at present.

@manopapad One thing I'm not quite clear on from your code is what data type you are using for the tradequeues (due to the UTIL header) - I assume you're just storing indices in the queues for tradequeues[src][dest]?

@streichler I didn't tag you when I posted the dockerfile last week, so I just realised you may have missed it. I've made some more changes to the code that fix a few bugs in neighbour search, but they shouldn't have changed the repartitioning or anything, so the bug should still occur; I'm going to double check tonight and will let you know.

One final query: if I use __demand(__trace) and later have to repartition, will the trace be used up until the repartition, and then a new trace generated?

manopapad commented 3 years ago

Here's a list of all the fields on the particles region, with notes on how each is handled during trading:

-- these are copied onto the tradequeue when a particle moves
-- and written to the particle's new slot on the target sub-region
cell : int3d;
position : double[3];
velocity : double[3];
temperature : double;
diameter : double;
density : double;
deltaVelocityOverRelaxationTime : double[3];
deltaTemperatureTerm : double;
position_old : double[3];
velocity_old : double[3];
temperature_old : double;
position_new : double[3];
velocity_new : double[3];
temperature_new : double;
-- these are scratch fields that are valid only within an RK substep
-- so as an optimization they are not included in the trading
velocity_t : double[3];
temperature_t : double;
-- this flag is true if a particle slot actually contains a particle
-- this applies to both the main particles region and tradequeues 
__valid : bool;
-- used in TradeQueue_push to mark which direction a particle is moving (values: 0-26 inclusive)
__xfer_dir : int8;
-- temporary with two uses:
-- in TradeQueue_push: which index on the appropriate tradequeue to use for an outgoing particle
-- in TradeQueue_pull: which index on the incoming tradequeues to copy into an empty particle slot
__xfer_slot : int64;

The dam break case sounds interesting. You could potentially follow a two-tier approach, where you size the particle sub-regions according to the starting concentration, leaving some extra room to accommodate particle movement up to a point. As long as this initial partition remains sufficient, you follow a "lightweight" strategy like tradequeues or scatter/gather copies. When the concentration shifts too far from the original split, you do a "heavyweight" repartitioning (possibly with accompanying compaction within the new partitions) to better align with the current concentration.
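
In control-flow terms, something like this sketch (all names hypothetical, both branches left as comments):

import "regent"

fspace pt {
  __valid : bool
}

-- stand-in for a check that the original split still has headroom
task split_still_ok(parts : region(ispace(int1d), pt)) : bool
where reads(parts.__valid) do
  return true
end

task main()
  var parts = region(ispace(int1d, 1024), pt)
  fill(parts.__valid, false)
  for step = 0, 100 do
    -- ... physics tasks ...
    if split_still_ok(parts) then
      -- lightweight: tradequeues or scatter/gather copies
    else
      -- heavyweight: repartition by field, possibly compacting slots
    end
  end
end
regentlib.start(main)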

LonelyCat124 commented 3 years ago

@streichler I fixed a couple of bugs so the Dockerfile linked previously should correctly replicate the bug/hang now.

LonelyCat124 commented 3 years ago

I've started a TradeQueue implementation, and I attempted to keep the particles contiguous in the partitions; however, I realised the implementation relied on looping backwards over the arrays, which I don't think is possible? There's no for x in y.ispace.reverse() or similar in Regent/Legion, right? I'm not sure it's necessarily beneficial to keep the particles contiguous anyway.

Edit: Also is it safe to assume that an equal (1D) partition of a 1D region will result in contiguous indices?

LonelyCat124 commented 3 years ago

@elliottslaughter when using __import_partition should I still delete the partition I'm replacing? This code works unless I uncomment the __delete.

--    __delete([neighbour_init.cell_partition])
    format.println("Creating new partition")
    var n_cells = [variables.config][0].neighbour_config.x_cells * [variables.config][0].neighbour_config.y_cells * [variables.config][0].neighbour_config.z_cells
    var x_cells = [variables.config][0].neighbour_config.x_cells
    var y_cells = [variables.config][0].neighbour_config.y_cells
    var z_cells = [variables.config][0].neighbour_config.z_cells
    var space_parameter = ispace(int3d, {x_cells, y_cells, z_cells}, {0,0,0})
    var raw_lp1 = __raw(partition([neighbour_init.padded_particle_array].neighbour_part_space.cell_id, space_parameter))
    var [neighbour_init.cell_partition] = __import_partition(disjoint, [neighbour_init.padded_particle_array], space_parameter, raw_lp1);

in which case I get Legion error 482

elliottslaughter commented 3 years ago

I think when you __import_partition, you probably shouldn't manually __delete since it came from outside the scope of Regent's analysis.

elliottslaughter commented 3 years ago

I think an equal partition will currently generate contiguous subregions in all cases (and that should certainly hold in the 1D case). So you could loop backwards with a manual for loop.

Otherwise "backwards" doesn't really make sense because there isn't an order defined for multi-dimensional loops in the first place.

LonelyCat124 commented 3 years ago

Ok, I'll remove the `__delete`.

Yeah, I think the contiguity shouldn't change performance much, so it's not a big deal, but for my non-equal partitions I originally wanted to do the equivalent of:

for(x=0; x < num_parts; x++)
  if(!part[x].valid) break;

which would require that slots 0 through n_valid are all valid and all slots after are not; the easiest way to maintain that would be to loop from the end, but since we're iterating over an index space I don't think it makes much sense anyway.
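
For reference, if I did want the manual backwards scan over a contiguous 1D range, I think it would look something like this (runnable sketch with a hypothetical fspace):

import "regent"

fspace pt {
  __valid : bool
}

-- scan from the back of a contiguous 1D range for the last valid slot
task last_valid(r : region(ispace(int1d), pt)) : int64
where reads(r.__valid) do
  var lo : int64 = r.bounds.lo
  var hi : int64 = r.bounds.hi
  for off = 0, hi - lo + 1 do
    var i = hi - off  -- walk from the back
    if r[i].__valid then return i end
  end
  return lo - 1  -- no valid slots
end

task main()
  var parts = region(ispace(int1d, 16), pt)
  fill(parts.__valid, false)
  parts[3].__valid = true
  regentlib.assert(last_valid(parts) == 3, "scan failed")
end
regentlib.start(main)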

LonelyCat124 commented 3 years ago

@elliottslaughter I think I'm still making a mistake with the repartitioning declaration.

I have a function that returns an rquote which contains:

--Code using old partition (stored in symbol [neighbour_init.cell_partition])
    var raw_lp1 = __raw(partition([neighbour_init.padded_particle_array].neighbour_part_space.cell_id, space_parameter))
    var [neighbour_init.cell_partition] = __import_partition(disjoint, [neighbour_init.padded_particle_array], space_parameter, raw_lp1);

My assumption was that this side effect would be visible, but it's not, as I think the new symbol is only visible inside the rquote. Future accesses (outside of the rquote) use the old partition, which was also generated using an rquote. Should I declare var [neighbour_init.cell_partition] : partition(...) in the main task and then not redeclare the variable (just set it with [x] = __import_partition), or is there some other way I can do this?

elliottslaughter commented 3 years ago

Right, you have to do an assignment rather than a var here or else it will turn into a separate variable declaration.
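
Concretely, the difference is (using your names, arguments elided):

-- inside the rquote, this declares a NEW variable that shadows the
-- symbol, so the update is invisible outside the quote:
var [neighbour_init.cell_partition] = __import_partition(disjoint, ...)

-- this assigns to the variable already declared in the main task,
-- which is what you want for a replacement:
[neighbour_init.cell_partition] = __import_partition(disjoint, ...)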

I'm also slightly concerned that Regent will delete your raw_lp1 before you want it to, because I'm not sure the escape analysis is smart enough for this case. You may need to switch to using C APIs to generate the partition. (Not fun, I know, but that's how things are at the moment.)

LonelyCat124 commented 3 years ago

Checking I understand - inside the main task I'd do:

task main_task()
var [neighbour_init.cell_partition] : --what type goes here? Just a normal partition type as I'd declare for passing to a task?
[quote_generating_function()];

If I need to use the C API, where's the best place to look for the declarations I need?

Thanks again for the help with this!

elliottslaughter commented 3 years ago

That would be a regentlib.c.legion_logical_partition_t.

And you can find e.g. the create partition by field call here: https://github.com/StanfordLegion/legion/blob/master/runtime/legion/legion_c.h#L1268-L1283

You might also want to see http://regent-lang.org/reference (search for "Calling the Legion C API", sorry the section links are broken right now).

LonelyCat124 commented 3 years ago

Hmm, I'm still unable to assign/reassign using __import_partition (I've not tried more expansive use of the C API yet). This code:

fspace part {
  thing : int,
  thing2 : int,
  int3d_parameter : int3d
}

task main()
  var i2 = __import_ispace(int3d, __raw(ispace(int3d, {6, 6, 6})))
  var r = region(ispace(int1d, 1296), part)
  var part1 : partition(disjoint, r, i2)
  var temp1 = __raw(partition(r.int3d_parameter, i2));
  part1 = __import_partition(disjoint, r, i2, temp1);
end

fails with

test.rg:22: type mismatch in assignment: expected partition(disjoint, $r, $i2) but got partition(disjoint, $r, $i2)
   part1 =  __import_partition(disjoint, r, i2, temp1);

Changing instead to:

  var temp1 = __raw(partition(r.int3d_parameter, i2));
  var part1 =  __import_partition(disjoint, r, i2, temp1);
  var temp2 = __raw(partition(r.int3d_parameter, i2));
  part1 = __import_partition(disjoint, r, i2, temp2);

has the same error.

From delving into Regent a bit I think this is caused by the same logic in type checking that prevents:

a = partition(...)
a = partition(...)

From reading about the C API interactions, I'm wondering if this is due to:

Important: C API handles can only be imported once into Regent. Subsequent attempts to import C API handles will fail. All objects created by Regent are considered to be already imported and thus cannot be imported again using this mechanism. These restrictions guarantee that certain assumptions made by the Regent compiler are not violated, and are critical to ensuring that Regent’s type checker and optimizer work correctly.

I'm going to try building the partition entirely using the C API and see if I can then do subsequent imports to the same variable or not.

Edit: Same issue using the C API directly. I assume the only way I could do this would be to never use __import_partition and always use the C legion_logical_partition_t, but that would basically be reinventing the wheel w.r.t. Regent at that point. For now I could disable repartitioning and instead throw an error when the conditions are invalid (explaining why, and how to work around it, to the user), but the workarounds would likely result in higher memory and/or performance overheads.

LonelyCat124 commented 3 years ago

Ok, so I made a little test code to look at performance vs OpenMP (https://github.com/LonelyCat124/ParticleTest), and also used the profiler's -s option to dump some statistics to try to see what's happening. For both the program discussed in this issue and the test code, the main output from the profiler seems to be (data here is from the program discussed in this issue):

  Defer Task Perform Mapping
       Total Invocations: 13719
       Total Time: 44206057310 us
       Average Time: 3222250.70 us
       Maximum Time: 78887502 us (31415.616 sig)
       Minimum Time: 6 us (-1337.850 sig)

  Task Physical Dependence Analysis
       Total Invocations: 682134
       Total Time: 1395087867 us
       Average Time: 2045.18 us
       Maximum Time: 54498 us (2287.576 sig)
       Minimum Time: 11 us (-88.709 sig)

By comparison, here are the runtimes of the main tasks in the code - a totally unoptimised task and the (expected) heaviest task:

  update_cutoffs_launcher [primary]
       Total Invocations: 2
       Total Time: 544281583 us
       Average Time: 272140791.59 us
       Maximum Time: 296032166 us (4887.880 sig)
       Minimum Time: 248249416 us (-4887.880 sig)

  pair_redo_density [primary]
       Total Invocations: 505856
       Total Time: 469183273 us
       Average Time: 927.50 us
       Maximum Time: 30961 us (870.406 sig)
       Minimum Time: 54 us (-25.307 sig)

Is there something I can look into to see why the Defer Task Perform Mapping runtime is so high - and is that active time, or can it just be sitting on the stack while other work happens? One thing I've noticed is that using runtime->create_partition_by_field seems to lead to significantly worse (100+ times slower) performance - is that expected behaviour?

manopapad commented 3 years ago

I'm not sure this statistical analysis is giving the full picture; the runtime analysis tasks typically run concurrently with application tasks, on a separate core. I would suggest you make a full profiler visualization (process the generated profiler log files with tools/legion_prof.py <log-files>) and inspect the resulting html files in a browser.
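
For example, assuming the default output location:

python3 tools/legion_prof.py prof_*.gz
##then serve the generated legion_prof directory and open it in a browser
cd legion_prof && python3 -m http.server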

using runtime->create_partition_by_field seems to lead to significantly worse (100+ times slower) performance

compared to what?

LonelyCat124 commented 3 years ago

Unfortunately I've not been able to load the profiles in a browser (they're too large - it just sits on loading for 12+ hours, served via python3 -m http.server), which is why I resorted to the statistics. I think I can filter out a small subset of the runtime to hopefully create a small timeline to view.

Some of this seemed to be due to having multiple util threads (notably, 4 util threads resulted in 2x slower code than 1 util thread for https://github.com/LonelyCat124/ParticleTest, and made no noticeable difference for the results above). I was comparing to runtime->create_equal_partition. The work done by the tasks will be less overall with an equal partition; however, for that code I was seeing:

- Equal partition, 4 util threads: ~9s (presumably lower with fewer util threads)
- Field partition, 1 util thread: ~25s (averaged)

The timed region is: https://github.com/LonelyCat124/ParticleTest/blob/c03454c0d9d28e33de392fb2f7caae97d8024fc6/legion_version.cc#L319-L325

For the partition-by-field version, the Legion profiler statistics showed the timed task runtimes as:

  Self task [self task]
       Total Invocations: 1000
       Total Time: 1922659 us
       Average Time: 1922.66 us
       Maximum Time: 3187 us (75.660 sig)
       Minimum Time: 949 us (-58.219 sig)

  Timestep task [timestep task]
       Total Invocations: 1000
       Total Time: 108830 us
       Average Time: 108.83 us
       Maximum Time: 814 us (147.933 sig)
       Minimum Time: 68 us (-8.531 sig)

So the compute time is ~2s, which means the difference (and the computation time) comes, I assume, from runtime work (and I expect it's caused by my bad code). The OpenMP task comparison code creates a local AoS for each cell, and performs the same tasks in ~1.18s in serial.

manopapad commented 3 years ago

Thank you for the measurements. However it is hard for me to understand what is causing the slowdown from just these numbers, so I would really like to look at a profile.

Unfortunately I've not been able to load the profiles in a browser (they're too large - it just sits on loading for 12+ hours, served via python3 -m http.server), which is why I resorted to the statistics. I think I can filter out a small subset of the runtime to hopefully create a small timeline to view.

You can try to reproduce this with shorter runs (or at least remove parts of the computation that are irrelevant for this). You can also use the --start-trim and --stop-trim options on legion_prof.py to process only parts of the log files.
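
For example (I believe the trim values are in microseconds):

python3 tools/legion_prof.py --start-trim 10000000 --stop-trim 20000000 prof_*.gz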

Some of this seemed to be due to having multiple util threads (notably, 4 util threads resulted in 2x slower code than 1 util thread for https://github.com/LonelyCat124/ParticleTest, and made no noticeable difference for the results above)

I have seen this behavior myself. Unfortunately it seems like more util processors are not always a win. It likely comes down to whether there is enough independent runtime work, at a large enough granularity, that using >1 threads gives an overall speedup, versus getting slowed down by the added synchronization.

I was comparing [runtime->create_partition_by_field] to runtime->create_equal_partition

The partition-by-field operation itself (where you have to actually read the values of a region's fields) is more heavyweight than the equal-partition operation (where you just draw up the sub-region boundaries using index math), so the slowdown makes some sense there. However, partitioning should be an infrequent operation. Is this partitioning happening on each pass through this code, and is that code executed repeatedly?

The timed region is: https://github.com/LonelyCat124/ParticleTest/blob/c03454c0d9d28e33de392fb2f7caae97d8024fc6/legion_version.cc#L319-L325

Shouldn't Future start = runtime->get_current_time_in_microseconds(ctx); come right after the first fence?