Open elliottslaughter opened 2 months ago
Can you try this again now that the inordercommit
branch has merged? I think this line of code doesn't exist anymore.
Here's what I'm hitting now:
[1 - 7f7f0c144c40] 0.340909 {5}{runtime}: [error 61] LEGION ERROR: Invalid region handle (4c,2,2) for region requirement 0 of task void_leaf (ID 41) (from file /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:1097)
Backtrace:
#7 0x00007f07b123c27a in Legion::Internal::Runtime::report_error_message (id=id@entry=61, file_name=file_name@entry=0x7f07b17ca460 "/scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc", line=line@entry=1097,
message=message@entry=0x7f073290a4a0 "Invalid region handle (4c,2,2) for region requirement 0 of task void_leaf (ID 37)") at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32155
#8 0x00007f07b10c6125 in Legion::Internal::TaskOp::perform_privilege_checks (this=this@entry=0x7f073400f260) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:1097
#9 0x00007f07b10e0367 in Legion::Internal::IndividualTask::perform_base_dependence_analysis (this=this@entry=0x7f073400f260) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:6074
#10 0x00007f07b10e037d in Legion::Internal::IndividualTask::trigger_dependence_analysis (this=0x7f073400f260) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:6064
#11 0x00007f07b0f98763 in Legion::Internal::Operation::execute_dependence_analysis (this=0x7f073400f410) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_ops.cc:1605
#12 0x00007f07b0eb1e9b in Legion::Internal::InnerContext::process_dependence_stage (this=0x7f0734010ec0) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:8426
#13 0x00007f07b12c4878 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f0730009520, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32300
#14 0x00007f07aff47e95 in Realm::LocalTaskProcessor::execute_task (this=0x5584ac1cb710, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#15 0x00007f07aff99f23 in Realm::Task::execute_on_processor (this=0x7f07300093a0, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
This is on Legion commit e20bb9f50c4c6715c39b4c749df076fa934e40b4
If you want to look at a process you can find it here:
Process 1260410 on node c0001.stanford.edu is frozen!
What commit of the fuzzer is this on, and what is the command line for reproducing it?
Fuzzer commit: https://github.com/StanfordLegion/fuzzer/commit/35b7448e945b831a84e7651bb12259e09b35028e (this is still the most recent version).
Command line for Sapling (assuming you want to run on a compute node):
salloc -n 1 -N 1 -c 40 -p cpu --exclusive
REALM_FREEZE_ON_ERROR=1 srun -n 2 build/src/fuzzer -fuzz:seed 0 -fuzz:ops 1000 -ll:util 2 -ll:cpu 3
I have a build on Sapling at /scratch/eslaught/fuzzer
that you can feel free to use if you want. The Legion for this build is under the legion
subdirectory, and the CMake build is in legion/build
. You can rebuild Legion with:
cd /scratch/eslaught/fuzzer/legion/build
make install -j20
cd ../../build
make clean
make -j20
This error mode has gone away now. However, when I run a larger number of tests I still see some very low probability, non-deterministic failures. They do not reproduce when I rerun, so the only way to see this happen is to run the parallel test framework so that we can thrash the machine as hard as possible.
If you're able to look at this now, I've created a set of frozen processes you can look at: c0001
process 2135465
, for example. (There are several frozen processes on the node, you can ps aux | grep fuzzer
to find them.)
Sample backtrace in a frozen process (looks like gdb got a bit confused with some of the arguments, but hopefully still mostly readable):
#8 0x00007f4aa44d5fd6 in __GI___assert_fail (assertion=assertion@entry=0x7f4aa64de782 "partial_invalidations.empty()",
file=file@entry=0x7f4aa64cd408 "/scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc", line=line@entry=21010,
function=function@entry=0x7f4aa64d9338 "void Legion::Internal::EquivalenceSet::apply_state(Legion::Internal::LegionMap<Legion::Internal::IndexSpaceExpression*, Legion::Internal::FieldMaskSet<Legion::Internal::LogicalView> >&, Legion::Intern"...) at assert.c:101
#9 0x00007f4aa5b51bc3 in Legion::Internal::EquivalenceSet::apply_state (this=<optimized out>, valid_updates=..., initialized_updates=...,
invalidated_updates=..., Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
reduction_updates=std::map with 0 elements, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
restricted_updates=std::map with 0 elements, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
released_updates=std::map with 0 elements, precondition_updates=0x0, anticondition_updates=0x0, postcondition_updates=0x0,
read_only_guard_updates=0x7f4a0c4034a0, reduction_fill_guard_updates=0x7f4a0c403510, applied_events=std::vector of length 0, capacity 0,
needs_lock=false, forward_to_owner=false, unpack_references=true, filter_invalidations=false)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_utilities.h:3159
#10 0x00007f4aa5b51c9b in Legion::Internal::EquivalenceSet::apply_state (this=0x7f4a2ab61ca0, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
valid_updates=std::map with 2 elements,
initialized_updates=..., invalidated_updates=..., Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
reduction_updates=std::map with 0 elements, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
restricted_updates=std::map with 0 elements,
Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
released_updates=std::map with 0 elements, precondition_updates=0x0, anticondition_updates=0x0, postcondition_updates=0x0,
read_only_guard_updates=0x7f4a0c4034a0, reduction_fill_guard_updates=0x7f4a0c403510, applied_events=std::vector of length 0, capacity 0,
needs_lock=true, forward_to_owner=false, unpack_references=true, filter_invalidations=false)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc:20910
#11 0x00007f4aa5b62fab in Legion::Internal::EquivalenceSet::unpack_state_and_apply (this=0x7f4a2ab61ca0, derez=..., source=0,
applied_events=std::vector of length 0, capacity 0, forward_to_owner=<optimized out>, filter_invalidations=<optimized out>)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc:20147
#12 0x00007f4aa5b63da6 in Legion::Internal::EquivalenceSet::handle_migration (derez=..., runtime=runtime@entry=0x55fa9e6b2370,
source=source@entry=0) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc:19744
#13 0x00007f4aa5fb1832 in Legion::Internal::Runtime::handle_equivalence_set_migration (this=this@entry=0x55fa9e6b2370, derez=...,
source=source@entry=0) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:25814
#14 0x00007f4aa603b067 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f4a2270d8a0, num_messages=1, runtime=0x55fa9e6b2370,
remote_address_space=0, args=0x7f4a18329730 "", arglen=600) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:13282
#15 0x00007f4aa603c391 in Legion::Internal::VirtualChannel::process_message (this=0x7f4a2270d8a0, args=0x7f4a18329714, arglen=620,
runtime=0x55fa9e6b2370, remote_address_space=0) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:12045
#16 0x00007f4aa603cb40 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f4a18329700, arglen=636, userdata=<optimized out>,
userlen=<optimized out>, p=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32345
#17 0x00007f4aa4cba555 in Realm::LocalTaskProcessor::execute_task (this=0x55fa9d4e7c10, func_id=4, task_args=...)
at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#18 0x00007f4aa4d0c573 in Realm::Task::execute_on_processor (this=0x7f4a184ac4a0, p=...)
at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
#19 0x00007f4aa4d0c62a in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>)
at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1687
#20 0x00007f4aa4d0aab4 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fa9d4e7fe0)
at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1158
To reproduce yourself, you'll have to run the parallel runner set to freeze:
salloc -n 1 -N 1 -c 40 -p cpu --exclusive
REALM_FREEZE_ON_ERROR=1 REALM_SYNTHETIC_CORE_MAP= ./runner.py --fuzzer=build/src/fuzzer -j40 -n100 -o1000 --extra='-ll:util 2 -ll:cpu 3' --launcher='srun -n 2 --overlap' -vv
Let this run for about 10 minutes. You can watch progress on htop. About 95% of the processes should eventually exit, leaving a set that are hung in execution. At this point you can pick a random process to attach to with a debugger.
Fuzzer version still the same at: https://github.com/StanfordLegion/fuzzer/commit/35b7448e945b831a84e7651bb12259e09b35028e
Legion version at: eceb1cd2ff26c55d17fc263a4ab020d4e0bdd4d4
Notes for myself: the partial_invalidations
and invalidated_updates
are identical.
(gdb) p this->partial_invalidations
$4 = {
<Legion::Internal::LegionHeapify<Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false> >> = {<No data fields>},
members of Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false>:
entries = {
single_entry = 0x7f4a1670cb70,
multi_entries = 0x7f4a1670cb70
},
valid_fields = {
<BitMaskHelp::Heapify<SSETLBitMask<512> >> = {<No data fields>},
members of SSETLBitMask<512>:
bits = {
bit_vector = {2, 0, 0, 0, 0, 0, 0, 0}
},
sum_mask = 2
},
single = true
}
(gdb) p invalidated_updates
$5 = (Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false> &) @0x7f4a0c403430: {
<Legion::Internal::LegionHeapify<Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false> >> = {<No data fields>},
members of Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false>:
entries = {
single_entry = 0x7f4a1670cb70,
multi_entries = 0x7f4a1670cb70
},
valid_fields = {
<BitMaskHelp::Heapify<SSETLBitMask<512> >> = {<No data fields>},
members of SSETLBitMask<512>:
bits = {
bit_vector = {2, 0, 0, 0, 0, 0, 0, 0}
},
sum_mask = 2
},
single = true
}
Running the fuzzer on a multi-node, non-control replicated execution fails with:
Process here:
Backtrace:
Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/35b7448e945b831a84e7651bb12259e09b35028e
Legion: 299dd7d9f4a60fb8fa024d0252f2bd28c0a3f820