StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
657 stars 146 forks source link

Fuzzer: multi-node failure, non-control replicated #1674

Open elliottslaughter opened 2 months ago

elliottslaughter commented 2 months ago

Running the fuzzer on a multi-node, non-control replicated execution fails with:

fuzzer: /scratch/eslaught/fuzzer/legion/runtime/realm/runtime_impl.cc:2954: Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event): Assertion `id.is_event()' failed.

Process here:

Process 741178 on node c0001.stanford.edu is frozen!

Backtrace:

#8  0x00007f3cf334cfd6 in __GI___assert_fail (assertion=assertion@entry=0x7f3cf41b4e4d "id.is_event()", file=file@entry=0x7f3cf41b5408 "/scratch/eslaught/fuzzer/legion/runtime/realm/runtime_impl.cc", line=line@entry=2954, 
    function=function@entry=0x7f3cf41b5fa8 "Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event)") at assert.c:101
#9  0x00007f3cf3b4c84f in Realm::RuntimeImpl::get_genevent_impl (this=<optimized out>, e=..., e@entry=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/utils.inl:288
#10 0x00007f3cf3aaad95 in Realm::get_genevent_impl (e=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/runtime_impl.h:526
#11 Realm::GenEventImpl::trigger (poisoned=<optimized out>, e=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/event_impl.inl:61
#12 Realm::UserEvent::trigger (this=this@entry=0x7f3c741fffa0, wait_on=<error reading variable: Cannot access memory at address 0x0>, ignore_faults=ignore_faults@entry=false) at /scratch/eslaught/fuzzer/legion/runtime/realm/event_impl.cc:408
#13 0x00007f3cf4ba7f7f in Legion::Internal::Runtime::trigger_event (precondition=..., to_trigger=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.h:5436
#14 Legion::Internal::Operation::deactivate (this=this@entry=0x7f3c7677eb00, freeop=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_ops.cc:1268
#15 0x00007f3cf4ba829d in Legion::Internal::MemoizableOp::deactivate (this=this@entry=0x7f3c7677eb00, freeop=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_ops.cc:4361
#16 0x00007f3cf4ba82f6 in Legion::Internal::PredicatedOp::deactivate (this=this@entry=0x7f3c7677eb00, freeop=freeop@entry=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_ops.cc:4472
#17 0x00007f3cf4cbd951 in Legion::Internal::TaskOp::deactivate (this=this@entry=0x7f3c7677e950, freeop=freeop@entry=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:367
#18 0x00007f3cf4cd034e in Legion::Internal::SingleTask::deactivate (this=this@entry=0x7f3c7677e950, freeop=freeop@entry=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:2286
#19 0x00007f3cf4cd07a8 in Legion::Internal::IndividualTask::deactivate (this=0x7f3c7677e950, freeop=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:5944
#20 0x00007f3cf4ce8c56 in Legion::Internal::IndividualTask::unpack_task (this=0x7f3c7677e950, derez=..., current=..., ready_events=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:6729
#21 0x00007f3cf4cc61d2 in Legion::Internal::TaskOp::process_unpack_task (rt=rt@entry=0x564e84ad4600, derez=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:542
#22 0x00007f3cf4e2e0b9 in Legion::Internal::Runtime::handle_task (this=this@entry=0x564e84ad4600, derez=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:23924
#23 0x00007f3cf4ebc612 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f3c74201660, num_messages=1, runtime=0x564e84ad4600, remote_address_space=1, args=0x7f3c680179b0 "", arglen=592) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:12152
#24 0x00007f3cf4ebc8a9 in Legion::Internal::VirtualChannel::process_message (this=0x7f3c74201660, args=0x7f3c68017994, arglen=612, runtime=0x564e84ad4600, remote_address_space=1) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:11989
#25 0x00007f3cf4ebd025 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f3c68017980, arglen=628, userdata=<optimized out>, userlen=<optimized out>, p=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32205
#26 0x00007f3cf3b2a325 in Realm::LocalTaskProcessor::execute_task (this=0x564e8364a010, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#27 0x00007f3cf3b7c0d3 in Realm::Task::execute_on_processor (this=0x7f3c68017700, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
#28 0x00007f3cf3b7c18a in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1687
#29 0x00007f3cf3b7a614 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x564e8390a3a0) at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1158

Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/35b7448e945b831a84e7651bb12259e09b35028e

Legion: 299dd7d9f4a60fb8fa024d0252f2bd28c0a3f820

lightsighter commented 1 month ago

Can you try this again now that the inordercommit branch has merged? I think this line of code doesn't exist anymore.

elliottslaughter commented 1 month ago

Here's what I'm hitting now:

[1 - 7f7f0c144c40]    0.340909 {5}{runtime}: [error 61] LEGION ERROR: Invalid region handle (4c,2,2) for region requirement 0 of task void_leaf (ID 41) (from file /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:1097)

Backtrace:

#7  0x00007f07b123c27a in Legion::Internal::Runtime::report_error_message (id=id@entry=61, file_name=file_name@entry=0x7f07b17ca460 "/scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc", line=line@entry=1097, 
    message=message@entry=0x7f073290a4a0 "Invalid region handle (4c,2,2) for region requirement 0 of task void_leaf (ID 37)") at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32155
#8  0x00007f07b10c6125 in Legion::Internal::TaskOp::perform_privilege_checks (this=this@entry=0x7f073400f260) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:1097
#9  0x00007f07b10e0367 in Legion::Internal::IndividualTask::perform_base_dependence_analysis (this=this@entry=0x7f073400f260) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:6074
#10 0x00007f07b10e037d in Legion::Internal::IndividualTask::trigger_dependence_analysis (this=0x7f073400f260) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:6064
#11 0x00007f07b0f98763 in Legion::Internal::Operation::execute_dependence_analysis (this=0x7f073400f410) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_ops.cc:1605
#12 0x00007f07b0eb1e9b in Legion::Internal::InnerContext::process_dependence_stage (this=0x7f0734010ec0) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:8426
#13 0x00007f07b12c4878 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f0730009520, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32300
#14 0x00007f07aff47e95 in Realm::LocalTaskProcessor::execute_task (this=0x5584ac1cb710, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#15 0x00007f07aff99f23 in Realm::Task::execute_on_processor (this=0x7f07300093a0, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39

This is on Legion commit e20bb9f50c4c6715c39b4c749df076fa934e40b4

If you want to look at a process you can find it here:

Process 1260410 on node c0001.stanford.edu is frozen!
lightsighter commented 3 weeks ago

What commit of the fuzzer is this on, and what is the command line for reproducing it?

elliottslaughter commented 3 weeks ago

Fuzzer commit: https://github.com/StanfordLegion/fuzzer/commit/35b7448e945b831a84e7651bb12259e09b35028e (this is still the most recent version).

Command line for Sapling (assuming you want to run on a compute node):

salloc -n 1 -N 1 -c 40 -p cpu --exclusive
REALM_FREEZE_ON_ERROR=1 srun -n 2 build/src/fuzzer -fuzz:seed 0 -fuzz:ops 1000 -ll:util 2 -ll:cpu 3
elliottslaughter commented 3 weeks ago

I have a build on Sapling at /scratch/eslaught/fuzzer that you can feel free to use if you want. The Legion for this build is under the legion subdirectory, and the CMake build is in legion/build. You can rebuild Legion with:

cd /scratch/eslaught/fuzzer/legion/build
make install -j20
cd ../../build
make clean
make -j20
lightsighter commented 1 week ago

Try with https://gitlab.com/StanfordLegion/legion/-/merge_requests/1334

elliottslaughter commented 1 week ago

This error mode has gone away now. However, when I run a larger number of tests I still see some very low probability, non-deterministic failures. They do not reproduce when I rerun, so the only way to see this happen is to run the parallel test framework so that we can thrash the machine as hard as possible.

If you're able to look at this now, I've created a set of frozen processes you can look at: c0001 process 2135465, for example. (There are several frozen processes on the node, you can ps aux | grep fuzzer to find them.)

Sample backtrace in a frozen process (looks like gdb got a bit confused with some of the arguments, but hopefully still mostly readable):

#8  0x00007f4aa44d5fd6 in __GI___assert_fail (assertion=assertion@entry=0x7f4aa64de782 "partial_invalidations.empty()", 
    file=file@entry=0x7f4aa64cd408 "/scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc", line=line@entry=21010, 
    function=function@entry=0x7f4aa64d9338 "void Legion::Internal::EquivalenceSet::apply_state(Legion::Internal::LegionMap<Legion::Internal::IndexSpaceExpression*, Legion::Internal::FieldMaskSet<Legion::Internal::LogicalView> >&, Legion::Intern"...) at assert.c:101
#9  0x00007f4aa5b51bc3 in Legion::Internal::EquivalenceSet::apply_state (this=<optimized out>, valid_updates=..., initialized_updates=..., 
    invalidated_updates=..., Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
reduction_updates=std::map with 0 elements, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
restricted_updates=std::map with 0 elements, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 

    released_updates=std::map with 0 elements, precondition_updates=0x0, anticondition_updates=0x0, postcondition_updates=0x0, 
    read_only_guard_updates=0x7f4a0c4034a0, reduction_fill_guard_updates=0x7f4a0c403510, applied_events=std::vector of length 0, capacity 0, 
    needs_lock=false, forward_to_owner=false, unpack_references=true, filter_invalidations=false)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_utilities.h:3159
#10 0x00007f4aa5b51c9b in Legion::Internal::EquivalenceSet::apply_state (this=0x7f4a2ab61ca0, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
valid_updates=std::map with 2 elements, 
    initialized_updates=..., invalidated_updates=..., Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
reduction_updates=std::map with 0 elements, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
restricted_updates=std::map with 0 elements, 
    Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
released_updates=std::map with 0 elements, precondition_updates=0x0, anticondition_updates=0x0, postcondition_updates=0x0, 
    read_only_guard_updates=0x7f4a0c4034a0, reduction_fill_guard_updates=0x7f4a0c403510, applied_events=std::vector of length 0, capacity 0, 
    needs_lock=true, forward_to_owner=false, unpack_references=true, filter_invalidations=false)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc:20910
#11 0x00007f4aa5b62fab in Legion::Internal::EquivalenceSet::unpack_state_and_apply (this=0x7f4a2ab61ca0, derez=..., source=0, 
    applied_events=std::vector of length 0, capacity 0, forward_to_owner=<optimized out>, filter_invalidations=<optimized out>)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc:20147
#12 0x00007f4aa5b63da6 in Legion::Internal::EquivalenceSet::handle_migration (derez=..., runtime=runtime@entry=0x55fa9e6b2370, 
    source=source@entry=0) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_analysis.cc:19744
#13 0x00007f4aa5fb1832 in Legion::Internal::Runtime::handle_equivalence_set_migration (this=this@entry=0x55fa9e6b2370, derez=..., 
    source=source@entry=0) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:25814
#14 0x00007f4aa603b067 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f4a2270d8a0, num_messages=1, runtime=0x55fa9e6b2370, 
    remote_address_space=0, args=0x7f4a18329730 "", arglen=600) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:13282
#15 0x00007f4aa603c391 in Legion::Internal::VirtualChannel::process_message (this=0x7f4a2270d8a0, args=0x7f4a18329714, arglen=620, 
    runtime=0x55fa9e6b2370, remote_address_space=0) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:12045
#16 0x00007f4aa603cb40 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f4a18329700, arglen=636, userdata=<optimized out>, 
    userlen=<optimized out>, p=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32345
#17 0x00007f4aa4cba555 in Realm::LocalTaskProcessor::execute_task (this=0x55fa9d4e7c10, func_id=4, task_args=...)
    at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#18 0x00007f4aa4d0c573 in Realm::Task::execute_on_processor (this=0x7f4a184ac4a0, p=...)
    at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
#19 0x00007f4aa4d0c62a in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>)
    at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1687
#20 0x00007f4aa4d0aab4 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fa9d4e7fe0)
    at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1158

To reproduce yourself, you'll have to run the parallel runner set to freeze:

salloc -n 1 -N 1 -c 40 -p cpu --exclusive
REALM_FREEZE_ON_ERROR=1 REALM_SYNTHETIC_CORE_MAP= ./runner.py --fuzzer=build/src/fuzzer -j40 -n100 -o1000 --extra='-ll:util 2 -ll:cpu 3' --launcher='srun -n 2 --overlap' -vv

Let this run for about 10 minutes. You can watch progress on htop. About 95% of the processes should eventually exit, leaving a set that are hung in execution. At this point you can pick a random process to attach to with a debugger.

Fuzzer version still the same at: https://github.com/StanfordLegion/fuzzer/commit/35b7448e945b831a84e7651bb12259e09b35028e

Legion version at: eceb1cd2ff26c55d17fc263a4ab020d4e0bdd4d4

lightsighter commented 1 week ago

Notes for myself: the partial_invalidations and invalidated_updates are identical.

(gdb) p this->partial_invalidations
$4 = {
  <Legion::Internal::LegionHeapify<Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false> >> = {<No data fields>}, 
  members of Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false>:
  entries = {
    single_entry = 0x7f4a1670cb70,
    multi_entries = 0x7f4a1670cb70
  },
  valid_fields = {
    <BitMaskHelp::Heapify<SSETLBitMask<512> >> = {<No data fields>}, 
    members of SSETLBitMask<512>:
    bits = {
      bit_vector = {2, 0, 0, 0, 0, 0, 0, 0}
    },
    sum_mask = 2
  },
  single = true
}
(gdb) p invalidated_updates
$5 = (Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false> &) @0x7f4a0c403430: {
  <Legion::Internal::LegionHeapify<Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false> >> = {<No data fields>}, 
  members of Legion::Internal::FieldMaskSet<Legion::Internal::IndexSpaceExpression, (Legion::Internal::AllocationType)104, false>:
  entries = {
    single_entry = 0x7f4a1670cb70,
    multi_entries = 0x7f4a1670cb70
  },
  valid_fields = {
    <BitMaskHelp::Heapify<SSETLBitMask<512> >> = {<No data fields>}, 
    members of SSETLBitMask<512>:
    bits = {
      bit_vector = {2, 0, 0, 0, 0, 0, 0, 0}
    },
    sum_mask = 2
  },
  single = true
}