Closed elliottslaughter closed 7 months ago
The first actual bug in the physical analysis that's been found: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1192
Note that adding -lg:inorder makes it stop reproducing, which is highly suspicious.
Not suspicious at all. It depends a lot on how read-only regions traverse the equivalence set look-up tree and end up making new equivalence sets.
Confirmed the fix on the original reproducer.
I'm not sure if this is the same issue or a different one, but in a branch with both fixinvalidation and fixvirtualinit merged together, I hit the same assert:
Assertion failed: ((lefts == NULL) || (mask * lefts->get_valid_mask())), function refine_node, file region_tree.inl, line 7304.
Command line:
./fuzzer -fuzz:seed 41 -fuzz:ops 11 -fuzz:skip 1 -level 4
Note this is a newer commit of the fuzzer: https://github.com/StanfordLegion/fuzzer/commit/3057d03ee7e7284337ca22515bef9ba02ce98f45
I would note that it does take longer to find crashes now, and I had to build Legion DEBUG with -g -O2 (and the application with -g -O2) in order to find them at any reasonable speed. Thus the backtrace may have some things optimized out:
* frame #4: 0x00000001029a2ad3 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::refine_node(Realm::Rect<1, long long> const&, SSETLBitMask<512u> const&, bool) (.cold.14) at region_tree.inl:7304:7 [opt]
frame #5: 0x000000010248f66e liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::refine_node(this=<unavailable>, rect=0x0000700006bf4a30, mask=0x0000700006bf46f0, refine_current=<unavailable>) at region_tree.inl:7304:7 [opt]
frame #6: 0x0000000102485ae9 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::compute_equivalence_sets(this=0x00007f7eff908e80, rect=0x0000700006bf4a30, mask=0x0000700006bf4bd0, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, new_subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:7124:17 [opt]
frame #7: 0x0000000102486b86 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::compute_equivalence_sets(this=0x00007f7eff905b30, rect=0x0000700006bf4dd0, mask=<unavailable>, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, new_subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:7210:20 [opt]
frame #8: 0x0000000102486b86 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::compute_equivalence_sets(this=0x00007f7f4f008c00, rect=0x0000700006bf5068, mask=<unavailable>, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, new_subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:7210:20 [opt]
frame #9: 0x000000010242838c liblegion.1.dylib`Legion::Internal::IndexSpaceNodeT<1, long long>::compute_equivalence_sets(this=<unavailable>, tree=<unavailable>, tree_lock=<unavailable>, mask=0x0000700006bf5540, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:5183:21 [opt]
frame #10: 0x0000000101edcd63 liblegion.1.dylib`Legion::Internal::InnerContext::compute_equivalence_sets(this=0x00007f7f43810800, req_index=<unavailable>, targets=size=1, target_spaces=size=1, creation_target_space=0, expr=0x00007f7f4f836200, mask=0x0000700006bf5540) at legion_context.cc:2224:13 [opt]
frame #11: 0x0000000101e7b476 liblegion.1.dylib`Legion::Internal::VersionManager::perform_versioning_analysis(this=0x00007f7f4f018190, context=0x00007f7f43810800, version_info=0x0000600003922980, region_node=0x00007f7efe818c00, version_mask=<unavailable>, op=0x00007f7efe823e00, index=0, parent_req_index=0, ready_events=size=1, output_region_ready=0x0000000000000000, collective_rendezvous=<unavailable>) at legion_analysis.cc:23832:28 [opt]
frame #12: 0x00000001022705cc liblegion.1.dylib`Legion::Internal::RegionTreeForest::perform_versioning_analysis(Legion::Internal::Operation*, unsigned int, Legion::RegionRequirement const&, Legion::Internal::VersionInfo&, std::__1::set<Legion::Internal::RtEvent, std::__1::less<Legion::Internal::RtEvent>, std::__1::allocator<Legion::Internal::RtEvent>>&, Legion::Internal::RtEvent*, bool) [inlined] Legion::Internal::RegionNode::perform_versioning_analysis(this=0x00007f7efe818c00, ctx=1, parent_ctx=0x00007f7f43810800, version_info=0x0000600003922980, mask=0x0000700006bf5630, op=0x00007f7efe823e00, index=0, parent_req_index=0, applied=size=1, output_region_ready=<unavailable>, collective_rendezvous=<unavailable>) at region_tree.cc:17354:15 [opt]
frame #13: 0x0000000102270570 liblegion.1.dylib`Legion::Internal::RegionTreeForest::perform_versioning_analysis(this=<unavailable>, op=0x00007f7efe823e00, index=0, req=<unavailable>, version_info=0x0000600003922980, ready_events=size=1, output_region_ready=0x0000000000000000, collective_rendezvous=<unavailable>) at region_tree.cc:1703:20 [opt]
frame #14: 0x000000010212ccaf liblegion.1.dylib`Legion::Internal::SingleTask::perform_versioning_analysis(this=0x00007f7efe823e00, post_mapper=false) at legion_tasks.cc:2683:28 [opt]
frame #15: 0x0000000102135796 liblegion.1.dylib`Legion::Internal::SingleTask::map_all_regions(this=0x00007f7efe823e00, must_epoch_op=0x0000000000000000, defer_args=0x0000000000000000) at legion_tasks.cc:4158:15 [opt]
frame #16: 0x0000000102146289 liblegion.1.dylib`Legion::Internal::PointTask::perform_mapping(this=0x00007f7efe823e00, must_epoch_owner=0x0000000000000000, args=0x0000000000000000) at legion_tasks.cc:7297:32 [opt]
frame #17: 0x000000010215ef41 liblegion.1.dylib`Legion::Internal::SliceTask::map_and_launch(this=0x00007f7efe816600) at legion_tasks.cc:11513:18 [opt]
frame #18: 0x000000010237fa62 liblegion.1.dylib`Legion::Internal::Runtime::legion_runtime_task(args=0x00007f7f4f01a3e8, arglen=<unavailable>, userdata=<unavailable>, userlen=<unavailable>, p=<unavailable>) at runtime.cc:32353:31 [opt]
frame #19: 0x00000001036df2f9 librealm.1.dylib`Realm::LocalTaskProcessor::execute_task(this=<unavailable>, func_id=4, task_args=0x0000700006bf5ca8) at proc_impl.cc:1176:5 [opt]
frame #20: 0x000000010371cc80 librealm.1.dylib`Realm::Task::execute_on_processor(this=0x00007f7f4f01a2b0, p=(id = 2089670227099910144)) at tasks.cc:326:40 [opt]
frame #21: 0x0000000103722f16 librealm.1.dylib`Realm::KernelThreadTaskScheduler::execute_task(this=<unavailable>, task=<unavailable>) at tasks.cc:1421:11 [opt]
frame #22: 0x0000000103721563 librealm.1.dylib`Realm::ThreadedTaskScheduler::scheduler_loop(this=0x00007f7f4e704800) at tasks.cc:1160:6 [opt]
Is this deterministic? I'm having trouble reproducing it.
Seems pretty deterministic on my Mac. Keep in mind I modified the runtime so that DEBUG mode builds with -g -O2 (and the application with the same flags).
I actually got it in debug mode first try without any optimizations on in Linux.
Fortunately this is not a real bug. I'm forgetting to tighten a mask at some point, but that only results in a performance issue (which the assertion is flagging) and not a correctness issue.
Pushed it as another commit to the fix invalidation branch since it falls under the same category:
https://gitlab.com/StanfordLegion/legion/-/commit/e81a670057f854c89e0c2b40dd6c9f66b3e6389b
Yes, it goes away now.
Here's a new failure mode in the same function:
fuzzer: /scratch/eslaught/fuzzer/legion/runtime/legion/region_tree.inl:7307: void Legion::Internal::EqKDNode<DIM, T>::refine_node(Legion::Rect<DIM, T>&, const FieldMask&, bool) [with int DIM = 1; T = long long int; Legion::Rect<DIM, T> = Realm::Rect<1, long long int>; Legion::Internal::FieldMask = SSETLBitMask<512>]: Assertion `(child_previous_below == NULL) || (mask * child_previous_below->get_valid_mask())' failed.
Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/a0c55df5bca2ebc4bf1d380beacfe2fd0faaa903
Command line:
./fuzzer -fuzz:seed 5221 -fuzz:ops 5 -level 4
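For readers following along: in Legion's bitmask classes, `operator*` is a disjointness test, so both failed assertions in refine_node are checking that `mask` shares no fields with the child's valid mask. Here is a toy Python model of that invariant; the class and bit values are illustrative only, not Legion's actual implementation:

```python
class FieldMask:
    """Toy model of Legion's FieldMask bit set. In Legion's bitmask
    classes, operator* reads as a disjointness test, which is how the
    failed assertion should be interpreted."""
    def __init__(self, bits=0):
        self.bits = bits

    def __mul__(self, other):
        # True when the two masks share no fields.
        return (self.bits & other.bits) == 0

    def get_valid_mask(self):
        return self

# Shape of the invariant the assertion enforces: the fields being
# refined must be disjoint from the fields already valid below.
mask = FieldMask(0b1001)
assert mask * FieldMask(0b0110)        # disjoint: assertion passes
assert not (mask * FieldMask(0b0011))  # overlap: this is what fired
```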
Running the fuzzer in debug mode I am no longer able to observe any failures with 10,000 tests on either my Mac or Sapling.
I spoke too soon. Using a longer trace length, I was able to produce a freeze. This is non-deterministic, so run it in a loop until it hangs. Each non-frozen run should take about 2 seconds (on Sapling).
i=0; while ./build/src/fuzzer -fuzz:seed 6108 -fuzz:ops 400 -ll:force_kthreads -level 4; do let i++; echo $i; done
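Since non-frozen runs take about 2 seconds, a harness with a timeout can flag the freeze automatically instead of waiting for a visual hang. This is a sketch of the shell loop above; the function name and the 30-second cutoff are my own choices, not from the thread:

```python
import subprocess
import sys

def run_until_hang(cmd, timeout_s=30.0, max_iters=10000):
    """Re-run `cmd` until it hangs (no exit within timeout_s, treated
    as a freeze) or exits nonzero. Returns the failing iteration index,
    or None if every iteration passed."""
    for i in range(max_iters):
        try:
            # subprocess.run kills the child before raising on timeout.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"iteration {i}: frozen (no exit within {timeout_s}s)")
            return i
        if result.returncode != 0:
            print(f"iteration {i}: exited with code {result.returncode}")
            return i
    return None

# Example, using the reproducer from above:
# run_until_hang(["./build/src/fuzzer", "-fuzz:seed", "6108",
#                 "-fuzz:ops", "400", "-ll:force_kthreads",
#                 "-level", "4"])
```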
There is a hung process right now on c0001, pid 1384742.
Backtraces: bt.log
A couple minutes later: bt2.log
Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/5018ce38491868eb743b2d25f1292af9e3db2b3b
You can kill the process. I see what is going wrong.
I ran the original reproducer 400 times and it did not freeze.
I ran 10,000 tests × 100 ops with no errors and no freezes.
I ran 10,000 tests × 1,000 ops with no errors and no freezes.
I guess I'll keep fuzzing, but for now this seems pretty solid.
There are more dimensions still to test: -ll:util > 1, but we can't do that until we fix #1666.
Is there a workaround I can apply? Blocking fence, at least?
I think running with -dm:exact_region would at least generate copies. What it wouldn't do is generate instances at varying depths in the region tree, but we can look at that next.
I think we'll probably want more variability than that. Mixing precise and imprecise use of regions on the same instances.
Is there a workaround I can apply? Blocking fence, at least?
Putting a mapping fence after all the tasks have been issued, and then issuing all the deletions after the mapping fence should be sufficient to work around the issue for now.
Actually just issuing all the deletions at the end is probably sufficient I think.
Ok, I worked around the issue by just not deleting fields eagerly.
With this, and temporarily disabling inner tasks again (to avoid tripping #1668), I am unable to induce any failures with either:
./runner.py --fuzzer build/src/fuzzer -j4 -n1000 -o100 --extra="-ll:util 2"
./runner.py --fuzzer build/src/fuzzer -j4 -n1000 -o100 --extra="-ll:util 2 -dm:exact"
That's on my laptop. I'll try on Sapling in a bit here to be sure.
Probably time to bust out gcov and see what we're not hitting.
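For context, a seed sweep like the runner.py invocations above can be sketched in a few lines. This is a hypothetical reconstruction, not runner.py's actual code; only the -fuzz:seed and -fuzz:ops flag spellings are taken from the commands in this thread:

```python
import concurrent.futures
import subprocess
import sys

def run_seed(fuzzer_cmd, seed, ops, extra):
    # One fuzzer invocation; flag spellings follow the commands above.
    cmd = list(fuzzer_cmd) + ["-fuzz:seed", str(seed),
                              "-fuzz:ops", str(ops)] + list(extra)
    return seed, subprocess.run(cmd, capture_output=True).returncode

def sweep(fuzzer_cmd, num_tests, ops, extra=(), jobs=4):
    """Sweep seeds 0..num_tests-1 across `jobs` workers; return the
    seeds whose runs exited nonzero."""
    failures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(run_seed, fuzzer_cmd, s, ops, extra)
                   for s in range(num_tests)]
        for fut in concurrent.futures.as_completed(futures):
            seed, rc = fut.result()
            if rc != 0:
                failures.append(seed)
    return sorted(failures)

# Rough equivalent of:
#   ./runner.py --fuzzer build/src/fuzzer -j4 -n1000 -o100 \
#       --extra="-ll:util 2"
# sweep(["build/src/fuzzer"], 1000, 100, extra=["-ll:util", "2"], jobs=4)
```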
I hit a new failure on Sapling:
PID 2711882 is currently frozen on c0001.
Backtrace:
#7 0x00007fc71a173729 in __assert_fail_base (fmt=0x7fc71a309588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fc71c0b9f67 "own_instance",
file=0x7fc71c0bdc20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc", line=3351, function=<optimized out>) at assert.c:92
#8 0x00007fc71a184fd6 in __GI___assert_fail (assertion=assertion@entry=0x7fc71c0b9f67 "own_instance", file=file@entry=0x7fc71c0bdc20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc",
line=line@entry=3351,
function=function@entry=0x7fc71c0ce6e0 "Legion::Internal::ApEvent Legion::Internal::FutureInstance::reduce_from(Legion::Internal::FutureInstance*, Legion::Internal::Operation*, Legion::ReductionOpID, const ReductionOp*, bool, Legion::Intern"...) at assert.c:101
#9 0x00007fc71bb6e8d2 in Legion::Internal::FutureInstance::reduce_from (this=0x7fc6e41dea00, source=source@entry=0x7fc6e4008a10, op=op@entry=0x7fc6e408b640, redop_id=1048644, redop=<optimized out>,
exclusive=exclusive@entry=false, precondition=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_types.h:2940
#10 0x00007fc71b993138 in Legion::Internal::MultiTask::fold_reduction_future (this=0x7fc6e408b490, instance=0x7fc6e4008a10, effects=...) at /usr/include/c++/9/bits/atomic_base.h:734
#11 0x00007fc71b9b9edb in Legion::Internal::IndexTask::reduce_future (this=0x7fc6e408b490, point=..., inst=0x7fc6e4008a10, effects=...)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:10260
#12 0x00007fc71b9ba0d1 in Legion::Internal::SliceTask::reduce_future (this=0x7fc6ec4aaa90, point=..., inst=0x7fc6e4008a10, effects=...)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11838
#13 0x00007fc71b9a67ad in Legion::Internal::SliceTask::handle_future (this=0x7fc6ec4aaa90, effects=..., point=..., instance=instance@entry=0x7fc6e4008a10, metadata=metadata@entry=0x0, metasize=0,
functor=0x0, future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11871
#14 0x00007fc71b9a6ae2 in Legion::Internal::PointTask::handle_post_execution (this=0x7fc6ec4cb650, instance=0x7fc6e4008a10, metadata=0x0, metasize=<optimized out>, functor=<optimized out>,
future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:7528
#15 0x00007fc71b76116a in Legion::Internal::LeafContext::post_end_task (this=0x7fc6ec2d9000, instance=<optimized out>, metadata=<optimized out>, metasize=<optimized out>,
callback_functor=<optimized out>, own_callback_functor=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:25106
#16 0x00007fc71b7de30f in Legion::Internal::InnerContext::process_post_end_tasks (this=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:9130
#17 0x00007fc71b7de5d1 in Legion::Internal::InnerContext::handle_post_end_task (args=args@entry=0x7fc6ec2d0530) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:12425
#18 0x00007fc71bb94af4 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fc6ec2d0530, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...)
at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32287
#19 0x00007fc71a90a955 in Realm::LocalTaskProcessor::execute_task (this=0x556a8ddbbbd0, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#20 0x00007fc71a95c683 in Realm::Task::execute_on_processor (this=0x7fc6ec2d03b0, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
#21 0x00007fc71a95c73a in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1687
#22 0x00007fc71a95abc4 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x556a8ddbbf60) at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1158
#23 0x00007fc71a963e9a in Realm::UserThread::uthread_entry () at /scratch/eslaught/fuzzer/legion/runtime/realm/threads.cc:1405
Command line:
build/src/fuzzer -fuzz:seed 5812 -fuzz:ops 1000 -ll:util 2 -dm:exact_region -level 4
Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/679604e2e67ebe311aacbbae298d19234d9e35f0
Fortunately I already know what that one is. It is an overzealous assertion. I updated it in copy_from but forgot to do the same in reduce_from:
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1196
It's much less frequent now but I still hit:
#7 0x00007f242a41e729 in __assert_fail_base (fmt=0x7f242a5b4588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=0x7f242c3797b8 "own_instance || (own_dst && external_allocation && (redop->sizeof_rhs < size))", file=0x7f242c368c20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc",
line=3354, function=<optimized out>) at assert.c:92
#8 0x00007f242a42ffd6 in __GI___assert_fail (assertion=assertion@entry=0x7f242c3797b8 "own_instance || (own_dst && external_allocation && (redop->sizeof_rhs < size))",
file=file@entry=0x7f242c368c20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc", line=line@entry=3354,
function=function@entry=0x7f242c3796e0 "Legion::Internal::ApEvent Legion::Internal::FutureInstance::reduce_from(Legion::Internal::FutureInstance*, Legion::Internal::Operation*, Legion::ReductionOpID, const ReductionOp*, bool, Legion::Intern"...) at assert.c:101
#9 0x00007f242be198e0 in Legion::Internal::FutureInstance::reduce_from (this=0x7f23f82dbcd0, source=source@entry=0x7f23f82dbdf0, op=op@entry=0x7f23f836a7f0, redop_id=1048584, redop=0x5590070b1390,
exclusive=exclusive@entry=false, precondition=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:3408
#10 0x00007f242bc3e158 in Legion::Internal::MultiTask::fold_reduction_future (this=0x7f23f836a640, instance=0x7f23f82dbdf0, effects=...) at /usr/include/c++/9/bits/atomic_base.h:734
#11 0x00007f242bc64efb in Legion::Internal::IndexTask::reduce_future (this=0x7f23f836a640, point=..., inst=0x7f23f82dbdf0, effects=...)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:10260
#12 0x00007f242bc650f1 in Legion::Internal::SliceTask::reduce_future (this=0x7f23fc060430, point=..., inst=0x7f23f82dbdf0, effects=...)
at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11838
#13 0x00007f242bc517cd in Legion::Internal::SliceTask::handle_future (this=0x7f23fc060430, effects=..., point=..., instance=instance@entry=0x7f23f82dbdf0, metadata=metadata@entry=0x0, metasize=0,
functor=0x0, future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11871
#14 0x00007f242bc51b02 in Legion::Internal::PointTask::handle_post_execution (this=0x7f23e8a77860, instance=0x7f23f82dbdf0, metadata=0x0, metasize=<optimized out>, functor=<optimized out>,
future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:7528
#15 0x00007f242ba0c18a in Legion::Internal::LeafContext::post_end_task (this=0x7f23fce6db70, instance=<optimized out>, metadata=<optimized out>, metasize=<optimized out>,
callback_functor=<optimized out>, own_callback_functor=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:25106
#16 0x00007f242ba8932f in Legion::Internal::InnerContext::process_post_end_tasks (this=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:9130
#17 0x00007f242ba895f1 in Legion::Internal::InnerContext::handle_post_end_task (args=args@entry=0x7f23fcf6d1e0) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:12425
#18 0x00007f242be3fae4 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f23fcf6d1e0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...)
at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32291
#19 0x00007f242abb5955 in Realm::LocalTaskProcessor::execute_task (this=0x5590070ba940, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#20 0x00007f242ac07683 in Realm::Task::execute_on_processor (this=0x7f23fcf6d060, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
You can take a look at process 3096943 on c0001.
Command line (note I had to run 100,000 tests to hit this):
build/src/fuzzer -fuzz:seed 44554 -fuzz:ops 1000 -ll:util 2 -ll:cpu 3 -dm:exact_region -level 4
I'm still on a locally modified version of https://github.com/StanfordLegion/fuzzer/commit/679604e2e67ebe311aacbbae298d19234d9e35f0 that disabled inner tasks (and thus virtual mappings).
Pull the fixfuturereduce branch and try again.
I finished the following without any errors:
./runner.py --fuzzer=build/src/fuzzer -j40 -n100000 -o1000 --extra='-ll:util 2 -ll:cpu 3 -dm:exact_region'
I'm about half-way through running 1M tests in the configuration above. No failures so far. I'll let it run overnight to be sure, but I think we got this one.
Completed the following without failures:
./runner.py --fuzzer=build/src/fuzzer -j40 -n1000000 -o1000 -s100000 --extra='-ll:util 2 -ll:cpu 3 -dm:exact_region'
I think this is different from the bug in #1659, but let me know if it is a duplicate.
Now, there should be no more virtual mappings anywhere in the fuzzer.
I am hitting this assert:
When reproducing, use this commit of the fuzzer: https://github.com/StanfordLegion/fuzzer/commit/3d77d6c97241517056e1e05bb0ab46ff41a82c1e
Command to reproduce:
Note that adding -lg:inorder makes it stop reproducing, which is highly suspicious. Backtrace: