StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Fuzzer assertion in refine_node (now with concrete mappings) #1663

Closed by elliottslaughter 3 months ago

elliottslaughter commented 3 months ago

I think this is different from the bug in #1659, but let me know if it is a duplicate.

Now, there should be no more virtual mappings anywhere in the fuzzer.

I am hitting this assert:

Assertion failed: ((lefts == NULL) || (mask * lefts->get_valid_mask())), function refine_node, file region_tree.inl, line 7304.

When reproducing, use this commit of the fuzzer: https://github.com/StanfordLegion/fuzzer/commit/3d77d6c97241517056e1e05bb0ab46ff41a82c1e

Command to reproduce:

./fuzzer -fuzz:seed 33 -fuzz:ops 5 -level 4,fuzz=2

Note that adding -lg:inorder makes it stop reproducing, which is highly suspicious.

Backtrace:

* thread #4, stop reason = hit program assert
    frame #0: 0x00007ff8066d57a2 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff80670df30 libsystem_pthread.dylib`pthread_kill + 262
    frame #2: 0x00007ff80662ca49 libsystem_c.dylib`abort + 126
    frame #3: 0x00007ff80662bd30 libsystem_c.dylib`__assert_rtn + 314
  * frame #4: 0x0000000105eeeec6 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::refine_node(this=0x00007f8d46d0d250, rect=0x000070000a331928, mask=0x000070000a331530, refine_current=true) at region_tree.inl:7304:7
    frame #5: 0x0000000105ee66a0 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::invalidate_tree(this=0x00007f8d46d0d250, rect=0x000070000a331928, mask=0x000070000a331ea0, runtime=0x00007f8d82014400, invalidated=size=0, move_to_previous=true, parent_all_previous=0x000070000a331e50) at region_tree.inl:8404:13
    frame #6: 0x0000000105ee721c liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::invalidate_tree(this=0x00007f8d46d0cb50, rect=0x000070000a332518, mask=0x000070000a332a90, runtime=0x00007f8d82014400, invalidated=size=0, move_to_previous=true, parent_all_previous=0x000070000a332a40) at region_tree.inl:8475:26
    frame #7: 0x0000000105ee721c liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::invalidate_tree(this=0x00007f8d46e04500, rect=0x000070000a332ec0, mask=0x00007f8d41f07d60, runtime=0x00007f8d82014400, invalidated=size=0, move_to_previous=true, parent_all_previous=0x0000000000000000) at region_tree.inl:8475:26
    frame #8: 0x0000000105e11c10 liblegion.1.dylib`Legion::Internal::IndexSpaceNodeT<1, long long>::invalidate_equivalence_set_kd_tree(this=0x00007f8d4701ae00, tree=0x00007f8d46e04500, tree_lock=0x0000600000a0c000, mask=0x00007f8d41f07d60, invalidated=size=0, move_to_previous=true) at region_tree.inl:5227:21
    frame #9: 0x0000000105377ca4 liblegion.1.dylib`Legion::Internal::InnerContext::refine_equivalence_sets(this=0x00007f8d8201c400, req_index=0, node=0x00007f8d4701ae00, refinement_mask=0x00007f8d41f07d60, applied_events=size=0, sharded=false) at legion_context.cc:3052:13
    frame #10: 0x00000001055b95db liblegion.1.dylib`Legion::Internal::RefinementOp::trigger_mapping(this=0x00007f8d41f07b30) at legion_ops.cc:12094:21
    frame #11: 0x0000000105ba7456 liblegion.1.dylib`Legion::Internal::Runtime::legion_runtime_task(args=0x00007f8d81f85458, arglen=12, userdata=0x0000600003400050, userlen=8, p=(id = 2089670227099910144)) at runtime.cc:32298:31
    frame #12: 0x0000000109e5b158 librealm.1.dylib`Realm::LocalTaskProcessor::execute_task(this=0x00007f8d81f78e30, func_id=4, task_args=0x000070000a3338e0) at proc_impl.cc:1176:5
    frame #13: 0x0000000109f01fd8 librealm.1.dylib`Realm::Task::execute_on_processor(this=0x00007f8d81f85320, p=(id = 2089670227099910144)) at tasks.cc:326:40
    frame #14: 0x0000000109f08ccc librealm.1.dylib`Realm::KernelThreadTaskScheduler::execute_task(this=0x00007f8d81f78550, task=0x00007f8d81f85320) at tasks.cc:1421:11
    frame #15: 0x0000000109f07355 librealm.1.dylib`Realm::ThreadedTaskScheduler::scheduler_loop(this=0x00007f8d81f78550) at tasks.cc:1160:6
    frame #16: 0x0000000109f07d5a librealm.1.dylib`Realm::ThreadedTaskScheduler::scheduler_loop_wlock(this=0x00007f8d81f78550) at tasks.cc:1272:5
    frame #17: 0x0000000109f1cb6c librealm.1.dylib`void Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock()>(obj=0x00007f8d81f78550) at threads.inl:97:5
lightsighter commented 3 months ago

The first actual bug in the physical analysis that's been found: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1192

lightsighter commented 3 months ago

Note that adding -lg:inorder makes it stop reproducing, which is highly suspicious.

Not suspicious at all. It depends a lot on how read-only regions traverse the equivalence set look-up tree and end up making new equivalence sets.

elliottslaughter commented 3 months ago

Confirmed the fix on the original reproducer.

elliottslaughter commented 3 months ago

I'm not sure if this is the same issue or a different one, but in a branch with both fixinvalidation and fixvirtualinit merged together, I hit the same assert:

Assertion failed: ((lefts == NULL) || (mask * lefts->get_valid_mask())), function refine_node, file region_tree.inl, line 7304.

Command line:

./fuzzer -fuzz:seed 41 -fuzz:ops 11 -fuzz:skip 1 -level 4

Note this is a newer commit of the fuzzer: https://github.com/StanfordLegion/fuzzer/commit/3057d03ee7e7284337ca22515bef9ba02ce98f45

I would note that it does take longer to find crashes now, and I had to build Legion in DEBUG mode with -g -O2 (and the application with the same flags) in order to find them at any reasonable speed. Thus the backtrace may have some things optimized out:

  * frame #4: 0x00000001029a2ad3 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::refine_node(Realm::Rect<1, long long> const&, SSETLBitMask<512u> const&, bool) (.cold.14) at region_tree.inl:7304:7 [opt]
    frame #5: 0x000000010248f66e liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::refine_node(this=<unavailable>, rect=0x0000700006bf4a30, mask=0x0000700006bf46f0, refine_current=<unavailable>) at region_tree.inl:7304:7 [opt]
    frame #6: 0x0000000102485ae9 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::compute_equivalence_sets(this=0x00007f7eff908e80, rect=0x0000700006bf4a30, mask=0x0000700006bf4bd0, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, new_subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:7124:17 [opt]
    frame #7: 0x0000000102486b86 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::compute_equivalence_sets(this=0x00007f7eff905b30, rect=0x0000700006bf4dd0, mask=<unavailable>, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, new_subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:7210:20 [opt]
    frame #8: 0x0000000102486b86 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::compute_equivalence_sets(this=0x00007f7f4f008c00, rect=0x0000700006bf5068, mask=<unavailable>, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, new_subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:7210:20 [opt]
    frame #9: 0x000000010242838c liblegion.1.dylib`Legion::Internal::IndexSpaceNodeT<1, long long>::compute_equivalence_sets(this=<unavailable>, tree=<unavailable>, tree_lock=<unavailable>, mask=0x0000700006bf5540, trackers=size=1, tracker_spaces=size=1, new_tracker_references=size=1, eq_sets=0x0000700006bf5250, pending_sets=size=0, subscriptions=0x0000700006bf52c0, to_create=0x0000700006bf5330, creation_rects=size=0, creation_srcs=size=0, remote_shard_rects=size=0, local_shard=0) at region_tree.inl:5183:21 [opt]
    frame #10: 0x0000000101edcd63 liblegion.1.dylib`Legion::Internal::InnerContext::compute_equivalence_sets(this=0x00007f7f43810800, req_index=<unavailable>, targets=size=1, target_spaces=size=1, creation_target_space=0, expr=0x00007f7f4f836200, mask=0x0000700006bf5540) at legion_context.cc:2224:13 [opt]
    frame #11: 0x0000000101e7b476 liblegion.1.dylib`Legion::Internal::VersionManager::perform_versioning_analysis(this=0x00007f7f4f018190, context=0x00007f7f43810800, version_info=0x0000600003922980, region_node=0x00007f7efe818c00, version_mask=<unavailable>, op=0x00007f7efe823e00, index=0, parent_req_index=0, ready_events=size=1, output_region_ready=0x0000000000000000, collective_rendezvous=<unavailable>) at legion_analysis.cc:23832:28 [opt]
    frame #12: 0x00000001022705cc liblegion.1.dylib`Legion::Internal::RegionTreeForest::perform_versioning_analysis(Legion::Internal::Operation*, unsigned int, Legion::RegionRequirement const&, Legion::Internal::VersionInfo&, std::__1::set<Legion::Internal::RtEvent, std::__1::less<Legion::Internal::RtEvent>, std::__1::allocator<Legion::Internal::RtEvent>>&, Legion::Internal::RtEvent*, bool) [inlined] Legion::Internal::RegionNode::perform_versioning_analysis(this=0x00007f7efe818c00, ctx=1, parent_ctx=0x00007f7f43810800, version_info=0x0000600003922980, mask=0x0000700006bf5630, op=0x00007f7efe823e00, index=0, parent_req_index=0, applied=size=1, output_region_ready=<unavailable>, collective_rendezvous=<unavailable>) at region_tree.cc:17354:15 [opt]
    frame #13: 0x0000000102270570 liblegion.1.dylib`Legion::Internal::RegionTreeForest::perform_versioning_analysis(this=<unavailable>, op=0x00007f7efe823e00, index=0, req=<unavailable>, version_info=0x0000600003922980, ready_events=size=1, output_region_ready=0x0000000000000000, collective_rendezvous=<unavailable>) at region_tree.cc:1703:20 [opt]
    frame #14: 0x000000010212ccaf liblegion.1.dylib`Legion::Internal::SingleTask::perform_versioning_analysis(this=0x00007f7efe823e00, post_mapper=false) at legion_tasks.cc:2683:28 [opt]
    frame #15: 0x0000000102135796 liblegion.1.dylib`Legion::Internal::SingleTask::map_all_regions(this=0x00007f7efe823e00, must_epoch_op=0x0000000000000000, defer_args=0x0000000000000000) at legion_tasks.cc:4158:15 [opt]
    frame #16: 0x0000000102146289 liblegion.1.dylib`Legion::Internal::PointTask::perform_mapping(this=0x00007f7efe823e00, must_epoch_owner=0x0000000000000000, args=0x0000000000000000) at legion_tasks.cc:7297:32 [opt]
    frame #17: 0x000000010215ef41 liblegion.1.dylib`Legion::Internal::SliceTask::map_and_launch(this=0x00007f7efe816600) at legion_tasks.cc:11513:18 [opt]
    frame #18: 0x000000010237fa62 liblegion.1.dylib`Legion::Internal::Runtime::legion_runtime_task(args=0x00007f7f4f01a3e8, arglen=<unavailable>, userdata=<unavailable>, userlen=<unavailable>, p=<unavailable>) at runtime.cc:32353:31 [opt]
    frame #19: 0x00000001036df2f9 librealm.1.dylib`Realm::LocalTaskProcessor::execute_task(this=<unavailable>, func_id=4, task_args=0x0000700006bf5ca8) at proc_impl.cc:1176:5 [opt]
    frame #20: 0x000000010371cc80 librealm.1.dylib`Realm::Task::execute_on_processor(this=0x00007f7f4f01a2b0, p=(id = 2089670227099910144)) at tasks.cc:326:40 [opt]
    frame #21: 0x0000000103722f16 librealm.1.dylib`Realm::KernelThreadTaskScheduler::execute_task(this=<unavailable>, task=<unavailable>) at tasks.cc:1421:11 [opt]
    frame #22: 0x0000000103721563 librealm.1.dylib`Realm::ThreadedTaskScheduler::scheduler_loop(this=0x00007f7f4e704800) at tasks.cc:1160:6 [opt]
lightsighter commented 3 months ago

Is this deterministic? I'm having trouble reproducing it.

elliottslaughter commented 3 months ago

Seems pretty deterministic on my Mac. Keep in mind I modified the runtime so that DEBUG mode builds with -g -O2 (and the application with the same flags).
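
For reference, a rough configure step (assuming the CMake build; the exact invocation is illustrative, not necessarily what I ran):

# Keep the DEBUG assertions but replace CMake's default Debug flags
# so the fuzzer runs at a reasonable speed.
cmake -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_CXX_FLAGS_DEBUG="-g -O2" \
      /path/to/legion
make -j8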

lightsighter commented 3 months ago

I actually got it in debug mode on the first try, without any optimizations, on Linux.

lightsighter commented 3 months ago

Fortunately this is not a real bug. I'm forgetting to tighten a mask at some point, but that only results in a performance issue (which the assertion is flagging) and not a correctness issue.

lightsighter commented 3 months ago

Pushed it as another commit to the fix invalidation branch since it falls under the same category:

https://gitlab.com/StanfordLegion/legion/-/commit/e81a670057f854c89e0c2b40dd6c9f66b3e6389b

elliottslaughter commented 3 months ago

Yes, it goes away now.

elliottslaughter commented 3 months ago

Here's a new failure mode in the same function:

fuzzer: /scratch/eslaught/fuzzer/legion/runtime/legion/region_tree.inl:7307: void Legion::Internal::EqKDNode<DIM, T>::refine_node(Legion::Rect<DIM, T>&, const FieldMask&, bool) [with int DIM = 1; T = long long int; Legion::Rect<DIM, T> = Realm::Rect<1, long long int>; Legion::Internal::FieldMask = SSETLBitMask<512>]: Assertion `(child_previous_below == NULL) || (mask * child_previous_below->get_valid_mask())' failed.

Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/a0c55df5bca2ebc4bf1d380beacfe2fd0faaa903

Command line:

./fuzzer -fuzz:seed 5221 -fuzz:ops 5 -level 4
lightsighter commented 3 months ago

Fixed with: https://gitlab.com/StanfordLegion/legion/-/commit/0b0bd05649a51ca36601f342919c949540cf22ac

elliottslaughter commented 3 months ago

Running the fuzzer in debug mode, I am no longer able to observe any failures with 10,000 tests on either my Mac or Sapling.

elliottslaughter commented 3 months ago

I spoke too soon. Using a longer trace length, I was able to produce a freeze. This is non-deterministic, so run it in a loop until it hangs. Each non-frozen run should take about 2 seconds (on Sapling).

i=0; while ./build/src/fuzzer -fuzz:seed 6108 -fuzz:ops 400 -ll:force_kthreads -level 4; do let i++; echo $i; done

There is a hung process right now on c0001 pid 1384742.

Backtraces: bt.log

A couple minutes later: bt2.log

Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/5018ce38491868eb743b2d25f1292af9e3db2b3b

lightsighter commented 3 months ago

You can kill the process. I see what is going wrong.

lightsighter commented 3 months ago

Should be fixed with: https://gitlab.com/StanfordLegion/legion/-/commit/86912ba8e254ce9795269465156c65579a647ca8

elliottslaughter commented 3 months ago

I ran the original reproducer 400 times and it did not freeze.

I ran 10,000 tests × 100 ops with no errors and no freezes.

I ran 10,000 tests × 1,000 ops with no errors and no freezes.

I guess I'll keep fuzzing, but for now this seems pretty solid.

lightsighter commented 3 months ago

There are more dimensions still to test:

  1. Given the way the default mapper works, we're really only testing the equivalence set discovery code and the dependence analysis on logical views here. We haven't really exercised the equivalence set logic at all if we're not issuing lots of copies. To exercise that, you're going to need the fuzzer to also fuzz mapping decisions and pick different instances for different tasks (see the sketch after this list).
  2. We need to turn on -ll:util > 1 but we can't do that until we fix #1666.
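
A minimal sketch of what fuzzing the mapping decisions could look like (the FuzzMapper class is hypothetical, not the fuzzer's actual mapper, and the DefaultMapper override signature is assumed from recent Legion versions). Randomizing the target memory makes different tasks land in different instances of the same region, which forces the copies needed to exercise the equivalence set logic:

// FuzzMapper (hypothetical): randomize instance placement so that
// tasks pick different instances and force copies between them.
#include <cstdlib>
#include <vector>
#include "legion.h"
#include "mappers/default_mapper.h"

using namespace Legion;
using namespace Legion::Mapping;

class FuzzMapper : public DefaultMapper {
public:
  FuzzMapper(MapperRuntime *rt, Machine machine, Processor local)
    : DefaultMapper(rt, machine, local) { }

  // Instead of the default policy's preferred memory, pick a random
  // memory visible to the target processor.
  virtual Memory default_policy_select_target_memory(
      MapperContext ctx, Processor target_proc,
      const RegionRequirement &req, MemoryConstraint mc) override
  {
    Machine::MemoryQuery visible(machine);
    visible.has_affinity_to(target_proc);
    std::vector<Memory> memories(visible.begin(), visible.end());
    if (memories.empty()) // fall back to the default policy
      return DefaultMapper::default_policy_select_target_memory(
          ctx, target_proc, req, mc);
    return memories[std::rand() % memories.size()];
  }
};

Registered with replace_default_mapper in a registration callback, this would make placement part of the fuzzed state (ideally seeded from -fuzz:seed rather than rand() so failures stay reproducible).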
elliottslaughter commented 3 months ago
  1. I think running with -dm:exact_region would at least generate copies. What it wouldn't do is generate instances at varying depths in the region tree, but we can look at that next.

  2. Is there a workaround I can apply? Blocking fence, at least?

lightsighter commented 3 months ago

I think running with -dm:exact_region would at least generate copies. What it wouldn't do is generate instances at varying depths in the region tree, but we can look at that next.

I think we'll probably want more variability than that: mixing precise and imprecise use of regions on the same instances.

Is there a workaround I can apply? Blocking fence, at least?

Putting a mapping fence after all the tasks have been issued, and then issuing all the deletions after the mapping fence, should be sufficient to work around the issue for now.
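
A minimal sketch of that workaround at the end of the top-level task (the names here are hypothetical; the fuzzer's actual structure differs):

// Workaround sketch: launch everything first, then fence, then delete.
#include <vector>
#include "legion.h"

using namespace Legion;

void finish_top_level_task(Runtime *runtime, Context ctx,
                           FieldAllocator &allocator,
                           const std::vector<FieldID> &fields_to_free)
{
  // ... all task launches have already been issued above ...

  // The mapping fence keeps the deletions from racing with the
  // mapping of the tasks issued before it.
  runtime->issue_mapping_fence(ctx);

  // The deletions now follow the fence in program order.
  for (FieldID fid : fields_to_free)
    allocator.free_field(fid);
}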

lightsighter commented 3 months ago

Actually, just issuing all the deletions at the end is probably sufficient, I think.

elliottslaughter commented 3 months ago

Ok, I worked around the issue by just not deleting fields eagerly.

With this, and temporarily disabling inner tasks again (to avoid tripping #1668), I am unable to induce any failures with either:

./runner.py --fuzzer build/src/fuzzer -j4 -n1000 -o100 --extra="-ll:util 2"
./runner.py --fuzzer build/src/fuzzer -j4 -n1000 -o100 --extra="-ll:util 2 -dm:exact"

That's on my laptop. I'll try on Sapling in a bit here to be sure.

lightsighter commented 3 months ago

Probably time to bust out gcov and see what we're not hitting.

elliottslaughter commented 3 months ago

I hit a new failure on Sapling:

PID 2711882 is currently frozen on c0001.

Backtrace:

#7  0x00007fc71a173729 in __assert_fail_base (fmt=0x7fc71a309588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fc71c0b9f67 "own_instance", 
    file=0x7fc71c0bdc20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc", line=3351, function=<optimized out>) at assert.c:92
#8  0x00007fc71a184fd6 in __GI___assert_fail (assertion=assertion@entry=0x7fc71c0b9f67 "own_instance", file=file@entry=0x7fc71c0bdc20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc", 
    line=line@entry=3351, 
    function=function@entry=0x7fc71c0ce6e0 "Legion::Internal::ApEvent Legion::Internal::FutureInstance::reduce_from(Legion::Internal::FutureInstance*, Legion::Internal::Operation*, Legion::ReductionOpID, const ReductionOp*, bool, Legion::Intern"...) at assert.c:101
#9  0x00007fc71bb6e8d2 in Legion::Internal::FutureInstance::reduce_from (this=0x7fc6e41dea00, source=source@entry=0x7fc6e4008a10, op=op@entry=0x7fc6e408b640, redop_id=1048644, redop=<optimized out>, 
    exclusive=exclusive@entry=false, precondition=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_types.h:2940
#10 0x00007fc71b993138 in Legion::Internal::MultiTask::fold_reduction_future (this=0x7fc6e408b490, instance=0x7fc6e4008a10, effects=...) at /usr/include/c++/9/bits/atomic_base.h:734
#11 0x00007fc71b9b9edb in Legion::Internal::IndexTask::reduce_future (this=0x7fc6e408b490, point=..., inst=0x7fc6e4008a10, effects=...)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:10260
#12 0x00007fc71b9ba0d1 in Legion::Internal::SliceTask::reduce_future (this=0x7fc6ec4aaa90, point=..., inst=0x7fc6e4008a10, effects=...)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11838
#13 0x00007fc71b9a67ad in Legion::Internal::SliceTask::handle_future (this=0x7fc6ec4aaa90, effects=..., point=..., instance=instance@entry=0x7fc6e4008a10, metadata=metadata@entry=0x0, metasize=0, 
    functor=0x0, future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11871
#14 0x00007fc71b9a6ae2 in Legion::Internal::PointTask::handle_post_execution (this=0x7fc6ec4cb650, instance=0x7fc6e4008a10, metadata=0x0, metasize=<optimized out>, functor=<optimized out>, 
    future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:7528
#15 0x00007fc71b76116a in Legion::Internal::LeafContext::post_end_task (this=0x7fc6ec2d9000, instance=<optimized out>, metadata=<optimized out>, metasize=<optimized out>, 
    callback_functor=<optimized out>, own_callback_functor=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:25106
#16 0x00007fc71b7de30f in Legion::Internal::InnerContext::process_post_end_tasks (this=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:9130
#17 0x00007fc71b7de5d1 in Legion::Internal::InnerContext::handle_post_end_task (args=args@entry=0x7fc6ec2d0530) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:12425
#18 0x00007fc71bb94af4 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fc6ec2d0530, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32287
#19 0x00007fc71a90a955 in Realm::LocalTaskProcessor::execute_task (this=0x556a8ddbbbd0, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#20 0x00007fc71a95c683 in Realm::Task::execute_on_processor (this=0x7fc6ec2d03b0, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39
#21 0x00007fc71a95c73a in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1687
#22 0x00007fc71a95abc4 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x556a8ddbbf60) at /scratch/eslaught/fuzzer/legion/runtime/realm/tasks.cc:1158
#23 0x00007fc71a963e9a in Realm::UserThread::uthread_entry () at /scratch/eslaught/fuzzer/legion/runtime/realm/threads.cc:1405

Command line:

build/src/fuzzer -fuzz:seed 5812 -fuzz:ops 1000 -ll:util 2 -dm:exact_region -level 4

Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/679604e2e67ebe311aacbbae298d19234d9e35f0

lightsighter commented 3 months ago

Fortunately, I already know what that one is. It is an overzealous assertion. I updated it in copy_from but forgot to do the same in reduce_from: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1196

elliottslaughter commented 3 months ago

It's much less frequent now but I still hit:

#7  0x00007f242a41e729 in __assert_fail_base (fmt=0x7f242a5b4588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x7f242c3797b8 "own_instance || (own_dst && external_allocation && (redop->sizeof_rhs < size))", file=0x7f242c368c20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc", 
    line=3354, function=<optimized out>) at assert.c:92
#8  0x00007f242a42ffd6 in __GI___assert_fail (assertion=assertion@entry=0x7f242c3797b8 "own_instance || (own_dst && external_allocation && (redop->sizeof_rhs < size))", 
    file=file@entry=0x7f242c368c20 "/scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc", line=line@entry=3354, 
    function=function@entry=0x7f242c3796e0 "Legion::Internal::ApEvent Legion::Internal::FutureInstance::reduce_from(Legion::Internal::FutureInstance*, Legion::Internal::Operation*, Legion::ReductionOpID, const ReductionOp*, bool, Legion::Intern"...) at assert.c:101
#9  0x00007f242be198e0 in Legion::Internal::FutureInstance::reduce_from (this=0x7f23f82dbcd0, source=source@entry=0x7f23f82dbdf0, op=op@entry=0x7f23f836a7f0, redop_id=1048584, redop=0x5590070b1390, 
    exclusive=exclusive@entry=false, precondition=...) at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:3408
#10 0x00007f242bc3e158 in Legion::Internal::MultiTask::fold_reduction_future (this=0x7f23f836a640, instance=0x7f23f82dbdf0, effects=...) at /usr/include/c++/9/bits/atomic_base.h:734
#11 0x00007f242bc64efb in Legion::Internal::IndexTask::reduce_future (this=0x7f23f836a640, point=..., inst=0x7f23f82dbdf0, effects=...)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:10260
#12 0x00007f242bc650f1 in Legion::Internal::SliceTask::reduce_future (this=0x7f23fc060430, point=..., inst=0x7f23f82dbdf0, effects=...)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11838
#13 0x00007f242bc517cd in Legion::Internal::SliceTask::handle_future (this=0x7f23fc060430, effects=..., point=..., instance=instance@entry=0x7f23f82dbdf0, metadata=metadata@entry=0x0, metasize=0, 
    functor=0x0, future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:11871
#14 0x00007f242bc51b02 in Legion::Internal::PointTask::handle_post_execution (this=0x7f23e8a77860, instance=0x7f23f82dbdf0, metadata=0x0, metasize=<optimized out>, functor=<optimized out>, 
    future_proc=..., own_functor=false) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_tasks.cc:7528
#15 0x00007f242ba0c18a in Legion::Internal::LeafContext::post_end_task (this=0x7f23fce6db70, instance=<optimized out>, metadata=<optimized out>, metasize=<optimized out>, 
    callback_functor=<optimized out>, own_callback_functor=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:25106
#16 0x00007f242ba8932f in Legion::Internal::InnerContext::process_post_end_tasks (this=<optimized out>) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:9130
#17 0x00007f242ba895f1 in Legion::Internal::InnerContext::handle_post_end_task (args=args@entry=0x7f23fcf6d1e0) at /scratch/eslaught/fuzzer/legion/runtime/legion/legion_context.cc:12425
#18 0x00007f242be3fae4 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f23fcf6d1e0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /scratch/eslaught/fuzzer/legion/runtime/legion/runtime.cc:32291
#19 0x00007f242abb5955 in Realm::LocalTaskProcessor::execute_task (this=0x5590070ba940, func_id=4, task_args=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:150
#20 0x00007f242ac07683 in Realm::Task::execute_on_processor (this=0x7f23fcf6d060, p=...) at /scratch/eslaught/fuzzer/legion/runtime/realm/bytearray.inl:39

You can take a look at process 3096943 on c0001.

Command line (note I had to run 100,000 tests to hit this):

build/src/fuzzer -fuzz:seed 44554 -fuzz:ops 1000 -ll:util 2 -ll:cpu 3 -dm:exact_region -level 4

I'm still on a locally modified version of https://github.com/StanfordLegion/fuzzer/commit/679604e2e67ebe311aacbbae298d19234d9e35f0 that disables inner tasks (and thus virtual mappings).

lightsighter commented 3 months ago

Pull the fixfuturereduce branch and try again.

elliottslaughter commented 3 months ago

I finished the following without any errors:

./runner.py --fuzzer=build/src/fuzzer -j40 -n100000 -o1000 --extra='-ll:util 2 -ll:cpu 3 -dm:exact_region'
elliottslaughter commented 3 months ago

I'm about half-way through running 1M tests in the configuration above. No failures so far. I'll let it run overnight to be sure, but I think we got this one.

elliottslaughter commented 3 months ago

Completed the following without failures:

./runner.py --fuzzer=build/src/fuzzer -j40 -n1000000 -o1000 -s100000 --extra='-ll:util 2 -ll:cpu 3 -dm:exact_region'