StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
657 stars 146 forks source link

Fuzzer assertion in refine_node (with virtual mappings) #1659

Closed elliottslaughter closed 2 months ago

elliottslaughter commented 3 months ago

My new fuzzer is hitting an assertion. Note the failure is nondeterministic despite the fact that the fuzzer itself is deterministic and I am running in debug mode with -lg:inorder:

Assertion failed: ((current_sets == NULL) || (mask * current_sets->get_valid_mask()) || refine_current), function refine_node, file region_tree.inl, line 7311.

Backtrace:

* thread #8, stop reason = hit program assert
    frame #0: 0x00007ff8066d57a2 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff80670df30 libsystem_pthread.dylib`pthread_kill + 262
    frame #2: 0x00007ff80662ca49 libsystem_c.dylib`abort + 126
    frame #3: 0x00007ff80662bd30 libsystem_c.dylib`__assert_rtn + 314
  * frame #4: 0x0000000105eef0d0 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::refine_node(this=0x00007feac0f15ee0, rect=0x000070000ca6b2c0, mask=0x000070000ca6b170, refine_current=false) at region_tree.inl:7310:7
    frame #5: 0x0000000105edff51 liblegion.1.dylib`Legion::Internal::EqKDNode<1, long long>::initialize_set(this=0x00007feac0f15ee0, set=0x00007feb1201f200, rect=0x000070000ca6b2c0, mask=0x00006000018b47b0, local_shard=0, current=true) at region_tree.inl:6855:13
    frame #6: 0x0000000105edaf6d liblegion.1.dylib`Legion::Internal::EqKDSparse<1, long long>::initialize_set(this=0x00006000023b4ec0, set=0x00007feb1201f200, rect=0x000070000ca6b330, mask=0x00006000018b47b0, local_shard=0, current=true) at region_tree.inl:8695:18
    frame #7: 0x0000000105e114fb liblegion.1.dylib`Legion::Internal::IndexSpaceNodeT<1, long long>::initialize_equivalence_set_kd_tree(this=0x00007feb06016400, tree=0x00006000023b4ec0, set=0x00007feb1201f200, mask=0x00006000018b47b0, local_shard=0, current=true) at region_tree.inl:5154:25
    frame #8: 0x00000001053b39c7 liblegion.1.dylib`Legion::Internal::InnerContext::initialize_region_tree_contexts(this=0x00007feb06022800, clone_requirements=size=1, version_infos=size=1, unmap_events=size=1) at legion_context.cc:10859:34
    frame #9: 0x0000000105831dd7 liblegion.1.dylib`Legion::Internal::SingleTask::launch_task(this=0x00007feb12825000, inline_task=false) at legion_tasks.cc:4742:28
    frame #10: 0x000000010585809e liblegion.1.dylib`Legion::Internal::SliceTask::map_and_launch(this=0x00007feb12823c00) at legion_tasks.cc:11517:18
    frame #11: 0x00000001058379b6 liblegion.1.dylib`Legion::Internal::MultiTask::trigger_mapping(this=0x00007feb12823c00) at legion_tasks.cc:5646:15
    frame #12: 0x0000000105ba7482 liblegion.1.dylib`Legion::Internal::Runtime::legion_runtime_task(args=0x00007feac0f15de8, arglen=12, userdata=0x00006000034acac0, userlen=8, p=(id = 2089670227099910144)) at runtime.cc:32306:31
    frame #13: 0x0000000109e5b158 librealm.1.dylib`Realm::LocalTaskProcessor::execute_task(this=0x00007feb11928d30, func_id=4, task_args=0x000070000ca6c8e0) at proc_impl.cc:1176:5
    frame #14: 0x0000000109f01fd8 librealm.1.dylib`Realm::Task::execute_on_processor(this=0x00007feac0f15cb0, p=(id = 2089670227099910144)) at tasks.cc:326:40
    frame #15: 0x0000000109f08ccc librealm.1.dylib`Realm::KernelThreadTaskScheduler::execute_task(this=0x00007feb11928450, task=0x00007feac0f15cb0) at tasks.cc:1421:11
    frame #16: 0x0000000109f07355 librealm.1.dylib`Realm::ThreadedTaskScheduler::scheduler_loop(this=0x00007feb11928450) at tasks.cc:1160:6
    frame #17: 0x0000000109f07d5a librealm.1.dylib`Realm::ThreadedTaskScheduler::scheduler_loop_wlock(this=0x00007feb11928450) at tasks.cc:1272:5
    frame #18: 0x0000000109f1cb6c librealm.1.dylib`void Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock()>(obj=0x00007feb11928450) at threads.inl:97:5
    frame #19: 0x0000000109f20f18 librealm.1.dylib`Realm::KernelThread::pthread_entry(data=0x00006000001a0420) at threads.cc:831:5
    frame #20: 0x00007ff80670e202 libsystem_pthread.dylib`_pthread_start + 99
    frame #21: 0x00007ff806709bab libsystem_pthread.dylib`thread_start + 15

The trace that generates the assertion is quite short:

$ ./src/fuzzer -level 4,fuzz=2 -fuzz:ops 10 -fuzz:seed 0 -lg:leaks -lg:inorder
[0 - 7ff849d357c0]    0.000158 {4}{threads}: reservation ('dedicated worker (generic) #1') cannot be satisfied
[0 - 70000c0bf000]    0.078761 {3}{fuzz}: Fuzzer Configuration:
[0 - 70000c0bf000]    0.078783 {3}{fuzz}:   config.initial_seed = 0
[0 - 70000c0bf000]    0.078786 {3}{fuzz}:   config.region_tree_depth = 1
[0 - 70000c0bf000]    0.078788 {3}{fuzz}:   config.region_tree_width = 4
[0 - 70000c0bf000]    0.078790 {3}{fuzz}:   config.region_tree_branch_factor = 4
[0 - 70000c0bf000]    0.078795 {3}{fuzz}:   config.region_tree_num_fields = 4
[0 - 70000c0bf000]    0.078798 {3}{fuzz}:   config.num_ops = 10
[0 - 70000c0bf000]    0.132885 {2}{fuzz}: Operation: 0
[0 - 70000c0bf000]    0.132905 {2}{fuzz}:   Launch type: single task
[0 - 70000c0bf000]    0.132909 {2}{fuzz}:   Task ID: 1
[0 - 70000c0bf000]    0.132913 {2}{fuzz}:   Launch domain: <2>..<3>
[0 - 70000c0bf000]    0.132916 {2}{fuzz}:   Elide future return: 1
[0 - 70000c0bf000]    0.132918 {2}{fuzz}:   Field set: 0x1
[0 - 70000c0bf000]    0.132922 {2}{fuzz}:   Privilege: 0x1
[0 - 70000c0bf000]    0.132926 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(5,1),FieldSpace(1))
[0 - 70000c0bf000]    0.132929 {2}{fuzz}:   Shifting shard points by: 2
[0 - 70000c0bf000]    0.132933 {2}{fuzz}:   Task: 2
[0 - 70000c0bf000]    0.132936 {2}{fuzz}:     Shard point: (2)
[0 - 70000c0bf000]    0.135129 {2}{fuzz}:   Task: 3
[0 - 70000c0bf000]    0.135149 {2}{fuzz}:     Shard point: (3)
[0 - 70000c0bf000]    0.137861 {2}{fuzz}: Operation: 1
[0 - 70000c0bf000]    0.137876 {2}{fuzz}:   Launch type: index space
[0 - 70000c0bf000]    0.137880 {2}{fuzz}:   Task ID: 2
[0 - 70000c0bf000]    0.137884 {2}{fuzz}:   Launch domain: <1>..<2>
[0 - 70000c0bf000]    0.137887 {2}{fuzz}:   Elide future return: 1
[0 - 70000c0bf000]    0.137890 {2}{fuzz}:   Field set: 0x4
[0 - 70000c0bf000]    0.137894 {2}{fuzz}:   Privilege: 0x7
[0 - 70000c0bf000]    0.137896 {2}{fuzz}:   Projection: 1
[0 - 70000c0bf000]    0.137901 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(2,1),FieldSpace(1))
[0 - 70000c0bf000]    0.140866 {2}{fuzz}: Operation: 2
[0 - 70000c0bf000]    0.140878 {2}{fuzz}:   Launch type: index space
[0 - 70000c0bf000]    0.140882 {2}{fuzz}:   Task ID: 3
[0 - 70000c0bf000]    0.140885 {2}{fuzz}:   Launch domain: <1>..<3>
[0 - 70000c0bf000]    0.140888 {2}{fuzz}:   Scalar redop: 1048580
[0 - 70000c0bf000]    0.140891 {2}{fuzz}:     Ordered: 1
[0 - 70000c0bf000]    0.140893 {2}{fuzz}:   Elide future return: 1
[0 - 70000c0bf000]    0.140896 {2}{fuzz}:   Field set: 0x0
[0 - 70000c0bf000]    0.140898 {2}{fuzz}:   Privilege: 0x10000007
[0 - 70000c0bf000]    0.140901 {2}{fuzz}:   Projection: 0
[0 - 70000c0bf000]    0.140904 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(1,1),FieldSpace(1))
[0 - 70000c0bf000]    0.141547 {2}{fuzz}: Operation: 3
[0 - 70000c0bf000]    0.141563 {2}{fuzz}:   Launch type: index space
[0 - 70000c0bf000]    0.141567 {2}{fuzz}:   Task ID: 2
[0 - 70000c0bf000]    0.141570 {2}{fuzz}:   Launch domain: <0>..<3>
[0 - 70000c0bf000]    0.141573 {2}{fuzz}:   Elide future return: 1
[0 - 70000c0bf000]    0.141576 {2}{fuzz}:   Field set: 0x6
[0 - 70000c0bf000]    0.141581 {2}{fuzz}:   Privilege: 0x0
[0 - 70000c0bf000]    0.141583 {2}{fuzz}:   Projection: 0
[0 - 70000c0bf000]    0.141587 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(1,1),FieldSpace(1))
[0 - 70000c0bf000]    0.142309 {2}{fuzz}: Operation: 4
[0 - 70000c0bf000]    0.142321 {2}{fuzz}:   Launch type: single task
[0 - 70000c0bf000]    0.142324 {2}{fuzz}:   Task ID: 2
[0 - 70000c0bf000]    0.142327 {2}{fuzz}:   Launch domain: <0>..<0>
[0 - 70000c0bf000]    0.142330 {2}{fuzz}:   Elide future return: 1
[0 - 70000c0bf000]    0.142333 {2}{fuzz}:   Field set: 0x9
[0 - 70000c0bf000]    0.142337 {2}{fuzz}:   Privilege: 0x4
[0 - 70000c0bf000]    0.142339 {2}{fuzz}:   Region redop: 1048685
[0 - 70000c0bf000]    0.142343 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(2,1),FieldSpace(1))
[0 - 70000c0bf000]    0.142346 {2}{fuzz}:   Shifting shard points by: 1
[0 - 70000c0bf000]    0.142349 {2}{fuzz}:   Task: 0
[0 - 70000c0bf000]    0.142352 {2}{fuzz}:     Shard point: (0)
[0 - 70000c0bf000]    0.145289 {2}{fuzz}: Operation: 5
[0 - 70000c0bf000]    0.145298 {2}{fuzz}:   Launch type: index space
[0 - 70000c0bf000]    0.145301 {2}{fuzz}:   Task ID: 1
[0 - 70000c0bf000]    0.145304 {2}{fuzz}:   Launch domain: <0>..<1>
[0 - 70000c0bf000]    0.145307 {2}{fuzz}:   Scalar redop: 1048625
[0 - 70000c0bf000]    0.145309 {2}{fuzz}:     Ordered: 1
[0 - 70000c0bf000]    0.145312 {2}{fuzz}:   Elide future return: 0
[0 - 70000c0bf000]    0.145314 {2}{fuzz}:   Field set: 0x4
[0 - 70000c0bf000]    0.145318 {2}{fuzz}:   Privilege: 0x7
[0 - 70000c0bf000]    0.145320 {2}{fuzz}:   Projection: 0
[0 - 70000c0bf000]    0.145324 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(2,1),FieldSpace(1))
[0 - 70000c0bf000]    0.148133 {2}{fuzz}: Operation: 6
[0 - 70000c0bf000]    0.148148 {2}{fuzz}:   Launch type: index space
[0 - 70000c0bf000]    0.148151 {2}{fuzz}:   Task ID: 3
[0 - 70000c0bf000]    0.148155 {2}{fuzz}:   Launch domain: <2>..<3>
[0 - 70000c0bf000]    0.148157 {2}{fuzz}:   Scalar redop: 1048655
[0 - 70000c0bf000]    0.148160 {2}{fuzz}:     Ordered: 0
[0 - 70000c0bf000]    0.148162 {2}{fuzz}:   Elide future return: 0
[0 - 70000c0bf000]    0.148165 {2}{fuzz}:   Field set: 0xb
[0 - 70000c0bf000]    0.148170 {2}{fuzz}:   Privilege: 0x1
[0 - 70000c0bf000]    0.148172 {2}{fuzz}:   Projection: 2
[0 - 70000c0bf000]    0.148176 {2}{fuzz}:   Partition: LogicalPartition(1,IndexPartition(2,1),FieldSpace(1))
Assertion failed: ((current_sets == NULL) || (mask * current_sets->get_valid_mask()) || refine_current), function refine_node, file region_tree.inl, line 7311.

If you want to reproduce, you must run exactly the command line above, and use this exact commit of the fuzzer: https://github.com/StanfordLegion/fuzzer/commit/f8cf28f6450f1cdd2172f8170774653a2c3faf11

I'm on my Mac but in theory it should be possible to reproduce this anywhere.

elliottslaughter commented 3 months ago

Keeping in mind the fact that this one crashes, the Legion Spy logs that I am able to generate pass logical and physical verification. Attached here:

spy_0.log

lightsighter commented 3 months ago

This isn't a bug in the physical analysis (which is why Legion Spy is fine). This is (yet another) issue with virtual mappings. Fixing it will take a bit of work, but I at least understand it. If you turn off virtual mappings in the mapper then it should go away completely.

elliottslaughter commented 3 months ago

Ok, I have confirmed that this one goes away if I modify the fuzzer not to use any virtual mappings.

elliottslaughter commented 3 months ago

Worked around via https://github.com/StanfordLegion/fuzzer/commit/3d77d6c97241517056e1e05bb0ab46ff41a82c1e.

I will keep this issue open as there is an underlying Legion bug here, even if I can work around to avoid it.

lightsighter commented 3 months ago

Will be fixed with this: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1191

Note this will make virtual mapping initialization for read-only and reduction-only privileges considerably more expensive than they were before but it is necessary for correctness. (Although not as expensive as virtual mappings for read-write privileges which is easily another order of magnitude worse.)

elliottslaughter commented 3 months ago

Confirmed the fix with the original reproducer.

lightsighter commented 2 months ago

Fixed with the merge request above.