StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Fuzzer Legion Spy failure: missing reduction #1662

Closed: elliottslaughter closed this issue 2 months ago

elliottslaughter commented 3 months ago

The fuzzer is generating programs that fail Legion Spy. Example:

$ build/src/fuzzer -fuzz:seed 1 -fuzz:ops 41 -fuzz:skip 27 -level legion_spy=2 -logfile spy_%.log
$ pypy3 ./legion/tools/legion_spy.py -lpa spy_0.log
...
ERROR: Missing reduction from field Field 0 of instance Instance 0x400000000000000a of region requirement 0 of void_leaf
Traceback (most recent call last):
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 14650, in <module>
    main(temp_dir)
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 14600, in main
    state.perform_physical_analysis(physical_checks)
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 13701, in perform_physical_analysis
    if not top_task.perform_task_physical_verification(perform_checks):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 9302, in perform_task_physical_verification
    if not op.perform_op_physical_verification(perform_checks):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 8241, in perform_op_physical_verification
    if not point.op.perform_op_physical_verification(perform_checks, collective):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 8333, in perform_op_physical_verification
    if not self.verify_physical_requirement(index, req, perform_checks,
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 8222, in verify_physical_requirement
    if not req.logical_node.perform_physical_verification(depth, field,
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 3536, in perform_physical_verification
    return self.perform_physical_verification(depth, field, op, req, inst,
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 3541, in perform_physical_verification
    return self.parent.parent.perform_physical_verification(depth, field, op, req,
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 3548, in perform_physical_verification
    if not state.perform_physical_verification(op, req, inst, perform_checks, register_now):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 5483, in perform_physical_verification
    if not self.issue_update_copies(inst, op, req, perform_checks, error_str):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 5521, in issue_update_copies
    return traverser.verify(op, restricted)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elliott/Programming/Legion/fuzzer/legion/tools/legion_spy.py", line 5326, in verify
    assert False
AssertionError

Spy log: spy_0.log

Reproduce with this exact version of the Fuzzer: https://github.com/StanfordLegion/fuzzer/commit/5507f40ce17045c6c43d702d7cd8cc01dc72894b

lightsighter commented 3 months ago

Does this actually get the wrong answer or is it only Legion Spy complaining? I think there might be a bug in Legion Spy's verification of the application of multiple reduction epochs.

Aside: does the fuzzer even know when we get the wrong answer?

elliottslaughter commented 3 months ago

This version of the fuzzer does not do any checking. The initial plan was to use the fuzzer purely to generate "interesting" traces, and to rely entirely on Legion Spy to validate whether those traces are correct.

I could add some validation, but it would be inherently incomplete. Because the most interesting cases are the ones that are hardest to verify (and conversely, the ones I can easily verify are the ones it would be shocking for Legion to ever get wrong), I honestly don't know if this is worth the effort. Writing a complete validation would amount to a new implementation of Legion Spy, which I am not going to do.

lightsighter commented 3 months ago

I think there's an easy way to allow the fuzzer to verify the outcome without relying on Legion Spy. Since all the regions being used are small, you can just make shadow versions of them, inline map the shadow regions, and then inline execute the tasks on the shadow regions. Then you can diff the test regions against the shadow regions at the end.
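The shadow-region idea above can be sketched in a few lines. This is only an illustration in plain Python, not fuzzer or Legion code: the regions are modeled as lists, the operation kinds (`write`, `reduce`) and every function name here are hypothetical, and `run_dropping_reductions` is a deliberately broken stand-in for "the system under test" so that the final diff has something to catch.

```python
from typing import List, Tuple

# Hypothetical op encoding: (kind, index, value).
# "write" overwrites an element; "reduce" sums the value into it.
Op = Tuple[str, int, int]


def apply_op(region: List[int], op: Op) -> None:
    """Apply one operation to a region in place."""
    kind, index, value = op
    if kind == "write":
        region[index] = value
    elif kind == "reduce":
        region[index] += value
    else:
        raise ValueError(f"unknown op kind: {kind}")


def run_shadow(ops: List[Op], size: int) -> List[int]:
    """Reference result: execute every op sequentially on a shadow region."""
    shadow = [0] * size
    for op in ops:
        apply_op(shadow, op)
    return shadow


def run_dropping_reductions(ops: List[Op], size: int) -> List[int]:
    """Stand-in for a buggy system under test: it silently skips reductions."""
    region = [0] * size
    for op in ops:
        if op[0] != "reduce":
            apply_op(region, op)
    return region


def diff_regions(test: List[int], shadow: List[int]) -> List[int]:
    """Return the indices at which the test region disagrees with the shadow."""
    if len(test) != len(shadow):
        raise ValueError("regions must have the same size")
    return [i for i, (t, s) in enumerate(zip(test, shadow)) if t != s]


if __name__ == "__main__":
    ops = [("write", 0, 5), ("reduce", 0, 3), ("write", 1, 7)]
    shadow = run_shadow(ops, size=2)
    print(diff_regions(run_shadow(ops, size=2), shadow))               # no mismatches
    print(diff_regions(run_dropping_reductions(ops, size=2), shadow))  # index 0 differs
```

The point of the design is that the shadow copy is computed by the simplest possible sequential execution, so any disagreement at the end points at the system under test rather than at the checker.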

elliottslaughter commented 3 months ago

Good idea. I'll work on that.

lightsighter commented 3 months ago

It will help distinguish whether a given failure is a bug in Legion Spy or not.

FWIW: there's actually already a comment in Legion Spy about why this test case is failing to verify: https://gitlab.com/StanfordLegion/legion/-/blob/master/tools/legion_spy.py?ref_type=heads#L4963-4971

elliottslaughter commented 3 months ago

For what it's worth, I added validation and have been running extensive tests with it, and nothing I've run so far has produced an incorrect answer.

lightsighter commented 3 months ago

I actually think the runtime analysis that we're testing here is pretty solid. Most of what we're finding are little idiosyncrasies in some of the state machines inside the runtime, but they aren't exactly essential for correctness, more for performance. There are more corners of the runtime to explore, but this particular corner is well explored.

lightsighter commented 3 months ago

This Legion Spy bug should be fixed with: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1199

lightsighter commented 2 months ago

The merge request has been merged, and this test now validates.