StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
680 stars 145 forks source link

Multi-node spy failure in parallelize_tasks_image2.rg #247

Open elliottslaughter opened 7 years ago

elliottslaughter commented 7 years ago

This appears to be a non-deterministic failure in Legion Spy on multi-node. Here are the commands to reproduce:

./install.py --debug --gasnet --spy
i=0; while LAUNCHER='mpirun -n 2 -H localhost --bind-to none -x REALM_SYNTHETIC_CORE_MAP= -x INCLUDE_PATH -x LD_LIBRARY_PATH -x TERRA_PATH' ./regent.py ./tests/regent/run_pass/parallelize_tasks_image2.rg -ll:cpu 4 -fparallelize-dop 3,3 -logfile spy1_%.log && ~/legion/tools/legion_spy.py -lpa spy1_*.log; do echo $i; let i++; done

To reproduce, I recommend running multiple of these in parallel. Be sure to change the spy1_%.log and spy1_*.log to 2, 3, etc. otherwise the different parallel runs will interfere with each other.

When it fails, the error looks like:

ERROR: No dataflow path found to update field f (101) of instance Instance 0x600000000000000f of region requirement 0 of stencil.parallelized
lightsighter commented 7 years ago

This is actually a bug in Legion Spy and not the runtime so I'm going to tag @TWarszawski and myself for visibility. We won't fix this right now, but we know we need to do work to improve Legion Spy. As an example of the bug here is an error message. The dataflow path does exist, it just goes through the serial_stencil task which hasn't been analyzed yet by Legion Spy.

Performing physical verification analysis for stencil.parallelized (UID 130)... ERROR: No dataflow path found to update field f (101) of instance Instance 0x600000000000000b of region requirement 3 of stencil.parallelized

event_graph_toplevel_2.pdf log1_0.txt log1_1.txt

lightsighter commented 7 years ago

Also worth noting that the non-determinism comes from the order in which Legion maps non-interfering tasks. Some out-of-order mapping causes graphs that Legion Spy can handle while others do not. In order mapping always generates a graph that is validated with Legion Spy which is to be expected.

streichler commented 7 years ago

Multi-node runs with Legion spy have been removed from CI testing with commit c610d1e56192a7b44e75edd800d0969edff2dcb7. Once this is fixed, those tests should be re-enabled. (Don't just revert that commit, because we don't want the allow_failure: true part to come back.)

lightsighter commented 5 years ago

This test is still failing, but Legion Spy validates it's execution when it does fail (e.g. when I explicitly remove the assertion, but still print out the failure). Interestingly when I turn on bounds checks I get a failure:

mebauer@sapling:~/legion/language$ LAUNCHER='mpirun -n 2 --bind-to none -x REALM_SYNTHETIC_CORE_MAP= -x INCLUDE_PATH -x LD_LIBRARY_PATH -x TERRA_PATH' ./regent.py ./tests/regent/run_pass/parallelize_tasksimage2.rg -ll:cpu 4 -fparallelize-dop 3,3 -fbounds-checks 1 -logfile spy1%.log Errors reported during runtime. :0: pointer int2d(fs1(), $__ghost1) is out-of-bounds

@magnatelee Is this a bug in the parallelizer in the master branch? Should I try I different branch?

I believe we should be able to turn multi-node Legion Spy tests back on as they all should be passing. We should weigh the cost of doing this though as they are likely to be more expensive than the single node verifications.

magnatelee commented 4 years ago

@lightsighter This is a bug in the parallelizer. The parallelizer currently uses bounding boxes for both constructing ghost regions and determining whether ghost regions should be used for stencil accesses. This clearly doesn't work when ghost regions are sparse, which is the case with the flag -fparallelize-dop 3,3. The new parallelizer doesn't have this issue as it never uses bounding boxes, and I confirmed that the same test case passes the checks (with some modifications to make it compatible with the new parallelizer). I'll move this issue to the next milestone and close the issue once the new parallelizer is eventually merged.

lightsighter commented 4 years ago

The other thing worth discussing on this issue is whether we should turn on Legion Spy verification now for multi-node runs as it should be passing. If we do turn it on we should probably run it infrequently as it is much more expensive in the multi-node cases.

lightsighter commented 4 years ago

Multi-node Legion Spy is not enabled again in the master branch with this merge request: https://gitlab.com/StanfordLegion/legion/merge_requests/170 That means this issue is only about the parallelizer.