Open elliottslaughter opened 7 years ago
This is actually a bug in Legion Spy and not the runtime so I'm going to tag @TWarszawski and myself for visibility. We won't fix this right now, but we know we need to do work to improve Legion Spy. As an example of the bug here is an error message. The dataflow path does exist, it just goes through the serial_stencil task which hasn't been analyzed yet by Legion Spy.
Performing physical verification analysis for stencil.parallelized (UID 130)... ERROR: No dataflow path found to update field f (101) of instance Instance 0x600000000000000b of region requirement 3 of stencil.parallelized
Also worth noting that the non-determinism comes from the order in which Legion maps non-interfering tasks. Some out-of-order mapping causes graphs that Legion Spy can handle while others do not. In order mapping always generates a graph that is validated with Legion Spy which is to be expected.
Multi-node runs with Legion spy have been removed from CI testing with commit c610d1e56192a7b44e75edd800d0969edff2dcb7. Once this is fixed, those tests should be re-enabled.
(Don't just revert that commit, because we don't want the allow_failure: true
part to come back.)
This test is still failing, but Legion Spy validates it's execution when it does fail (e.g. when I explicitly remove the assertion, but still print out the failure). Interestingly when I turn on bounds checks I get a failure:
mebauer@sapling:~/legion/language$ LAUNCHER='mpirun -n 2 --bind-to none -x REALM_SYNTHETIC_CORE_MAP= -x INCLUDE_PATH -x LD_LIBRARY_PATH -x TERRA_PATH' ./regent.py ./tests/regent/run_pass/parallelize_tasksimage2.rg -ll:cpu 4 -fparallelize-dop 3,3 -fbounds-checks 1 -logfile spy1%.log Errors reported during runtime. :0: pointer int2d(fs1(), $__ghost1) is out-of-bounds
@magnatelee Is this a bug in the parallelizer in the master branch? Should I try I different branch?
I believe we should be able to turn multi-node Legion Spy tests back on as they all should be passing. We should weigh the cost of doing this though as they are likely to be more expensive than the single node verifications.
@lightsighter This is a bug in the parallelizer. The parallelizer currently uses bounding boxes for both constructing ghost regions and determining whether ghost regions should be used for stencil accesses. This clearly doesn't work when ghost regions are sparse, which is the case with the flag -fparallelize-dop 3,3
. The new parallelizer doesn't have this issue as it never uses bounding boxes, and I confirmed that the same test case passes the checks (with some modifications to make it compatible with the new parallelizer). I'll move this issue to the next milestone and close the issue once the new parallelizer is eventually merged.
The other thing worth discussing on this issue is whether we should turn on Legion Spy verification now for multi-node runs as it should be passing. If we do turn it on we should probably run it infrequently as it is much more expensive in the multi-node cases.
Multi-node Legion Spy is not enabled again in the master branch with this merge request: https://gitlab.com/StanfordLegion/legion/merge_requests/170 That means this issue is only about the parallelizer.
This appears to be a non-deterministic failure in Legion Spy on multi-node. Here are the commands to reproduce:
To reproduce, I recommend running multiple of these in parallel. Be sure to change the
spy1_%.log
andspy1_*.log
to 2, 3, etc. otherwise the different parallel runs will interfere with each other.When it fails, the error looks like: