StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Multi-node test failure on optimize_spmd_launch_twice.rg #243

Closed elliottslaughter closed 7 years ago

elliottslaughter commented 7 years ago

I'm seeing a test failure on optimize_spmd_launch_twice.rg. The test fails by producing an incorrect result. The failure is non-deterministic on sapling: I hit it on about 1 in 10 runs. However, it appears to be very reliable on Gitlab. It may be that oversubscribing the machine improves the odds of reproducing the bug.

Note:

Sample failure in Gitlab: https://gitlab.com/StanfordLegion/legion/builds/15192176

Steps to reproduce on Sapling. Run this on the head node. The following configuration has SPMD disabled.

./install.py --debug --gasnet
i=0; while LAUNCHER='mpirun -n 2 -H localhost --bind-to none -x REALM_SYNTHETIC_CORE_MAP= -x INCLUDE_PATH -x LD_LIBRARY_PATH -x TERRA_PATH' ./regent.py ./tests/regent/run_pass/optimize_spmd_launch_twice.rg -ll:cpu 2 -fflow-spmd 0; do let i++; echo $i; done

Sample bad output. Note the bad (zero) entry on the last line. The position of this bad entry is non-deterministic and varies between runs.

x 10000 0 0
x 10010 0 0
x 10020 0 0
x 10030 0 0
x 9669 -88 -747
x 10410 122 918
x 9676 -71 -867
x 0 21 680
assertion failed: test failed

Sample good output:

x 10000 0 0
x 10010 0 0
x 10020 0 0
x 10030 0 0
x 9669 -88 -747
x 10410 122 918
x 9676 -71 -867
x 10385 21 680
lightsighter commented 7 years ago

Here's an interesting assertion that I noticed while testing; I couldn't get a backtrace and still need to investigate more. @streichler, any ideas on this assertion? My guess is that it has something to do with remote copies.

/local/home/mebauer/legion_versioning/runtime/realm/operation.cc:425: void Realm::OperationTable::event_triggered(Realm::Event): Assertion `it != table.end()' failed.

lightsighter commented 7 years ago

I've narrowed down the cause of this bug, but I still don't understand it. It has to do with pruning operations from the set of 'current' and 'previous' epoch lists for a physical instance. Recall that (conceptually) we keep track of two lists of operations for every field of a physical instance: one for the current epoch of operations (operations which don't interfere with each other) and one for the previous epoch of operations. When a new operation comes along, it looks for operations that it interferes with in the current list. If it finds no operations, or doesn't interfere with all of them, then it does the same analysis on the previous list. If the operation interfered with all the operations in the current list, then we clear the previous list, make the current list the previous list, and add the operation to the (new) current list. Otherwise we just add the operation to the current list directly.

The source of the problem seems to be that as we're running along and doing these tests, we have an optimization that first checks whether the event for an operation in the list has already triggered. If we detect that it has (by calling 'has_triggered'), then we can eagerly prune the operation from the list without even testing for interference. Somehow this eager pruning of the lists is causing the non-deterministic failures. If I disable all the eager pruning of both lists then everything works fine. Surprisingly, eager pruning of either list individually also causes the bug to manifest (though disabling eager pruning of the previous list has a much greater effect on reducing non-determinism for this test). I'm completely at a loss as to why pruning at all causes any issues, since those operations have by definition already completed. Any ideas @elliottslaughter @streichler @magnatelee @TWarszawski @alexaiken?
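To make the description above concrete, here is a minimal, self-contained sketch of the bookkeeping. Every name in it (EpochLists, Op, ToyEvent, interferes_with, eager_prune) is a hypothetical stand-in rather than Legion's actual code; the has_triggered check marks where the eager pruning in question sits.

```cpp
// Hypothetical sketch of the per-field epoch bookkeeping described in the
// last two comments. None of these names come from Legion; they only show
// the shape of the algorithm and where the eager has_triggered pruning fits.
#include <cstdio>
#include <list>

struct ToyEvent {
  bool triggered = false;
  bool has_triggered() const { return triggered; }
};

struct Op {
  int id = 0;
  ToyEvent completion;  // completes when the operation finishes
};

// Stand-in for Legion's real dependence test (privilege/field overlap, etc.).
static bool interferes_with(const Op &newer, const Op &older) {
  (void)newer; (void)older;
  return true;  // pretend every pair of operations interferes
}

struct EpochLists {
  std::list<Op> current_epoch;   // ops that do not interfere with each other
  std::list<Op> previous_epoch;  // the epoch before that
  bool eager_prune = true;       // the optimization under suspicion

  void record_operation(const Op &op) {
    bool dominates = !current_epoch.empty();
    scan(current_epoch, op, &dominates);
    if (dominates) {
      // Interfered with everything in the current epoch: advance the epochs.
      previous_epoch.clear();
      previous_epoch.swap(current_epoch);
    } else {
      // Otherwise we must also look for dependences in the previous epoch.
      scan(previous_epoch, op, /*dominates=*/nullptr);
    }
    current_epoch.push_back(op);
  }

 private:
  void scan(std::list<Op> &epoch, const Op &op, bool *dominates) {
    for (auto it = epoch.begin(); it != epoch.end();) {
      if (eager_prune && it->completion.has_triggered()) {
        it = epoch.erase(it);  // eager pruning: drop without testing interference
        continue;
      }
      if (interferes_with(op, *it)) {
        // real code records an event/mapping dependence on *it here
      } else if (dominates) {
        *dominates = false;
      }
      ++it;
    }
  }
};

int main() {
  EpochLists lists;
  Op a; a.id = 1;
  Op b; b.id = 2;
  lists.record_operation(a);
  lists.record_operation(b);  // interferes with op 1, so the epochs advance
  std::printf("current=%zu previous=%zu\n",
              lists.current_epoch.size(), lists.previous_epoch.size());
  return 0;
}
```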

There are two other potential things going on here. First, I'm trusting that the 'has_triggered' implementation is correct. If there is a bug that causes 'has_triggered' to prematurely signal that an event has triggered, then that would explain all of the behavior. I think this is highly unlikely given our use of 'has_triggered' in other parts of the runtime that all seem to work, but it is worth mentioning. Second, it's possible that not pruning the lists is just creating more work for the runtime and causing a sufficient timing delay to prevent a race somewhere else from occurring. I haven't seen any evidence of this (e.g. disabling the pruning always makes the non-determinism go away without exception over ~10K tests), but it's still a possibility because I can't rule it out.

lightsighter commented 7 years ago

Another important observation: enabling/disabling the pruning seems to perfectly predict whether we see the non-determinism or not regardless of whether we run with optimizations on or off. The massive difference in timing with optimizations on/off would probably perturb a race condition if there was one, so I feel like this is the actual cause of the problem. I just want to understand why it is a problem before I go changing things.

lightsighter commented 7 years ago

I can kind of understand why we can't prune from the 'current' epoch list because it might influence when the current epoch list needs to become the previous epoch list. What I can't understand is why pruning from the previous epoch list causes problems. This is just completely baffling to me.

lightsighter commented 7 years ago

Ok, back to thinking it is a timing thing. Running with Legion Spy but keeping the pruning in, I have yet to see the non-deterministic behavior after 150 runs. Clearly the cost of logging for Legion Spy prevents the non-deterministic behavior from occurring even with the pruning in.

Another hypothesis: one unusual thing about this example is that it has a number of chained cross-node copies. I'm wondering if there is a race somewhere in the multi-node copy code that is causing the finish event to trigger before all the data is actually in place at the destination.

lightsighter commented 7 years ago

More evidence for a bug in Realm: the non-determinism occurs even when running with -lg:inorder to force in order execution of the Legion tasks (and all their copies) in strict program order.

lightsighter commented 7 years ago

Switching from index space task launches to individual task launches using -findex-launch 0 causes the non-determinism to go away regardless of whether we run with -lg:inorder or not. It's unclear if this is due to a timing difference or a bug in the index space launch code somewhere.

elliottslaughter commented 7 years ago

With -findex-launch 0, is the mapper even sending tasks to remote nodes? This test just uses the default mapper, and I don't think the default mapper does anything fancy here.

lightsighter commented 7 years ago

Yes, it's still spreading the tasks around, because I can see the copies across nodes in the event graph.

lightsighter commented 7 years ago

I've done a bunch more testing and ruled out a bunch of potential causes. I know that it is not caused by index close operations. I was able to confirm that the issue goes away with local mapping, suggesting that the problem involves some kind of meta-data movement (if all the meta-data is on one node then we don't have any issues). I then went and investigated the composite instances and how they migrate their view trees. They all seem to be doing the right thing and bringing in the right meta-data. That leaves the next suspect, which is migration of the materialized view meta-data for the specific physical instances being used. I will look into that tomorrow.

lightsighter commented 7 years ago

@streichler @jiazhihao Can one of you guys look at these dma log files from a good and a bad run and tell me if the same number of copy requests were generated?

dma_logs.tar.gz

lightsighter commented 7 years ago

I've confirmed that Legion is issuing all the appropriate copies in every failed run. That leaves two potential sources for the non-determinism: either Legion is not specifying the proper event dependences for the copies, which is causing a race, or there is something racing in the DMA system that causes the data not to be in the proper location, to arrive late, or who knows what.

lightsighter commented 7 years ago

Also happens with structured index spaces, so it has nothing to do with Realm bit masks. optimize_spmd_launch_twice.txt

lightsighter commented 7 years ago

@elliottslaughter Can you try to reproduce the bug with the olddma branch? I just ran 500 iterations without it reproducing for me in the olddma branch. In the master branch it usually reproduces within the first 10 tries.

lightsighter commented 7 years ago

@jiazhihao Using registered memory for all the instances instead of system memory significantly reduced the probability of the non-determinism happening (went from 1 in 5 to about 1 in 60), but it still did occur. This means the race is still there (either in the DMA system or the Legion runtime), just that the timing was affected by the change in mapping decisions.

elliottslaughter commented 7 years ago

It took longer, but I repro'd on the olddma branch after 625 iterations. I don't remember it taking so long before; maybe there was other activity on the machine or something.

lightsighter commented 7 years ago

This bug should be fixed with 3798b00. The moral of the story is that you can't ever trust Realm::merge_events to give you back a unique event. In fact, it may give you back one of the events that you asked it to merge if it is the only one that hasn't triggered yet. All of this I knew, but just didn't fully remember when writing some of the event merging code for composite instance views. Assigning back to @elliottslaughter to confirm the fix.
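For anyone who hasn't run into this before, here is a small self-contained toy model of the pitfall. It is not Realm code (the Event, merge_events, and table names below are stand-ins), but it mimics the latitude described above, where merge_events may hand back one of its inputs, and shows why treating the result as a fresh, unique key goes wrong.

```cpp
// Toy model (NOT Realm's implementation) of the contract described above:
// merge_events is free to return one of its *input* events when only one of
// them has not triggered yet, so the result cannot be used as a unique key.
#include <cstdio>
#include <map>
#include <vector>

struct Event {
  unsigned id;
  bool operator<(const Event &rhs) const { return id < rhs.id; }
};

static unsigned next_id = 1;
static std::map<Event, bool> triggered;  // toy global event state

Event create_event() { Event e{next_id++}; triggered[e] = false; return e; }
void trigger(Event e) { triggered[e] = true; }
bool has_triggered(Event e) { return triggered[e]; }

// Toy merge: if exactly one input is still pending, just return it directly.
Event merge_events(const std::vector<Event> &inputs) {
  std::vector<Event> pending;
  for (Event e : inputs)
    if (!has_triggered(e)) pending.push_back(e);
  if (pending.empty()) { Event done = create_event(); trigger(done); return done; }
  if (pending.size() == 1) return pending[0];  // NOT a fresh, unique event!
  return create_event();  // real impl would also wire the triggers together
}

int main() {
  Event a = create_event();
  Event b = create_event();
  trigger(a);  // 'a' has already finished

  Event merged = merge_events({a, b});

  // Buggy assumption: 'merged' uniquely names this merge, so it is safe to
  // use as a fresh key (e.g. in an operation table). Here it aliases 'b'.
  std::map<Event, const char *> table;
  table[b] = "operation waiting on b";
  bool unique = table.find(merged) == table.end();
  std::printf("merged is a unique key: %s\n", unique ? "yes" : "no");  // "no"
  return 0;
}
```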

elliottslaughter commented 7 years ago

I can't reproduce any more on sapling (after 600+ runs), but unfortunately I'm still seeing this failure in gitlab:

https://gitlab.com/StanfordLegion/legion/builds/15396524

elliottslaughter commented 7 years ago

There's some unrelated junk in that log output that I'll have to look at separately. For the purpose of this bug, search the log for "spmd_launch_twice".

elliottslaughter commented 7 years ago

Update: The bug with -fflow-spmd 0 appears to be gone. The crashes above are only happening with -fflow-spmd 1. They might be a different bug.

lightsighter commented 7 years ago

Here are the Legion Spy logs from a run that gets the wrong answer with -fflow-spmd 1. Legion Spy validates the runtime's execution. It does look weird because I don't see any of the compiler's normal generation of explicit copies into and out of the must-epoch launches.

bad_logs.tar.gz dataflow_main_2.pdf event_graph_main_2.pdf

magnatelee commented 7 years ago

@lightsighter The explicit copies into shards are now being done inside each shard. You can see from your event graph that each shard starts with some copies.

lightsighter commented 7 years ago

dataflowshard$893_10.pdf

@magnatelee So I see the copies in at the start of the shard tasks, but I don't see the copies back out to the original regions at the end of the shard tasks.

lightsighter commented 7 years ago

@magnatelee @elliottslaughter Also, how are you guys guaranteeing that all the copies-in happen before any copy-out from a different shard is done? Do you have explicit phase barriers for that? I don't think you can trust that such an ordering will be enforced by the data dependences inside of all the shards unless you explicitly analyze that.

magnatelee commented 7 years ago

@lightsighter The copies out of a must-epoch launch are inserted right after the launch and issued only when the partition bloated by the spmd transformation is not completely covered by any other partition. I guess in this test case the aliased partition that spmd chooses to bloat is completely covered by the primary partition. The ordering between the must-epoch tasks and the copy-outs afterwards is guaranteed by data dependence, and the copies into shards are explicitly synchronized with inlined phase barrier awaits at the beginning of the shards. I believe the placement and synchronization of copies are both deterministic and correct.

lightsighter commented 7 years ago

I still have questions. First, where are the copy-outs? I can't see any copy-outs in any of the shard tasks from any of the three must-epoch launches.

Second: there is no guarantee of ordering between the must-epoch launches. A shard from the second must-epoch launch can start running as soon as the processor it is mapped to allows it to start; this follows from simultaneous coherence. This means that the execution of shards from different must-epoch launches can be interleaved, and the only way to enforce ordering across shards from different must-epoch launches is with phase barriers. There is no dependence enforced by the runtime. You can see this in the event graph: the shards from all three must-epoch launches are interleaved with each other.

elliottslaughter commented 7 years ago

I have what I hope is a fix in aa44670aae2e91cf795344b2974da0119c3e65c0. The hack for now is to add execution fences before and after every must-epoch launch. In theory, it would be cleaner to use an inner task with exclusive coherence, but I'm concerned that approach would have a much higher risk of introducing performance bugs.
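For illustration, here is a hedged sketch of the shape of that workaround as it would look through Legion's C++ runtime API; the actual fix is in the Regent compiler (commit aa44670), and shard_task_id and the launch domain below are made-up placeholders, not names from the fix.

```cpp
// Hedged sketch of the workaround: bracket the must-epoch launch with
// execution fences so that shards from different launches cannot interleave.
#include "legion.h"
using namespace Legion;

void launch_shards_with_fences(Context ctx, Runtime *runtime,
                               const Domain &shard_domain,
                               TaskID shard_task_id /* placeholder task */) {
  // Fence before: everything issued earlier must finish before any shard starts.
  runtime->issue_execution_fence(ctx);

  MustEpochLauncher must_epoch;
  for (Domain::DomainPointIterator itr(shard_domain); itr; itr++) {
    TaskLauncher shard(shard_task_id, TaskArgument());
    must_epoch.add_single_task(itr.p, shard);
  }
  runtime->execute_must_epoch(ctx, must_epoch);

  // Fence after: nothing issued later (including the next must-epoch launch)
  // may start until every shard from this launch has finished.
  runtime->issue_execution_fence(ctx);
}
```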

elliottslaughter commented 7 years ago

This issue appears to be fixed; I have not seen this test fail on the last several commits in Gitlab.