StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Fuzzer: nondeterministic data corruption in DCR mode #1765

Closed by elliottslaughter 1 month ago

elliottslaughter commented 1 month ago

I'm not sure if this is related to https://gitlab.com/StanfordLegion/legion/-/merge_requests/1473 but I'm running on that branch and seeing a data corruption bug:

$ salloc -n 1 -N 1 -c 40 -p all --exclusive
$ cd /scratch/eslaught/fuzzer-experiment-7-fix
$ i=0; while srun -n 2 --ntasks-per-node 2 ./build_debug_multi/src/fuzzer -fuzz:seed 22319258843034 -fuzz:ops 1000 -fuzz:skip 507 -ll:util 2 -ll:cpu 3 -fuzz:replicate 1 -lg:safe_ctrlrepl 1 -level 4; do let i++; echo $i; done
[0 - 7f4227104c40]    0.992555 {6}{fuzz}: Bad region value: 932941568, expected: 2181883299
[1 - 7ff146313c40]    0.992558 {6}{fuzz}: Bad region value: 932941568, expected: 2181883299

I'm doing a build with Legion Spy now so that I can compare what it says.

Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/3ef4c19266907eee6c5df86d9dc25b79b47f2d4b

Legion version: d8762439a692bb0492a3febb51bbf6e3c719d95c (from https://gitlab.com/StanfordLegion/legion/-/merge_requests/1473)
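
The reproduction loop above keeps rerunning the fuzzer while it exits successfully and stops at the first failing run. As a rough Python sketch of the same pattern (the helper name `run_until_failure` and the `max_runs` cap are illustrative and not part of the fuzzer or Legion; it also assumes, as the shell loop does, that a failed verification shows up as a non-zero exit code):

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=1000):
    """Rerun `cmd` until it exits non-zero or `max_runs` is reached.

    Returns (passes, failing_result), where `failing_result` is the
    CompletedProcess of the first failing run, or None if every run passed.
    """
    passes = 0
    for _ in range(max_runs):
        # Capture output so the failing run's log lines can be inspected later.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return passes, result
        passes += 1
    return passes, None


if __name__ == "__main__":
    # Stand-in command that always fails; replace it with the real
    # reproduction command as a list of arguments.
    demo_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
    passes, failure = run_until_failure(demo_cmd, max_runs=5)
    print(f"passes before first failure: {passes}")
    if failure is not None:
        print(f"exit code of failing run: {failure.returncode}")
```

Unlike the unbounded `while` loop, this version stops after `max_runs` attempts, which is convenient when the failure is rare and the loop runs unattended.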

elliottslaughter commented 1 month ago

Here are the Legion Spy logs from a set of good and bad runs (i.e., that pass or fail the fuzzer's verification):

Unfortunately I cannot actually run Legion Spy on either of these log files:

$ pypy3 legion/tools/legion_spy.py -lpa bad_*.log
Reading log file bad_0.log...
Traceback (most recent call last):
  File "legion/tools/legion_spy.py", line 14751, in <module>
    main(temp_dir)
  File "legion/tools/legion_spy.py", line 14623, in main
    total_matches += state.parse_log_file(file_name)
  File "legion/tools/legion_spy.py", line 13229, in parse_log_file
    if parse_legion_spy_line(line, self):
  File "legion/tools/legion_spy.py", line 12834, in parse_legion_spy_line
    assert p1 not in state.point_point
AssertionError

I get the same result with the good_*.log files.

lightsighter commented 1 month ago

Fix here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1476

This is the best bug found yet: a really subtle one involving the interaction of control replication with some very unusual patterns of index space task launches.

elliottslaughter commented 1 month ago

I finished 400 runs of the specific seed, and a full suite of single- and multi-node runs, and I think the issue is fixed. The only remaining issues I see at this point are the Realm failures reported in #1745.
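
One way to read the 400 clean runs, offered as a rough sketch and not as anything stated in this thread: if each run failed independently with probability p, then 400 consecutive passes would occur with probability (1 - p)^400. Solving that for p at a chosen confidence level gives an upper bound on the residual failure rate. The function name and the independence assumption below are illustrative.

```python
def failure_rate_upper_bound(clean_runs, confidence=0.95):
    """Upper bound on the per-run failure probability after `clean_runs`
    consecutive passes, assuming runs are independent.

    Solves (1 - p) ** clean_runs = 1 - confidence for p.
    """
    if clean_runs <= 0:
        raise ValueError("clean_runs must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)


if __name__ == "__main__":
    bound = failure_rate_upper_bound(400)
    print(f"95% upper bound on failure rate after 400 clean runs: {bound:.4%}")
```

For 400 runs the bound comes out just under 3/400, the familiar "rule of three" approximation, so under these assumptions a failure rate above roughly 0.75% per run would be unlikely to have gone unnoticed.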

lightsighter commented 1 month ago

Merged.