Fuzzer reproduces low-probability Realm crashes on Sapling

elliottslaughter commented 2 months ago

These are known Realm crashes that have been previously reported, but in the past reproducing them has been very tricky. The good news is that the Fuzzer can be used to reproduce these crashes, and it works directly on Sapling, sidestepping the need to mess with CI or containers.

Note that all reproducers are on Sapling.

Failure modes

Here's a sample of the failure modes that I am able to reproduce. Note that these are essentially random, so unlike typical fuzzer configurations, I'm not sure there's anything inherent to specific seeds that provides any meaning here; we're just getting (un)lucky in particular runs, leading to various errors.

MutexChecker:

[0 - 7f3d47a57c40]    9.358371 {6}{mutex}: over limit on entry into MutexChecker(xpair push,0x7f3cce70ac30) limit=1 actval=1 at stack trace: 10 frames

ChunkedRecycler:

fuzzer: /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1002: Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler() [with T = Realm::GASNetEXEvent; unsigned int CHUNK_SIZE = 64]: Assertion `cur_alloc.load() == 0' failed.

Network not quiescent:

[0 - 7fe84eeedc40]    9.147451 {6}{realm}: network still not quiescent after 10 attempts
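
Since a full run produces a lot of output, it can help to bucket logs by which of the three signatures above they contain. Here is a minimal sketch; `classify_failure` is a hypothetical helper (not part of the fuzzer), and the match strings are simply copied from the sample messages above, so they may need adjusting if the wording differs between Realm versions.

```shell
# Hypothetical helper: print which known failure signature a log file contains.
# Usage: classify_failure path/to/logfile
classify_failure() {
  log="$1"
  if grep -q -e 'over limit on entry into MutexChecker' "$log"; then
    echo "MutexChecker"
  elif grep -q -e 'ChunkedRecycler.*Assertion' "$log"; then
    echo "ChunkedRecycler"
  elif grep -q -e 'network still not quiescent' "$log"; then
    echo "NetworkNotQuiescent"
  else
    # None of the three sample signatures matched.
    echo "unknown"
  fi
}
```

The checks run in a fixed order and only the first match is reported, so a log containing more than one signature is counted once.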

Instructions

To reproduce:

cd /scratch/eslaught/fuzzer-experiment-6-debug-multi
source experiment/sapling/env.sh
./experiment/sapling/run_all_tests.sh

To rebuild (note you'll have to become me on the machine, or you can make a copy of all the build directories):

srun -n 1 -N 1 -c 40 -p all --exclusive --pty bash --login
cd /scratch/eslaught/fuzzer-experiment-6-debug-multi/legion/build_debug_multi
make clean && make install -j20
cd ../../build_debug_multi
make clean && make -j20

I can also provide from-scratch reproducer instructions if you'd prefer to do this in your own account.

Fuzzer version: https://github.com/StanfordLegion/fuzzer/commit/3ef4c19266907eee6c5df86d9dc25b79b47f2d4b

Legion version: afd91610471c0f1bf1fdf37d56759c9e5da8b763

lightsighter commented 2 months ago

How long does it take to reproduce some of these?

elliottslaughter commented 2 months ago

The only way I've found to reproduce is to do an entire run of 10k tests. I suspect that we're talking about a very rare race condition, so the only way to make it happen is to slam the machine for an extended period of time to force the threads to interleave in a very particular way. In the end it probably takes 15 minutes to get some interesting failures, but it requires the full script run to do so.

The good news is that you can still use REALM_FREEZE_ON_ERROR so it should be possible to get all of these in a debugger.
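
For anyone who wants to leave this running unattended, the "keep slamming the machine until something breaks" approach can be sketched as a retry loop. This is only a sketch under a few assumptions: `run_until_failure` is a hypothetical helper, the command passed to it is assumed to exit non-zero when a test fails, and `REALM_FREEZE_ON_ERROR=1` is the value I would expect to enable the freeze behavior.

```shell
# Hypothetical helper: rerun a command until it fails or attempts run out.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
run_until_failure() {
  max="$1"
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    # Assumption: with REALM_FREEZE_ON_ERROR=1 a crashing process pauses so a
    # debugger can attach; the loop only moves on once the command returns.
    if ! REALM_FREEZE_ON_ERROR=1 "$@"; then
      echo "failed on attempt $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $max attempts"
  return 0
}
```

For example, `run_until_failure 5 ./experiment/sapling/run_all_tests.sh` would repeat the full script up to five times, assuming the script reports failures through its exit status.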

elliottslaughter commented 2 weeks ago

Copying from @apryakhin in https://github.com/StanfordLegion/legion/issues/1305#issuecomment-2427555302

@elliottslaughter This has probably been discussed already somewhere else, but I recall you did some "fuzz testing" a relatively short time ago that exposed a number of bugs. Would you be able to describe what sort of fuzz tester it is? Or perhaps point to a place that has some context on it. I would be open to discussing integrating the fuzz testing for Realm, either as a standalone tool that we run/maintain ourselves or as something derived from what you have already done.

Here is the issue with the fuzzing results. These replicate already-known failure modes in Realm and so they provide an easier way to reproduce (potentially). See the top of this issue for instructions.

The fuzzer lives here: https://github.com/StanfordLegion/fuzzer

There is a short design document on that page that gives an overview of how things work.

Overall, one of the things to understand is that this is a Legion fuzzer. It relies pretty fundamentally on sequential semantics (at least for the validation part of the check). Now of course, some of the principles might also apply to direct Realm fuzzing as well. But you would need to determine what the goals are for fuzzing Realm, how you would ensure that all generated programs are valid, how to validate results, etc. I personally suspect the answers will look quite different.

You're welcome to run the Legion fuzzer and I think there will be some benefit to doing so, especially if I can add GPU tasks and such. It already does single and multi-node in various modes. But fundamentally, what we're talking about today is fuzzing Realm through Legion vs. doing it directly.

StanfordLegion / legion

Fuzzer reproduces low-probability Realm crashes on Sapling #1745