Ideas and resources for fuzzing shared-everything threads

fitzgen commented 1 week ago

This issue is intended as a place for discussion about approaches we can all use to improve our assurances that we correctly implement this proposal. A place to share relevant papers, moonshot ideas, etc...

Unlike many other Wasm proposals,[^0] we can't effectively exercise this proposal's additions to the language by just simply generating pseudo-random Wasm modules that contain this proposal's new instruction and feeding those into the system under test.

[^0]: To some degree everything here applies equally to the original threads proposal, but shared-everything threads expands the scope much further.

Off the top of my head, we need to additionally do things like

actually spawn multiple threads
check whether the dynamic trace of accesses across threads is valid given the happens-before relationships imposed by those operations used to make those accesses
ideally avoid dead locks in these generated concurrent programs

fitzgen commented 1 week ago

FWIW, @abrown has a pull request implementing support for shared-everything threads in wasm-smith: https://github.com/bytecodealliance/wasm-tools/pull/1756

This is good for exercising validators, but won't give us much as far as engine-implementation correctness is concerned.

tlively commented 1 week ago

Ideas I've had include:

Use an abstract interpreter that collects all possible executions as the fuzzing oracle (but only extraordinarily simple programs would avoid combinatoric explosion in possible results).
Start by generating random happens-before/reads-from graphs that will produce reasonable numbers of possible results, then generate programs that embody those graphs.
Implement TSan instrumentation for WasmGC to check that fuzzer-generated programs that are intended to be race-free are in fact race-free (also useful for checking real-world programs for bugs).
Use methods other than differential fuzzing (e.g. translation validation) to find misoptimization bugs.

I'd be very interested in reading any available literature in this area.

fitzgen commented 1 week ago

Some things that seem vageuly related, but which I haven't actually read yet:

cfallin commented 1 week ago

The hardest part by far seems, to me right now at least, to be the generation of programs that terminate (as Nick mentions above "ideally don't deadlock").

I wonder if one could hit both goals -- actually use the primitives, and terminate -- with a grammar-based random testcase generator that builds out of primitives such that the testcase must terminate and must produce a deterministic result, by construction. Then test with the composition of two levels of randomness: (i) building test cases, (ii) running each test case many times, perhaps with artificial perturbation (token-passing in the style of deterministic execution proposals like DMP but with token order changed by some random seed). The oracle condition is: results match across perturbed executions.

The grammar could include primitives like: spawn into two subthreads, (recursively generate subprograms), join; lock/unlock a spinlock, carefully constructed so the spinlock is scoped to a level in the spawn hierarchy and locks/unlocks are well-paired; similar, with the atomic wakeup primitives, well-paired between subthreads; atomic-add on one single atomic, with the returned values summed into a checksum and not used otherwise. No other shared data; add other random computations as per normal fuzz testcase generation, perhaps with normal loads/stores to disjoint (per-subtask) addresses and to TLS. (Importantly, no memory accesses to memory "protected by a lock" -- not racy but would introduce nondeterminism as critical sections commute.)

This seems to possibly have an advantage of avoiding more complex oracles (building happens-before graphs and verifying dynamically or somesuch), with the usual downside of grammar-based fuzzing that one only gets the coverage one plans to generate (but maybe this gives enough coverage by exercising all the building blocks? empirical question for sure).

fitzgen commented 6 days ago

Thought: If the engine supports interruption and we are checking dynamic traces for validity, then we don’t need to generate terminating and non-deadlocking programs. We can just check as much of the trace as we got up to the interruption point.

fitzgen commented 6 days ago

Another source of possible inspiration would be Jensen: https://github.com/jepsen-io/jepsen

Jepsen is a Clojure library. A test is a Clojure program which uses the Jepsen library to set up a distributed system, run a bunch of operations against that system, and verify that the history of those operations makes sense.

WebAssembly / shared-everything-threads

Ideas and resources for fuzzing shared-everything threads #84