Verification of addressing / memory accesses with VeriWasm or similar

cfallin commented 1 year ago

I've been working on VeriWasm in Wasmtime the last few weeks, and decided to dump a summary of my investigation so far and ideas for the right approach to take. (I'm doing this partly because I've been asked to context switch back to something else, now that it's clear that VeriWasm proper cannot work and we need a new tool; I estimate it would take ~1 month to build what I describe in the last section below.)

(This may be better as an RFC, and I can clean it up, expand it, and post it as one once I have time to properly do so!)

Summary

We should include built-in support in Wasmtime to verify that compiled code from Wasm modules is accessing only the memory that it is supposed to access. Such verification would take the form of ahead-of-time static analysis on the code, so that it would not require dynamic checks; and it should trust as little as possible in the existing compiler stack, to increase the confidence we have in its conclusions.

The ideal would be to verify machine code bytes with no other input, but in practice, Wasmtime will have to tell the tool what to assume about details such as runtime data structures; and certain kinds of analysis, like full recovery of a CFG in the presence of indirect branches, may require whole-program analysis that is too expensive to be practical.

This issue proposes a scheme that embeds a notion of "memory capabilities" in CLIF, along with (static) assertions on values (mostly with respect to ranges), and carries these through to the VCode level in Cranelift. The Wasm lowering would be modified to annotate every load/store with an associated memory capability that permits it; and we can then check the uses of these capabilities at the other end of the pipeline, after instruction selection is done. This is inspired by, but in some sense the dual to, the approach taken by VeriWasm.

History: VeriWasm

In Johnson et al.[^1], the VeriWasm tool is described, with an approach that performs a static analysis of disassembled machine code to prove that all heap accesses in AOT-compiled Wasm code are properly to the Wasm heap, and cannot escape the sandbox. (The tool also checks other properties, such as control-flow integrity, but that is outside the scope of this issue.)

[^1]: Evan Johnson et al. "Довер ́яй, но провер ́яй: SFI Safety for native compiled Wasm." In NDSS 2021. https://cseweb.ucsd.edu/~lerner/papers/wasm-sfi-ndss2021.pdf

VeriWasm was designed to verify code produced by Lucet, another AOT Wasm compiler that has since been subsumed by Wasmtime and EOL'd.

VeriWasm works by lifting machine code to a simple IR that describes the semantics of the code in a partial way -- recovering just enough of the dataflow to be able to see all address computations -- and then performing iterative dataflow analysis over that IR using an analysis lattice.

The analysis values computed to describe program values include:

HeapBase: this value is the heap base;
Bounded4GB: this value is less than 2^32;
and some others (e.g. for RIP-relative constant loads) that we won't discuss here.

The basic idea of the analysis is to mark a result of any 32-bit instruction as Bounded4GB in the abstract interpretation, and then allow only memory accesses of the form [HeapBase + Bounded4GB].

Issues with porting VeriWasm to Wasmtime

The obvious next step is to port VeriWasm to work with Wasmtime instead of Lucet. To understand why this is difficult, it may be useful to know how Lucet's generated code differed from Wasmtime's. Both use Cranelift, but there are two key deltas:

Lucet used simpler data structures: its implicit first argument to every Wasm function was a pointer to the start of the (one single) Wasm memory, and this memory was always statically-bounded; there was no separate vmctx, and no multi-memory or multi-table support. Furthermore Lucet was normally used (and VeriWasm only supported) static bounds-checking, i.e., use of guard regions up to 4GiB heap size.
The version of Cranelift used at the time that Lucet and VeriWasm were developed was much simpler than what we have today. In particular, the address-mode support was fairly primitive and predictable: heap accesses were always of the form [rBase + rOffset], where rBase was the heap base and rOffset was the offset (into the Wasm heap), computed separately. In contrast, Wasmtime uses modern Cranelift which can fold more work into addressing modes and perform more clever instruction selection. It also includes Spectre-guard logic, which is difficult to analyze from first principles without pattern-matching the whole thing.

Taken together, these two facts mean that we need a much more general analysis. In principle, we need to be able to:

Compute a summary of an accessed memory address that symbolically encodes its base (possibly from a field in a vmctx structure), its offset, and the range of that offset (possibly from another field in a vmctx structure);
Reason about value ranges, additions, and scaling (left-shift/multiply) operations, all on the field of integers modulo 2^32 or 2^64 (i.e., accounting for wraparound);
Solve inequalities, to show that values are within dynamically-defined bounds.

Example

Here is the simplest possible dynamic bounds-check example, compiled for x86-64 by Wasmtime, with annotations:

0000000000000000 <_wasm_function_0>:
       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   44 8b c2                mov    %edx,%r8d                    ;; save Wasm address, and truncate to 32 bits
       7:   4c 89 c0                mov    %r8,%rax
       a:   48 83 c0 04             add    $0x4,%rax                    ;; compute end: address, plus 4-byte load size
       e:   0f 83 02 00 00 00       jae    16 <_wasm_function_0+0x16>   ;; if that overflowed, trap
      14:   0f 0b                   ud2
      16:   48 8b 57 58             mov    0x58(%rdi),%rdx              ;; load memory-0 length from vmctx+0x58
      1a:   48 31 c9                xor    %rcx,%rcx                    ;; clear rcx, to use in Spectre guard below
      1d:   4c 03 47 50             add    0x50(%rdi),%r8               ;; add memory-0 base (loaded from vmctx+0x50)
      21:   48 39 d0                cmp    %rdx,%rax                    ;; compare computed end-offset to length
      24:   4c 0f 47 c1             cmova  %rcx,%r8                     ;; if out-of-bounds, move zeros (in rcx) over pointer; Spectre guard
      28:   41 8b 00                mov    (%r8),%eax                   ;; do the load
      2b:   48 89 ec                mov    %rbp,%rsp
      2e:   5d                      pop    %rbp
      2f:   c3                      ret

This is compiled from

(module
  (memory 0 10)
  (func (param i32) (result i32)
        local.get 0
        i32.load))

with wasmtime compile --dynamic-memory-guard-size 0 --static-memory-guard-size 0 --static-memory-maximum-size 0 test.wat.

There are several things going on here that we need to be able to analyze:

We need to be able to represent each value symbolically:
- Wasm address, which we know is within 32-bit range (Bounded4GB from VeriWasm's Lucet analysis);
- Wasm address plus 4, which we'll connect with the size of the load later;
- somehow preserve that property by understanding the jae/ud2 trap sequence and flags output of add;
- know that rcx being zero is significant later (so, track either a Zero state or all known constant values);
- represent "memory base plus offset known to be up to 4GB plus 4";
- understand the comparison between the length and rax (which up to this point is just an arbitrary 32-bit value in the program), and join this with the cmova below ti understand the "clamp to zero if out of bounds" idiom; then represent "pointer to this offset in heap, unless this offset plus 4 is above this other value"... symbolically
- use all of the above facts to validate that the load of r8 at offset 0x28 is legal.

The basic issue here is that we need to (i) build up a massive body of facts about all register values in the program, and (ii) only later pattern-match uses of certain combinations of them to generate legal pointers into dynamically-bounded memory regions. So we need to generally symoblically summarize expressions, and we need some algebraic language of additions, scaling operations, and bounding operations that can be simplified, normalized, and used to do this recognition.

Dead-end Prototype: Symbolic Expressions

In this branch of VeriWasm (which goes with this branch of Wasmtime and this branch of yaxpeax-core), I prototyped a system that basically attempted the above. It defined an analysis lattice that could represent additions, scaling (multiplication by a constant), umin (unsigned min, a bounding operator, the basic nonlinear element), loads, known constant values, and a few miscellaneous atoms like vmctx, "some pointer into .text", and Unknown.

The analysis used a set of rewrite rules to try to simplify expressions: e.g., an arbitrary 32-bit instruction might produce UMin(Unknown, Const(0xffffffff)), then adding 4 might produce Add(Const(4), UMin(Unknown, Const(0xffffffff))) which simplifies to UMin(Unknown, Const(0x100000003)) because Add distributes over UMin. (Wrapping is handled later, when we analyze ranges; the operators are defined not to wrap, because otherwise one needs to handle arbitrary piecewise segments rather than ranges and it just gets too messy.)

The Wasmtime branch above was modified to generate expressions in this expression language that signify memory regions that are allowed to be accessed. This is then used alongside the analysis results in the above symbolic-expression form, with an analysis that computes the affine bound (constant, or base plus some scaled symbolic dynamic parameter), to try to prove each memory access safe.

This more-or-less works (not quite in the above branch, but in principle can be fixed) for static memories, but falls down completely for dynamically-bounded structures, either heaps or tables. The basic reason is that to do the in-bounds proof, one basically needs to build a full computer algebra system that can manipulate the symbolic expressions above in order to solve systems of inequalities. This is not only far too complex for a purpose-built tool that we need to trust, but is also very unlikely to be scalable.

Fixing the Approach

I believe we can fix the verifier by using two basic realizations:

The above is a kind of "forward search", in that it eagerly generates all possible facts about all values in the program, and then later tries to find a path through those facts to a proof that an address is safe.
This is a structure that is forced by the lattice-based dataflow analysis (which works in the forward direction).
This eager-expansion approach is also forced by the lack of any guidance from the compiler: even though when building the CLIF, cranelift-wasm knows what it is doing and what ranges values should have with respect to bounds checks, the analyzer intentionally tries to derive these facts from first principles (ostensibly to exclude the compiler from the TCB, but this also happens to make the problem intractible!)
Instead, we could try an approach that lets the compiler leave us hints about what facts we should be deriving and carrying forward about each value; and mark only those values that are important for address computation. This doesn't have to put the comiler in the TCB; rather, the hints are a sort of assertion or ascription that is checked. (This is e.g. the difference between a proof search tool and a proof verifier.)

In a phrase, the approach I want to propose is proof-carrying code.

Proposal: Memory Capabilities and Proof-Carrying Code

The general approach is in four steps::

Describe the areas of memory that the code is allowed to access, as part of the CLIF function body of any function (verified by this framework at least; this will not be required by default for all functions). These are essentially capabilities: they describe a set of conditions (address must satisfy these properties) and in return grant permission to load or store. Loads and stores in the CLIF can then be annotated with these capabilities.
- These are likely to be placed in the preamble of the IR, and use global values or similar to describe base and length. They may look surprisingly similar to the old heap definitions that we had; the difference will be that they describe all accesses along pointer-chasing path. So for example: a capability to access vmctx[0..sizeof(vmctx_t)]; another capability, chaining off of that, to access (*(vmctx+8))[0..*(vmctx+16)] (i.e., a dynamically-bounded memory with base and length fields in the vmctx struct); and another for a table.
Add assertions on certain SSA values that describe their ranges. These should be placed at defs (operators or blockparams) of the SSA values, such that validation can be done in linear time, without an iterative dataflow analysis. We should write a set of semantics for some operators sufficient to validate the kinds of assertions we want to make. For example, if we know v0 < 10 and v1 < 10, given v2 = iadd v0, v1 the framework should be able to validate the assertion that v2 < 20. Perhaps we also have some easy implicit ones, for efficiency: e.g., every i32-typed value is implicitly <= u32::MAX.
- It's not entirely clear to me all of the kinds of assertions we will want. At least inequalities that bound the range of values; perhaps also equalities; perhaps also symbolic assertions, such as "this value is the base address of this memory capability", if we need to break down access validation into finer-grained steps.
1. We then carry these assertions through to the VCode. This is simpler than it may at first seem, IMHO: we know already that a given vreg corresponds to a given SSA value, so we "just" need to transcribe the annotations onto the vregs. The egraph optimization infrastructure will need to replicate annotations as it clones nodes during elaboration, and rewrite rules that could affect address computation will need to create new assertions, but this is necessary work: of course it is the case that we should need to prove that our handling of iadd or uextend on an address is correct!
We then check the VCode, prior to register allocation: validate each assertion, and confirm that memory accesses use legal capabilities. (The ABI code will need to hold an ambient capability for the stack, unless we special-case those instructions.) Checking prior to register allocation is important because it allows us to leverage SSA to be flow-insensitive in our analysis: we don't need to track "v1 at this program point" as we evaluate an assertion, but only v1 at all.
- Note that taken alone, checking before RA means that RA2 is part of the TCB. However, we have a symbolic translation validation (the checker) in RA2 already; we can turn that on, and then the combination of this "assertion + memory capability validation" and the regalloc TV are enough to give a VeriWasm-like safety property.

martin-fink commented 10 months ago

Hi! Seeing that you have started to work on this, I just wanted to chime in that I've been working on a 'memory access verifier' too! :)

My approach follows what veriwasm and your prototype do: I build 'symbolic values' that model address computations/bounds checks and solve inequalities on them. As you have mentioned above, that is quite complex and not super scalable. The code is quite hacky and there's much to be improved in terms of performance and elegance, but it works for static and dynamic heaps (with or without spectre guards) and detects if e.g. https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-ff4p-7xrq-q5r8 is reintroduced into the code base when fuzzing with the checker enabled. You can find it at https://github.com/martin-fink/wasmtime.

I'm very excited to see how your PCC approach will differ from what I've built. :)

cfallin commented 10 months ago

Woah, that's really neat -- I wish I had known about this earlier!

Looking it over, I definitely agree that SymbolicValue is close to what I had tried with my first "veriwasm 2.0" branch, an AST of of operator nodes. I'm hopeful that the current PCC approach will be more scalable since the assertions (facts) aren't recursive ASTs but shallow facts instead, and the producer is expected to leave a flattened sequence of assertions along each intermediate value in the CLIF. I've sketched out the dynamic-memories approach I want to take and it should keep to a fairly narrow set of assertion types I think, without a general inequality sublanguage or solver, tailored to the code that cranelift-wasm actually generates.

It's really cool to see a working design point that can validate dynamic memories as well; thanks for this and I'll study it in more detail!

bytecodealliance / wasmtime