GaloisInc / crucible

Crucible is a library for symbolic simulation of imperative programs
610 stars 42 forks source link

Support opaque pointers in `crucible-llvm` and `uc-crux-llvm` #1075

Open RyanGlScott opened 1 year ago

RyanGlScott commented 1 year ago

After upgrading llvm-pretty and llvm-pretty-bc-parser to support LLVM's opaque ptr type (see https://github.com/elliottt/llvm-pretty/issues/102 and https://github.com/GaloisInc/llvm-pretty-bc-parser/issues/177, respectively), we will also need to upgrade crucible-llvm and uc-crux-llvm to handle them.

crucible-llvm

To a large extent, crucible-llvm is already designed to handle opaque pointers, as the LLVMPtr type intentionally does not include a pointee type. That being said, there are still various places in crucible-llvm that inspect pointee types of LLVM pointers, which will need to be changed to support the ptr type. Here are all the places that I have found:

uc-crux-llvm

The largest challenge that I can foresee is figuring out how to deal with PtrShape:

https://github.com/GaloisInc/crucible/blob/ad4a553487eeb5c6bbb5abf4bde26af905bf0254/uc-crux-llvm/src/UCCrux/LLVM/Shape.hs#L85-L88

If I'm reading this correctly, Shape is indexed by the pointee type to some degree, which won't work in an opaque pointer setting. That being said, I'm not quite sure how this code works, so I might be misunderstanding what is going on here.

kquick commented 1 year ago

I agree on changing the equality perspective for overrides.

langston-barrett commented 1 year ago

The dependency on pointer types is deeply, deeply ingrained into UC-Crux-LLVM. UC-Crux attempts to uncover function preconditions by repeatedly executing a function starting from "minimal" symbolic inputs and attempting to "fix" any undefined behavior it encounters (see this post). In particular, when allocating a pointer in the input, it uses the type of that pointer to determine the size of the allocation, and later, to deduce reasonable preconditions that might be applicable to the backing memory (e.g., if it's a pointer to an array of integers, some of those integers might have some maximum bound). Excising the use of pointer types would be a substantial undertaking.

RyanGlScott commented 1 year ago

I certainly believe you when you say that this would be a substantial undertaking. That being said, I am curious to know how uc-crux-llvm's current approach to tracking pointee types works, if only to better my understanding of how much work would be required:

In particular, when allocating a pointer in the input, it uses the type of that pointer to determine the size of the allocation, and later, to deduce reasonable preconditions that might be applicable to the backing memory (e.g., if it's a pointer to an array of integers, some of those integers might have some maximum bound).

At first glance, this approach seems to be compatible with opaque pointers. Even if pointer types are themselves opaque, their allocation sites are always annotated with their type. For example, an alloca <ty> instruction will return something of type ptr, but the <ty> should tell you everything you need to know about size, preconditions, etc.

The part where this gets tricky is when LLVM reinterprets the underlying memory at a different type, be it through bitcasts, getelementptr, or some other mechanism. For example, see here for a tricky example where a pointer is allocated with a "pointee type" of [2 x [2 x i8]], but a subsequent store instruction reinterprets the underlying memory at type i32. How does uc-crux-llvm deal with this sort of situation?

langston-barrett commented 1 year ago

Even if pointer types are themselves opaque, their allocation sites are always annotated with their type.

Except for heap allocations :slightly_smiling_face: But anyway, the problem is that UC-Crux attempts to execute individual functions from the program, and must allocate memory itself to do so. Consider this function:

int f(int *x) {
    return x[5] + 1;
}

UC-Crux would first execute this with an "intial pointer" (see the blog post). It would fail when reading from x, because there wouldn't be any allocation backing it. On the next attempt, UC-Crux would use the function's signature to deduce that x is a pointer to an array of integers, and allocate an array of integers with length 6. Executing the function this time results in a signed integer overflow, so UC-Crux would deduce that the integer at index 5 must be signed-less-than INT_MAX. After this modification, execution would succeed.

These preconditions are stored in a Shape, as you highlighted above. The Shape is tied to (and type-indexed by) the LLVM type of the function argument, and the logic for deducing the preconditions depends on the type/shape involved. Shapes are indexed by Cursors, which also use high-level notions like struct and array types, rather than byte offsets.

To handle opaque pointers, this logic would all need to be based on offsets. Additionally, UC-Crux would either have to make shot-in-the-dark guesses about allocation sizes, or would have to instrument load/store/getelementptr instructions to track the types used and relate them back to the corresponding part of the input being loaded from.

How does uc-crux-llvm deal with this sort of situation?

It gives up a lot! It can simply fail to find adequate preconditions.

RyanGlScott commented 1 year ago

Thanks, this is helpful to know.

For the time being, I suppose we could simply not support opaque pointers in uc-crux-llvm, and we could have uc-crux-llvm always invoke Clang with the necessary options to disable opaque pointers in LLVM bitcode. Unfortunately, this is not a viable long-term strategy, as LLVM 17 will drop support for non-opaque pointers entirely. We will need to have some kind of coping strategy if excising uc-crux-llvm of pointee types proves too difficult.

langston-barrett commented 1 year ago

Yeah for sure!

We will need to have some kind of coping strategy if excising uc-crux-llvm of pointee types proves too difficult.

We can always replace pointer types with a pointer to an unbounded array of bytes, UC-Crux will just give you bad results (and give up often).

The original UC-symex paper had a lazy memory model that let them avoid this issue around allocation sizes, but again, that's a big project.

kquick commented 3 months ago

I think another, and possibly more sound approach would be to use the type specified at the usage points rather than the function declaration. Using the previous example:

int f(int *x) {
  return x[5] + 1;
}

the resulting llvm bitcode (from clang 17 with only opaque pointers) is:

define i32 @f(ptr nocapture noundef readonly %0) ... {
  %2 = getelementptr i32, ptr %0, i64 5
  ...
}

From the above, uc-crux-llvm should still be able to infer that the pointer can be interpreted as an array of at least 5 i32 elements and the value of that 5th element may have constraints.

The soundness aspect I referred to earlier is that if the pointer is dereferenced with different types, uc-crux-llvm should be able to identify preconditions for all of the type uses rather than just those for the declared input type.

I realize this is probably a significant extension of the current capabilities of uc-crux-llvm (or perhaps not: I haven't looked at the implementation details), but I think it is probably a reasonable path forward. Let me know if I've oversimplified or overlooked something here, however.

langston-barrett commented 3 months ago

From the above, uc-crux-llvm should still be able to infer that the pointer can be interpreted as an array of at least 5 i32 elements and the value of that 5th element may have constraints.

Note that pointer type information was static, so UC-Crux had an idea of what shape to make arguments before executing the function. While in this simple example one could reasonably propagate the type at the use-site back up to the argument, in general that would require heap-aware data-flow analysis: for example, the pointer coming from the function argument could be stored to the heap, only later to be read out and used in some callee of a callee of the function under analysis.

We could consider dynamic tracking of pointer usages, but this gets weird if e.g., we add a symbolic offset to a pointer before casting it to a particular type.

Regardless, anything based on GEP types would only work until they get rid of GEP.

IMO the correct solution is to view memory as untyped, and to expand allocations when the function under analysis hits a memory error that indicates that an allocation was too small.