Strange term in address failure

kren1 commented 7 years ago

I've come across a possible bug in fuzzball.

Given a program like:

#include <stdint.h>
uint32_t g_54 = 0;
uint32_t g_56 = 3162800460;

static uint32_t magic_symbols[4] = {0,0,0,3};

void main ( void ) {
    g_54 = *magic_symbols;
    if( g_54 < 0 || g_54 > 0) {
        exit(0);
    }
    g_56 ^= 0 < g_54;

    printf("g_56: %u\n", *(&g_56));
}

If I run it with FuzzBALL like so:

#expanded version
fuzzball -solver-path .../stp/build/stp -linux-syscalls -symbolic-string 0x0804a01c+16 ./a.out -- a.out
#or non fixed address one
gcc test.c && fuzzball -solver-path .../stp/build/stp -linux-syscalls $(objdump -D a.out | grep -P "magic_symbols"  | awk '{print  "-symbolic-string 0x"i $1 "+16"}' | tail -n1 |  head -n1) ./a.out -- a.out

I get

Fatal error: exception Failure("Strange term  cast(cast(t2_36409:reg32_t)L:reg8_t)U:reg32_t ^ 0xbc84814c:reg32_t  in address")

Running the same executable natively works fine. Could anyone shed some light on the issue?

Gcc version is: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5).

I'm also including the actual binary: strangeTerm.zip

smcc commented 7 years ago

Sorry for being slow in responding to this. I looked into it a bit, and what was going on seemed rather complicated, so it's been hard finding time to give you a more complete answer.

Certainly that sort of failure output is something we'd like to avoid happening, but as I've investigated so far, nothing raised by this example is an obvious bug; it's coming from the interaction of a sequence of behaviors each of which is usually OK on its own.

This does mean though that there are a lot of possible workarounds. Perhaps you've already worked around the problem for your own usage, and without context some workarounds might be better than others.

I think my most general comment that gets at a workaround is that part of the reason this weird behavior is happening is that printf is part of the code that FuzzBALL is symbolically executing. This is something that often surprises new FuzzBALL users, because from the source level we often think of printf as being an uninteresting black-box, but because FuzzBALL executes the whole binary and doesn't distinguish between the main program and libraries, the complexity of printf affects FuzzBALL's analysis. In a round-about way, the path that leads to the error message is also specifically related to the fact that printf is printing the result in decimal. So one simple workaround, that often helps performance in other cases as well, would be to use hex or octal printing in printf instead.

Also if the point of the printf is just to see the results of your experiment rather than wanting it to be an aspect of the experiment itself, you could print data with FuzzBALL mechanisms instead. For instance if you know at the binary-code level where a value is stored, you can use the -tracepoint option to print it. This is obviously less convenient to set up, though.

In thinking about this example, it occurs to me that it would be interesting to give FuzzBALL a mode where it replaced calls to printf with outside-the-program printing that could have more control over how it handled symbolic values. That would be close to the best of both worlds for an example like this, but it would be a bit of work to implement. (In part because OCaml's printf has some subtle differences from C's.)

kren1 commented 7 years ago

Thanks for getting back to me on this and pointing me towards the printing in hex workaround, I did not think of printf as the issue.

On a slightly tangential note this Starange term in address error pops up quite a lot, often with different terms (by term I mean something like cast(cast(t2_36409:reg32_t)L:reg8_t)U:reg32_t ^ 0xbc84814c:reg32_t). Do you think they could point to different bugs?

smcc commented 7 years ago

There certainly could be multiple problems that could lead to that kind of message, but my hunch would be that there aren't that many different ultimate causes.

FuzzBALL's brittleness in this area comes originally from the fact that at the binary level, there's no clear distinction between pointer values and integer values. Despite there not being a clear distinction here, FuzzBALL wants to make one in order to support a variety of strategies to use when a symbolic value is used as part of a memory address. When executing a complete program when only making inputs symbolic, it's usually typical that the inputs won't be pointers, since pointers wouldn't have meaning outside the program. But you can also use FuzzBALL to symbolically execute just a single function, so if you make the arguments to an arbitrary function symbolic, some of them might represent pointers. In that sort of context, when a purely symbolic value is used as a pointer, it's not very worthwhile to think of it as ranging over all the possible numeric values of the pointer, because there are a lot of them, but the particular bits of the pointer value don't affect how the code will run. So FuzzBALL has a feature to support this kind of situation that we call a symbolic memory region: the symbolic value is treated as if it is an opaque pointer, and accesses to it populate a memory region with no fixed address, but which is used for all accesses through the pointer.

On the other hand you can also have a symbolic value participating in an address when it is treated like an integer; commonly this arises from indexing into an array or other similar data structure. In this case the number of feasible values of the index is likely smaller, but they may be semantically different, so can either treat the choice between them like a branch, or if it can find a not-too-large bound on the index size, it can symbolically execute all the memory operations at once. The issue that's tricky is how to choose between the symbolic region and symbolic index treatments, when all we really have is different concrete and symbolic expressions going together in the expression for an address. FuzzBALL uses heuristics to try to make this distinction, and they work the way you'd intuitively expect, but in corner cases they can get confused or be inconclusive. The basic strategy is to split the symbolic expression up at the top-level addition operators, giving a bunch of terms. Then FuzzBALL looks at the structure of each term, and classifies it as to whether it's probably an index, or probably a base (pointer) value. It resolves some ambiguities by using the assumption that among all the terms, exactly one of them is a base address; this is analogous to the C type system rules that integer + integer = integer, pointer + integer = pointer, but pointer + pointer = error.

The failure message you quote here comes from running into a case that's not covered by any of FuzzBALL's term classification heuristics. The thing that's weird is why FuzzBALL has to make this classification in the first place, since all the symbolic values in your program are used as integers, not as pointers. I believe this is coming as the result of two choices inside the printf implementation. First, glibc is using a table to convert numbers less than 10 into ASCII digits. This would be overkill if you were only thinking about ASCII and decimal, because the decimal digits are contiguous in ASCII so you can just add 0x30 ('0'). But presumably the glibc developers were making the code more general to support other character sets and locales, or larger bases like hex, where 0-9 and A-F are not contiguous. So somewhere inside glibc, there's code that looks like "digits_table[n % 10]" to compute the ASCII representation of the low decimal digit of a number. So if n is symbolic, this makes a symbolic expression for a memory load.

If the "n % 10" were to show up directly using a modulo operator at the binary level, this would be an easy case for FuzzBALL's classification heuristics, because even when n has a more complex symbolic expression, modulo is not an operator that makes sense to use on a pointer, so FuzzBALL would correctly classify it as an integer. But the other confounding factor is that as an optimization, the compiler doesn't use the CPU's modulo instruction to compute n % 10. Instead it computes it as "n - 10*(n / 10)", and then in turn it computes "n / 10" not using the division instruction, but by using multiplication by 0xcccccccd (which is the closest integer to 2**35/10) and shifting. Because of this representation, "n" shows up as a top level term, so FuzzBALL tries to classify it, and fails.

smcc commented 7 years ago

I've just pushed a couple of changes that may help with this issue.

First, I improved FuzzBALL's formula simplification so that it should now usually be able to recognize that the complex expression produced by the compiler is really equivalent to "n % 10", leaving it as something that the term classification heuristic can handle.

Second, I've implemented a "-no-sym-regions" option, which you can use to disable symbolic regions entirely if you expect your program doesn't need them. At least in the simplified version of your example you could use this, and it avoids even running the term classification heuristics.

kren1 commented 7 years ago

Thanks for taking further time for looking into this. I don't really understand what is the use of no-sym-regions in this case as I have some symbolic bytes.

Now that I was able to print values with %X in printf, I have another question. Given a slightly modified version of the code above:

#include <stdint.h>
#include <stdio.h>
uint32_t g_54 = 0;
uint32_t g_56 = 3162800460;

static uint32_t magic_symbols[4] = {0,0,0,3};

void main ( void ) {
   g_54 = *(uint8_t*)magic_symbols;

   g_56 ^= 1 < g_54;
   printf("g_56: %X\n", *(&g_56));
//   printf("g_54: %X\n", *(&g_54));
}

which basically removes the constrains and make only 8 bits of g_54 symbolic, I get:

g_56: BC84814C
g_56: BC84814D

which is kind of what I would expect. Fuzzball explores two branches depending on weather g_54 is greater than 1 or not. However if I uncomment the last line, fuzzball explores 256 paths, for all possible values of g_54. I understand that fuzzball executes printf and the syscalls symbolically, so those procedures could branch on each bit. However I'm not happy with this explanation, because that would mean the same would happen for the case where only g_56 is printed. Would you be able to better explain this behaviour?

On a slightly tangential note, is there good way of getting the values of g_54 and g_56 at the end of execution. You mentioned tracepoint , but that requires some sort of vine expression and I'm not sure how to get that. Alternatively is there a way to make fuzzball concretize a variable ie. ask the solver for a solution within current path conditions and than treat that variable concretely I noticed there is a make divisor concrete flag, which would kind of work, but that seems to then blowup the number of paths. I would like to see just one path being explored for that variable after conretization.

smcc commented 7 years ago

You ask several good questions here, but in the interests of not delaying too much more, let me try to answer them separately.

The basic reason why FuzzBALL doesn't explore 256 different paths when only g_56 is printed is that FuzzBALL only ever explores feasible paths. Because "1 < g_54" can only ever be 0 or 1, the two execution paths are lead to printing BC84814C and BC84814D are the only paths within that call to printf that can happen. There are branches where printf decides whether to print other characters, but FuzzBALL can see that, for instance, printing the first digit as "C" would be inconsistent with what it knows about g_56, so it prunes that execution path. Some pruning happens with you print g_54 too, since only 256 paths are explored instead of 2**32. Does that help on that point?

kren1 commented 7 years ago

Yes, I see now how that's consistent. Thanks for your answer.

smcc commented 7 years ago

I'll admit that printing values with -tracepoint is less convenient than writing printf() in your source code (when you have source code), but the Vine IR aspects of it shouldn't really be too intimidating. The more generally inconvenient thing is figuring out where the things you want to print are in your program, but that's sort of inherent to binary analysis.

The two pieces of information you need for -tracepoint are a point in execution represented by an instruction address, and a register or memory location. In this case since g_54 and g_56 are globals, they are in memory and you can get their addresses from "nm". For instance in my sample binary they are at 0x0804a038 and 0x0804a020 respectively. As for a program location that represents the end of execution, a reasonable choice for this example is the "ret" instruction at the end of main, though that wouldn't work if the program called exit() elsewhere, etc. Running "objdump -d" on my sample binary I can see that that is at 0x08048466. So the -tracepoint options I need are -tracepoint 0x08048466:'mem[0x0804a038:reg32_t]:reg32_t' and -tracepoint 0x08048466:'mem[0x0804a020:reg32_t]:reg32_t'. That prints:

At 08048466, mem[0x0804a020:reg32_t]:reg32_t is  cast(! (input0_0_264:reg8_t <= 1:reg8_t))U:reg32_t ^ 0xbc84814c:reg32_t
At 08048466, mem[0x0804a038:reg32_t]:reg32_t is cast(input0_0_264:reg8_t)U:reg32_t

The single quotes I put around the Vine expression follow the rules of your shell, but you usually need them because square brackets and spaces are both special to a Unix shell. If the memory locations were a different type, you'd change the type after the "mem[...]" close square bracket, say mem[0x0804a020:reg32_t]:reg8_t for a char. Registers are named like "R_EAX:reg32_t" for %eax, so you can use that to see the return value. You can refer to values on the stack by using %esp or %ebp to construct an address; for instance a location that is referred to as 4(%esp) in AT&T assembly syntax corresponds to mem[R_ESP:reg32_t+4:reg32_t].

kren1 commented 7 years ago

Thanks for writing a short tutorial on how to use -tracepoint. I think I'm able to use it now. However, I don't think it does what I would like it to do. It seems to be printing the constraints on the variable i.e. it's telling me the symbolic expressions for g_56, g_54 if I'm reading it correctly. What I would like to see is an actual value that can be associated with those memory locations., to generate a test case for example. Is there a way to do that?

bitblaze-fuzzball / fuzzball

Strange term in address failure #20