Open ccasin opened 2 years ago
To see if I'm understanding this correctly, I'm going to write an example. Please correct me if I'm wrong anywhere. Say we have two programs that are identical and look like this:
void* foo(size_t size) {
int* ptr = malloc(sizeof(int));
*ptr = 1;
free(ptr);
return malloc(size);
}
I will refer to the first malloc
as malloc1
and the second malloc
as malloc2
. At the start of the analysis, we have the postcondition and the allocator state called allocator_state
. When WP reachs the malloc2
in the modified binary, two things change:
RAX
in the postcondition with malloc_result_model(RDI, allocator_state)
.malloc_state_update(RDI, allocator_state)
and add the following constraint, malloc_state_update(RDI, allocator_state) != allocator_state
.Now, we reach malloc1
in the modified binary. The changes in this case will be:
RAX
with malloc_result_model(RDI, malloc_state_update(RDI, allocator_state))
malloc_state_update(malloc_state_update(RDI, allocator_state))
and add the constraints:
malloc_state_update(malloc_state_update(RDI, allocator_state)) != malloc_state_update(RDI, allocator_state)
malloc_state_update(malloc_state_update(RDI, allocator_state)) != allocator_state
Now, we visit malloc2
in the original binary. This is the part where my understanding is starting to get iffy.
allocator_state
?malloc_state_update(RDI_orig, allocator_state)
and malloc_state_update(RDI_mod, allocator_state)
return the same new state?Other questions I have are:
free()
in the binary?I think the easiest way to explain this is to say we're going to make up a new register, ALO
, which holds the allocator state. And ALO
is going to get updated at memory management calls just the same way we update RAX
to account for their return value.
So, for example, you said:
When WP reachs the
malloc2
in the modified binary, two things change:
- We replace all locations of RAX in the postcondition with malloc_result_model(RDI, allocator_state).
- We keep track of our new allocator state which is represented as malloc_state_update(RDI, allocator_state) and add the following constraint, malloc_state_update(RDI, allocator_state) != allocator_state.
This is right, and the way we "keep track of our new allocator state" is to do what you did for RAX: replace all locations of ALO
in the postcondition with malloc_state_update(RDI, ALO)
. I'm not sure we actually need to add the constraint ALO != malloc_state_update(RDI,ALO)
- I think I would try it without that first.
So then when we reach malloc1
in the modified binary, we update RAX
and ALO
again in the same way. I'm a little shaky with what's going on with RDI
here, but for some RDI'
, this will mean we again replace ALO
in the postcondition with malloc_state_update(RDI', ALO)
. And so now if there were any uses of ALO
in the original postcondition, these two substitutions combined have turned them into malloc_state_update(RDI, (malloc_state_update(RDI', ALO))
You asked about how we relate the two programs. The idea here is we have init_ALO_orig
and init_ALO_mod
just like for any other register, and we assume these are equal, and also that init_ALO_orig = ALO_orig0
and init_ALO_mod = ALO_mod0
(or whatever the names end up being).
So, when we look at RAX
after malloc1
in the two programs, one will be something like malloc_result_model(RDI_orig0, ALO_orig0)
and the other something like malloc_result_model(RDI_mod0, ALO_mod0)
. But Z3 will be able to work out that these arguments are equal and therefore the results are equal, because of our standard assumptions about registers.
If that doesn't make sense, let's meet up and work through it in more detail - it will help to work some examples out looking at what precondition we're generating with the current model.
To answer your other questions:
I think it would work fine single program analysis (though I'm not sure it has many benefits over the current model for that case). Probably we can use it to keep things uniform.
My instinct is to add it as an option, but I could be talked out of that if it's going to be a real pain to check whether the option is turned on in many places in the code.
We should treat free
(and basically all libc memory management funtions) the same way. So free
will get an extra argument in the model for the allocator state, and we'll have another uninterpreted function free_state_update
to model how it changes the state. This model isn't detailed enough to capture the idea that free
returns a specific pointer to the allocator state (but we could look into ways to capture some of that detail in the future).
The current default model for
malloc
simply assumes that it generates a fresh result every time it is called.This is inconvenient for comparative analysis, because if the two programs have an "identical" call to malloc, we currently assume each call gives a different result. This causes false positives where we think the programs differ because of this different result value or, more subtly, because the memories differ after writes through those pointers.
What to do? We certainly shouldn't assume
malloc
's result is determined by its arguments, as it is stateful. In the real world,malloc
's result is determined by its arguments and a bunch of allocator state stored in memory. So it would be technically accurate to saymalloc
's result is determined by its arguments and the current state of memory, but this doesn't really work for CBAT because we aren't modeling all that allocator state in memory. Indeed, if the model ofmalloc
doesn't also change memory, this model would result in two consecutive calls tomalloc
returning the same results.In principle we could build an accurate model of libc's state in memory and dutifully track the ways
malloc
changes it. But that's very time consuming (and would need to be redone for each libc implementation) and also probably extremely inefficient for analysis.So, how about this: Model
malloc
with a pair of uninterpreted functions:malloc_result_model
, which is likemalloc
but with an extra integer parameter representing the current allocator state, andmalloc_state_update
, which takes the same arguments and models the change to the heap state whenmalloc
is called.The high level idea is for
malloc_result_model
's extra allocator state parameter to be different at each call tomalloc
within one program, but the same between the two programs in comparative analysis. Of course, we don't actually have a way to "match up" calls tomalloc
in the two programs, but perhaps it's good enough to just assume that the heap state is the same at the start of the two functions being analyzed. The functionmalloc_state_update
returns the new allocator state value for the next call to malloc.I think would be an improvement to our current model. We still keep the idea that each consecutive call to
malloc
returns a different result (because the state parameter changes), but in a way that is deterministic enough so that we can get the same results in comparative analysis. I can, however, see two potential issues, which maybe we can ignore for now:1) This model pretends that the allocator state is totally separate from memory. But of course this allocator state parameter is a model for a whole bunch of data that malloc relies on, which is stored at various different places in memory. Writes to memory don't change the allocator state in our model, but might in the real world. This is a potential source of false NEGATIVES.
2) This model allows the allocator state to change after each call to malloc, but doesn't require the allocator state to be a fresh unique result every time. This is a potential source of false positives, where malloc in this model can return the same value more often than is really the case. On the other hand, if we somehow added the assumption that each change to the allocator state results in a truly unique result, that would also be wrong, and in a way that can cause false negatives.