Closed · b-paul closed this 2 months ago
To clarify the approximation in the taint analysis: this runs the taint analysis separately for each input variable being tainted, and treats taint as a fixed gamma value, so that taint-flow is security-level information-flow. It over-approximates towards high (where?), so for each output variable the ensures clause says its gamma is either low, or equal to the gammas of the input variables which taint it, i.e. possibly high.
Yeah, the initial implementation would perform a taint analysis for each input/global variable, where taint is the variable's gamma at the start of the procedure. Doing this over every variable tells us precisely which input gammas affect the gamma of each output variable. The current implementation does this more efficiently though, instead working over a powerset lattice to determine the set of input variables which taint an output variable, in a single analysis.
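As a rough illustration of what that single powerset-lattice pass computes (a made-up example with placeholder names, not one of the tests):

```c
/* Globals acting as outputs of h. A single pass computes, for each output,
   the set of inputs that may taint it:
     taint(out1) = {a, b}  (Gamma_out1 depends on Gamma_a and Gamma_b)
     taint(out2) = {c}     (Gamma_out2 depends only on Gamma_c)       */
int out1, out2;

void h(int a, int b, int c) {
    out1 = a + b;
    out2 = c;
}
```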
I'm not 100% sure what you mean by over-approximates to high. Are you referring to how the results can miss cases where a gamma is definitely high depending on the input? For example, `int f(int x, int y) { return x + y; }` and `int f(int x, int y) { return c ? x : y; }` generate the same ensures clause, however the first one could be saying `ensures Gamma_out == (Gamma_x && Gamma_y)`.
This change improves the performance of variable dependency analysis by breaking the call graph into its strongly connected components using Tarjan's algorithm, and analyzing those individually. We get a reverse topological sort of the components from Tarjan's algorithm, which allows us to analyze our procedures in an order such that we can get "summaries" of the analysis for all procedures called by a particular procedure, which significantly reduces duplicated work. (Unfortunately the cntlm-new binary has a really big strongly connected component in it which makes this analysis still infeasible to run on the whole binary)
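A small, hypothetical example of the call-graph structure this exploits (not taken from the benchmarks): `odd` and `even` call each other, so they form a single strongly connected component and have to be analysed together, while `main` sits in its own component; the reverse topological order means `main` is analysed last and can simply reuse their summaries.

```c
int even(int n);

/* odd and even are mutually recursive: one SCC in the call graph. */
int odd(int n)  { return n == 0 ? 0 : even(n - 1); }
int even(int n) { return n == 0 ? 1 : odd(n - 1); }

/* main is its own SCC; in reverse topological order it is analysed after
   {odd, even}, so their summaries are already available. */
int main(void) { return even(10); }
```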
(Credit to @j-tobler) Ensures clauses might be able to be rewritten as $(\bigwedge_{v \in \text{taint}(out)} \Gamma_v) \Rightarrow \Gamma_{out}$, i.e. the gamma of the output variable is no higher than the join of the gammas of its tainters.
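For instance, for the `a`/`b` example in the PR description below (reading a gamma of true as low), $\text{taint}(out) = \{a, b\}$, so instead of `Gamma_out == Gamma_a || Gamma_out == Gamma_b || Gamma_out == true` the summary would be the single implication

$$(\Gamma_a \wedge \Gamma_b) \Rightarrow \Gamma_{out}.$$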
There seems to be room for optimisation here in relation to which variables this generates summaries for. In some of the tests, it will generate summaries for global variables even when the procedure does not modify memory at all, or only modifies memory because it calls rely (which just says that it does not modify memory).
I'm a bit confused about why it even generates summaries for the globals x and y in the get_two procedure in the function_summary test. get_two doesn't at all access those globals, or even memory at all, so why does the taint analysis say they're tainted?
The summaries for function_summary3 seem incorrect. For f we get:
ensures gamma_load32(Gamma_mem, $x_addr);
ensures Gamma_R1;
ensures Gamma_R0;
and for g we get:
ensures (gamma_load32(Gamma_mem, $x_addr) || (gamma_load32(Gamma_mem, $x_addr) == old(gamma_load32(Gamma_mem, $x_addr))));
ensures ((Gamma_R0 || (Gamma_R0 == old(Gamma_R1))) || (Gamma_R0 == old(Gamma_R0)));
Neither of these procedures modifies memory at all, so I'm unclear on why summaries for the global variable x are generated.
For f, how is it that R1 and R0 are ensured to be low? Shouldn't R1 and R0 both have security levels that are dependent on the input security level of R0?
These issues may be related to this example using a non-returning call, which is not really properly supported at present.
> I'm a bit confused about why it even generates summaries for the globals x and y in the get_two procedure in the function_summary test. get_two doesn't at all access those globals, or even memory at all, so why does the taint analysis say they're tainted?
You're right in that get_two shouldn't need to say anything about x and y. The reason it mentions those is because x and y are initially marked as tainting themselves, so at the end of the procedure they still taint themselves. It shouldn't be too hard to add a check for modifying memory (hopefully) though.
> For f, how is it that R1 and R0 are ensured to be low? Shouldn't R1 and R0 both have security levels that are dependent on the input security level of R0?
That is concerning; it certainly shouldn't be doing that :grimacing: I'm confused by how that even passed the test, to be perfectly honest... My first guess was that it is something to do with the summaries generated by the strongly connected components, or something like that.
I'll hopefully find time to work on this soon.
The test passes because the 'assume false' (generated by the non-returning call to g() in f()) allows for any post condition of f() to be verified. That's part of what I mean by non-returning calls not being properly supported at the moment.
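Roughly: under weakest preconditions, $\mathrm{wp}(\texttt{assume}\ e,\ Q) = e \Rightarrow Q$, so at the non-returning call site

$$\mathrm{wp}(\texttt{assume false},\ Q) = (\text{false} \Rightarrow Q) = \text{true}$$

for any postcondition $Q$, which is why even the suspicious-looking ensures clauses on f verify.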
> The reason it mentions those is because x and y are initially marked as tainting themselves, so at the end of the procedure they still taint themselves. It shouldn't be too hard to add a check for modifying memory (hopefully) though.
Yes, it seems important to handle the case where a variable is not actually modified at all. It is probably difficult to do this robustly for now, but at the very least it should be trivial to check whether memory is modified at all, and the memory region analysis should allow this to be improved in the future.
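A made-up pair of functions to illustrate the distinction (placeholder names, not the actual tests):

```c
int x;

/* No stores to memory at all: no ensures clauses about x's gamma should be
   needed here, since x is untouched. */
int pure_fn(int a) { return a + 1; }

/* Stores to x: here an ensures clause relating x's gamma to the inputs that
   taint it is genuinely useful. */
void set_x(int a) { x = a; }
```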
I'm just going to clean this up and merge it in, putting it behind a flag for now.
Generates ensures clauses about the gammas of output variables in procedures using a taint analysis (ish). As an example, for a function like the one sketched below, we know that the return value could be affected by `a` or `b`, so we say `ensures Gamma_out == Gamma_a || Gamma_out == Gamma_b || Gamma_out == true`.

This analysis does not currently scale to run on the `cntlm-new` binary. This is probably because of the time complexity of the IDE solver.
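A function of roughly this shape would produce a summary like that (purely illustrative; `f`, `a` and `b` are placeholder names, not taken from the actual tests):

```c
/* The returned value is either a, b, or a constant, so the gamma of the
   output is either Gamma_a, Gamma_b, or true (low). */
int f(int a, int b) {
    if (a > 0) { return a; }
    if (b > 0) { return b; }
    return 0;
}
```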