Closed steseb closed 4 years ago
The thing you're trying to do is incredibly weird, and I'm not particularly sure what your usecase is, but the only advice I can think to offer is that that code will only get triggered when there's a constraint solve. In malware samples, there could be antianalysis traps, such as the use of exception handling for control flow (angr supports it but it's very complicated and not on by default) which would prevent you from ever reaching a state where there are real constraints, as opposed to trivially true/false, generated.
Thanks a lot for your advice @rhelmot! Could you please provide me a pointer for turning on this angr
feature?
I totally agree with you that if there is no constraint the solver will not be called, but I do not expect that this will be the general scenario for a malware dataset of over 2000 samples.
The idea is to extract a set of constraints in smt2 format and then compare the performance of different solvers. What I do not expect is the behaviour of angr
re-running it several times on the same binary: (1) it could happen that sometimes the smt2 file is generated while other times not? (2) And still with the same malware binary, __batch_eval()
could print out different smt2 files? How do you explain this "unstable" behaviour of angr
?
And in general, is __batch_eval()
the right point to catch all the calls to the constraint solver even if the solver will end up in a timeout, exception, etc?
Turning on exception handling is as easy as adding the STRICT_PAGE_ACCESS
and EXCEPTION_HANDLING
state options. The actual implementation of what happens on an exception is only implemented for windows, where we have a bare minimum implementation of the SEH interface, but it worked well when I was doing malware analysis. Look at the windows SimOS if you want to see how to implement it for some other platform.
I disagree with your assessment that it would not be the general scenario for malware - antianalysis is extremely common.
Generally, "nondeterministic" behavior in angr comes from hash nondeterminism, where the hash values of objects can change based on the python implement or possibly hash randomization being enabled, so when we iterate over dictionaries or sets we get an unknown order of events. I don't think that would cause the issues you're describing though. If you give me a sample and a script I can look into it for you.
__batch_eval
seems like a good place to put this instrumentation. @zardus would know the best answer.
A lot of stuff happens between sim_manager.run()
and BackendZ3._batch_eval
. This can include (but isn't limited to):
You can mitigate the first issue by removing COMPOSITE_SOLVER
from state options (using the remove_options
argument, or by also calling state.solver.reload_solver()
after modifying state.options
). This will at least keep the constraints together.
You can mitigate the second issue by adding CACHELESS_SOLVER
to state options (using the add_options
argument, or by also calling state.solver.reload_solver()
after modifying state.options
).
There's nothing that can be done with the third issue --- angr can do weird stuff with the constraints any time it wants, and as batch_eval
is basically called for any solve it does, for whatever purposes, you can't expect the constraints to be exactly those constraints on your state.
If you want constraints that make sense for the states involved, I would recommend creating a ConstraintDumper
exploration technique that dumps constraints at every step. You can use backend_z3.convert
to convert the full set of constraints (state.solver.constraints
) to Z3 and dump them to SMT, for each state at every step, independent of the issues above and whatever optimizations and weirdness angr and claripy do during the solving process.
The last thing was something I had considered doing, but I though it wouldn't satisfy the "must work even in the case of timeouts or errors", HOWEVER, I just realized you could probably get around this by adding the LAZY_SOLVES
state option - then you'll be able to see constraints at every step regardless of if the constraints are unsat or if they'd timeout the solver.
Ahhh... Here's what I would do, then, @steseb: set a siminspect breakpoint (when=BP_AFTER
) on the constraints
action, turn on LAZY_SOLVES
, and dump the smtlib file in your breakpoint. LAZY_SOLVES
will prevent the sat-check (and possible timeout) if the constraint is an exit guard, and you'll be able to dump all the state.solver.constraints
every time that a new constraint is added.
There's still one case that this misses: constraints used during memory concretization before they're added to the state. If a timeout or error occurs there, you'll still miss it...
Thanks for your precious suggestion @rhelmot and @zardus! Unfortunately, I can not share the malware samples since are under the NDA. But I will perform more tests following your advices and let you know if my issue persists.
I don't know how to help with this. I guess just try to look deeper and deeper into the angr stepping process (as explained here) and see where things start happening differently in different executions?
@steseb Did you ever make headway with this? I too could benefit from this feature.
The specific title of this issue had already been solved, and has since been even more solved: you can say from claripy.backends.backend_z3 import claripy_solver_to_smt2
and use that method with a claripy solver object to dump its constraints as an smt2 testcase. The thing being deliberated in this issue is a weird philosophical question about at what point in analysis you should do this in order to recover constraints, which honestly I don't particularly care to deliberate further. There are many answers on when to do it. They are all right.
@rhelmot I've got a proj.factory.simgr over a proj.factory.full_init_state like so:
proj = angr.Project(sys.argv[1])
state = proj.factory.full_init_state()
# map some state.mem[...] to things
simgr = proj.factory.simgr(state)
simgr.run()
exits = simgr.deadended
# map some exits[0].mem[...] to things
Can I apply claripy_solver_to_smt2
to state.solver
? Doesn't seem to work out of the box due to frontend vs backend.
state.solver
is a state plugin, not a claripy solver. It holds a claripy solver as state.solver._solver
.
state.solver._solver
appears to be a SolverComposite. Calling claripy_solver_to_smt2(state.solver._solver)
gives AttributeError: 'SolverComposite' object has no attribute '_get_solver'
.
I am trying to extract all the constraints passed to the z3 solver even in case of timeout, exception, etc. I thus considered the
backend_z3
and in particular the__batch_eval()
function. I want the output in a smt2 syntax so that I can then process the extracted constraints with the standalone version of z3.I thus added the following code in
__batch_eval()
just before the end of the function.I tried it with a set of binaries and it works pretty well. The extracted smt2 files can be easily used with the standalone version of z3.
Unfortunately, extending my test with a large dataset of real malware, I found that I do not always have a consistent behaviour. A large number of binary (about 75%) do not produce any smt2 file. In other cases, when the smt2 file is generated, re-running angr on the same binary: I could have no file or the produced smt2 files differ significantly.