R-Fuzz / symsan

A LLVM Sanitizer for Symbolic Tracing
Apache License 2.0
207 stars 29 forks source link

Some questions about code modification #21

Closed lz1159435992 closed 4 months ago

lz1159435992 commented 1 year ago

We want to use deep learning methods to speed up the solution of constraints on symsan, so there are some implementation-related issues. 1.What are the differences between the code structure of symsan compiled in the fastgen repository and that in the symsan repository? 2.How to obtain the constraints collected by symsan? What are the related APIs? 3.Can I choose to concretize variables during symbolic execution, and what are the related APIs? Would this violate consistency? 4.If we only focus on the part of the code related to constraints and execution paths during symbolic execution, which part should we specifically pay attention to?

ChengyuSong commented 1 year ago
  1. fastgen repo contains a fuzzing driver, this does not (yet)
  2. there's no API for retrieving constraints, you can check the fgtest code and comments in the code to see how to access. this might be helpful too https://chengyusong.github.io/fuzzing/2020/11/18/kirenenko.html but we don't have time to document thing yet
  3. this is a good question, we don't have the API for concretization. you have two options, either modify __taint_union to permanently concretize during constraint collection, or you can concretize later during solving. I'd recommend the later as concretization won't affect the perf of collection. for consistency, that's depends on your concretization strategy
  4. you need to understand the difference between symbolic and concolic execution. if you want to understand the logic, check the TaintPass.cpp
lz1159435992 commented 1 year ago

Thank you very much for your response. It looks like I'll need to spend more time reviewing the content mentioned in your answers. I have two more questions after reading the source code:

  1. The documentation only provides methods for testing using 'lit tests.' If I want to perform symbolic execution on a cpp file using Symsan separately, how should I proceed?
  2. I noticed that there are identical methods, such as 'serialize(),' in 'z3.cpp' and 'fgtest.cpp.' What is the relationship between these two files?
mingjun97 commented 1 year ago
  1. It depends on how your cpp program is built. Simply specify the CC to the built ko-clang would invoke symsan to build your program and the compiled binaries would be the instrumented binary for concolic exectuion.
  2. Symsan decouples the taint tracking (constraint collection) and solving. These two files are actually two different solver implementations, z3.cpp is an in-process solver that invokes z3 in the same process and in a synchronizing manner while fgtset.cpp is an out-of-process solver that anticipated solving constraints outsides the instruments binary thread. In other words, you will only have one of z3.cpp or fgtest.cpp brewed into your binaries at one time.
lz1159435992 commented 1 year ago

I have a new question after obtaining constraints in the form ofz3::expr from the cache_expr method inz3.cpp: if can I associate the constraints represented in the form of z3 expressions obtained from z3.cpp with the source code? For example, within this branch, I can obtainz3::expr expressions, but I can't establish a correspondence between these expressions and the specific portions of the source code. What I need is this kind of associational information. Additionally, since the output of our deep learning method corresponds to variable names in the source code, we also need such associational information to map them to the corresponding variables in the z3 expressions.

72302c5cd57522a2bbcb3e3724dd53c
ChengyuSong commented 1 year ago

in the callback functions like __taint_trace_cond you can use the return address void *addr = __builtin_return_address(0); to map the branch back to the source code