eurecom-s3 / symcc

SymCC: efficient compiler-based symbolic execution
http://www.s3.eurecom.fr/tools/symbolic_execution/symcc.html
GNU General Public License v3.0
776 stars 135 forks source link

LD_PRELOAD support? #6

Closed adrianherrera closed 4 years ago

adrianherrera commented 4 years ago

Hey SymCC devs!

I have a question; hopefully it makes sense.

I tried out SymCC on libxml + xmllint. However, I couldn't get a symbolic input file because libxml doesn't call open/fopen directly; rather, it uses a libz wrapper around open/fopen instead. I assume that this means that I need to go and recompile libz with SymCC.

However, is it possible to compile the libc wrappers as a shared object that can intercept libc functions (when loaded with LD_PRELOAD), which will capture the above use-case without needing to recompile libz? Or will this not work, because libz still needs to be instrumented to propagate the symbolic data back to libxml?

sebastianpoeplau commented 4 years ago

Hi @adrianherrera!

The easiest way is indeed to compile libz with SymCC.

An LD_PRELOAD interceptor certainly makes sense, and we experimented with it in the initial design. However, there is one significant challenge: our symbolic analysis runs within the same process, and it needs access to the unmodified libc functions. For example, when the QSYM backend writes new test cases, we want it to call regular open/fopen. (It would still have been possible to call the original functions, but calls to things like malloc are often deeply hidden inside library code, and we didn't want to go change them all.) Therefore, the libc wrappers have different names in the current implementation (e.g., the wrapper for open is called open_symbolized), and the compiler rewrites calls to those functions. This way, we have the original functions available for our own code, whereas analyzed code always calls the wrappers.

Additionally, you're right that symbolic data would most likely get lost inside (uninstrumented) libz even if the libc wrapper was in place because we wouldn't see computations that the library might perform before exposing the data to your application.

Hope that helps!

adrianherrera commented 4 years ago

Thanks for the response @sebastianpoeplau!

Yeah, I saw the code for the function rewrites, which is why I guessed I had to go and recompile libz with symcc. I )think_ that you could use static variables as flags with LD_PRELOAD, but maybe I just haven't thought through the implications of this properly.

Ok cool, I will go and recompile libz with symcc and see how much of a rabbit hole this becomes :)

Thanks again for the response (and the awesome tool + paper!).