HolmesProcessing / Holmes-Totem

Investigation Planner for fast running analysis with predictable execution time. For example, static analysis.
Apache License 2.0

Cfg angr service #167

Closed gqinami closed 7 years ago

gqinami commented 7 years ago

This service uses the Angr Python framework to generate the control flow graph (CFG) of a binary in JSON format. It also uses the angr-utils utilities for angr in order to include more information on the graph nodes.

The Dockerfile deviates from the template only in the base image. I could not resolve the dependencies for angr with the Alpine image; specifically, I had problems with libVEX, so I used the python:2.7 image instead.
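
For reference, a minimal sketch of the angr side of this (illustrative only; the function name and the chosen CFG analysis are assumptions, not the actual service code):

    # Illustrative sketch, not the service's code: load a binary with angr
    # and recover its static CFG.
    import angr

    def build_cfg(path):
        # auto_load_libs=False keeps the analysis to the binary itself
        project = angr.Project(path, auto_load_libs=False)
        # CFGFast is one of angr's static CFG analyses; the service may use
        # a different one.
        return project.analyses.CFGFast()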

webstergd commented 7 years ago

Thanks for the contribution. I will have someone test it shortly. It looks pretty solid, though; I only have a few comments.

Also, no big deal on the alpine part. Alpine is awesome if you can use it. Makes loading fast and the container small. But if it doesn't work, it doesn't work.

Remaining questions: 1) What is the size of your output for an object that is around 2 MB? 2) What is the size of that output after gzip compression? 3) How long does this take to run on average?

gqinami commented 7 years ago

@webstergd Thanks for the comments; I will fix them in my next commit today.

I could not reply to your comment inline, so I am writing my answers here. As a general note: now that I have measured the output size, I see there might be an issue with my current approach. I am using angr-utils to output the CFG so that it carries all the information I want, but angr-utils also includes a lot of extra information for pretty printing that we will not need (e.g. many HTML tags for formatting the output into tables). The other option is to take the CFG directly from angr, but then the nodes contain only the basic block memory address, which is not enough for our purpose. What I suggest is that you hold off on testing so that I can check one more time whether I can get the data I want in a more condensed form. To answer your questions:

  1. The output for a binary of around 2 MB would be about 30 MB.
  2. After gzip compression, that output is around 3 MB.
  3. For a 16.4 KB binary, it takes around 5 seconds to run.

Is it OK for you to wait on the testing, or should I close this PR and create a new one this afternoon?

gqinami commented 7 years ago

Update:

I changed the approach for generating the CFG slightly. Instead of using the angr-utils library, I now use only the angr library and get all the data from there. It is a slightly more manual approach, but it lets me keep only the information that is needed in the service's output (see the sketch after the numbers below). Here are some numbers for two binaries I tried it with:

  1. 16.3 KB binary:

    • Output size: 137.5 KB
    • Size after gzip: 14.1 KB
    • Run time (on my laptop): 0.5 sec
  2. 3.5 MB binary:

    • Output size: 26.4 MB
    • Size after gzip: 3.5 MB
    • Run time (on my laptop): 103 sec
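
To make the "more manual" part concrete, here is an illustrative sketch of keeping only a trimmed set of node and edge data from the angr CFG and serializing it as JSON (the attribute selection is an assumption, not the exact fields the service emits):

    # Illustrative only: serialize a trimmed-down view of an angr CFG.
    import json

    def cfg_to_json(cfg):
        # cfg.graph is a networkx DiGraph whose nodes are angr CFGNode objects
        nodes = [{"addr": n.addr, "name": n.name, "size": n.size}
                 for n in cfg.graph.nodes()]
        edges = [{"src": src.addr, "dst": dst.addr}
                 for src, dst in cfg.graph.edges()]
        return json.dumps({"nodes": nodes, "edges": edges})
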
webstergd commented 7 years ago

@Ma-Shell is going to do a code review shortly.

I have two concerns:

  1. The time it takes to run is fairly high. For now, it is probably fine as a totem service. Eventually we will need to look at how to speed it up or convert it to a totem-dynamic service. I know you are doing this as part of your thesis, so stick with totem for now and treat this as a future project note.
  2. The output size is a little large. 3.5 MB will probably work; however, results after gzip compression should ideally be under 2 MB. Can you trim any more fat off those results? Mind pastebin'ing a sample result?
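
For context, a quick illustrative way to measure a result's size after gzip compression (not part of the service; the file name is hypothetical):

    # Illustrative helper: compress a JSON result and report the gzipped size.
    import gzip
    import os

    def gzipped_size(path):
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            dst.write(src.read())
        return os.path.getsize(gz_path)

    # e.g. gzipped_size("cfg_result.json") returns the size in bytes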

cynexit commented 7 years ago

Error during startup, please take a look at it @gqinami :

mainframe ../totem/services/cfgangr ✔ docker run -p 8080:8080 -v /tmp:/tmp:ro totem_angr                                                                                                [2:53]
WARNING | 2017-08-03 00:55:04,351 | claripy | Claripy is setting the recursion limit to 15000. If Python segfaults, I am sorry.
Traceback (most recent call last):
  File "cfgangr.py", line 101, in <module>
    main()
  File "cfgangr.py", line 94, in main
    server.listen(Config["settings"]["port"])
KeyError: 'settings'
mainframe ../totem/services/cfgangr ✗ 
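
For reference, the KeyError suggests the loaded Config is missing its settings block. A minimal sketch of the startup path the traceback implies (inferred only from the traceback; the config file name, its shape, and the Tornado usage are assumptions, not the actual cfgangr.py):

    # Illustrative sketch of the startup path implied by the traceback above.
    import json

    import tornado.ioloop
    import tornado.web

    def load_config(path="./service.conf"):  # config file name is an assumption
        with open(path) as f:
            config = json.load(f)
        # The traceback dereferences Config["settings"]["port"], so fail early
        # with a clearer message when that block is missing.
        if "port" not in config.get("settings", {}):
            raise RuntimeError("config must define settings.port")
        return config

    def main():
        config = load_config()
        app = tornado.web.Application([])  # handlers omitted in this sketch
        app.listen(config["settings"]["port"])
        tornado.ioloop.IOLoop.current().start()
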
gqinami commented 7 years ago

@webstergd Thanks for your comments and sorry for the late reply. I think I could remove a couple of parameters from each node, but first I would like to confirm it with my thesis supervisor as well. I will try to meet him today, and this afternoon I can make the necessary changes. Anyway, for now, here is a pastebin of one CFG: https://pastebin.com/QMZUPSxG

@cynexit I committed a fix for the error. Hope it works now.

webstergd commented 7 years ago

Yeah, you don't have much fat there.

webstergd commented 7 years ago

The only option would be to cap the size of the graph. But 3.5 MB isn't huge; we can probably try it out and see how it works.

BTW, the issue with size is that the information needs to be delivered to the processing nodes for every element you are studying. Network I/O is expensive, so if you are studying 1 million objects, 1 MB per result adds up pretty fast. Additionally, the backends move the data around between the distributed nodes. Cassandra seems to handle anything up to 2 MB with little trouble; odd behavior starts to appear when the result size hits about 5 MB. At least that is what we have observed.

gqinami commented 7 years ago

Hi @webstergd!

I had a discussion with my thesis advisor as well, and I managed to remove another attribute from the nodes and one attribute from the edges. Other than that, I do not think I can remove any additional information, since we will need it. Regarding capping the size of the graph: if possible, I think we should first try the current approach, and if things do not work out, I can cap the size of the graph.

Currently, for a 3.5 MB binary, the gzipped JSON is 3.3 MB.

I do not plan any further changes to the code, so I will wait for your side to test it and merge it if everything is fine, or let me know if there are any errors.

Thanks

webstergd commented 7 years ago

Sweet, let's give it a go and see how it works. Good luck on your thesis!

cynexit commented 7 years ago

Alright, I did a test run of the service. It now boots up correctly and seems to accept requests (🎉)! But I never actually got a valid result back. I tried various PE files from our malware repos and always received

{"error": "Traceback (most recent call last):\n  File \"cfgangr.py\", line 46, in get\n    data = CFGAngrRun(fullPath)\n  File \"cfgangr.py\", line 37, in CFGAngrRun\n    data = convertbinary.generateCFG(binary)\n  File \"/service/convertbinary.py\", line 10, in generateCFG\n    project = angr.Project(binary, auto_load_libs = False)\n  File \"/usr/local/lib/python2.7/site-packages/angr/project.py\", line 146, in __init__\n    raise Exception(\"Not a valid binary file: %s\" % repr(thing))\nException: Not a valid binary file: (u'/tmp/xxxxx', u'xxxxx')\n"}

What files are expected, exactly? Can you provide a sample binary that produces the output you're looking for, @gqinami?

gqinami commented 7 years ago

Hi @cynexit,

According to the angr documentation, it should support "ELF, PE, CGC and ELF core dump files, as well as loading binaries with IDA and loading files into a flat address space" (https://docs.angr.io/docs/loading.html).

I did my tests with two binaries, and it worked fine. I will upload one of them here (zipped). I will also look again on my end into what the possible problem could be.

gcc_coreutils_64_O0_make-prime-list.zip

cynexit commented 7 years ago

Thanks for the fix; it works now and the results look good.