eth-sri / debin

Machine Learning to Deobfuscate Binaries
Apache License 2.0

Preparing a binary for training #3

Closed urialon closed 5 years ago

urialon commented 5 years ago

Hi again, I have another question that I couldn't answer from the README: what should we do to prepare a binary for training? Assume that I have a binary that was compiled with debug symbols. To train on this binary, I need two versions of the file: one in --bin_dir examples/stripped/ and one in --debug_dir examples/debug/.

What should I run on the binary to create each of the two versions? For example, I noticed that the version in `examples/stripped/` is not completely stripped.

Thanks!

urialon commented 5 years ago

Hi, I managed to get it to work by running objcopy --only-keep-debug a.out a.dbg to get the debug info file, and strip -s a.out -o a.stripped to get the stripped file. Is this the correct usage?

LostBenjamin commented 5 years ago

First, the two binaries in a pair should have the same file name but be placed in different folders, so you end up with examples/stripped/a and examples/debug/a. In examples/bin_list.txt, there should be a line containing a to refer to the pair.

Second, you should keep the symbol table in the "stripped" version. The symbol table contains the scope and name of each function. When training, or when evaluating prediction accuracy with py/evaluate.py, we assume that function scope is known for every binary. However, the golden function names in the symbol table are not used as extra information for prediction. Your command strip -s a.out -o a.stripped removes the symbol table, so function scope has to be inferred by BAP, which may be imprecise. As a result, training sample labelling and accuracy measurement may be wrong.
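
To check whether a "stripped" binary still carries its symbol table, one option is to list its section headers. The following is a minimal sketch, assuming GNU binutils (readelf) is installed. The helper names has_section and check_stripped are made up for illustration and are not part of debin, and treating an absent .debug_info section as "debug info removed" is an assumption.

```shell
# Hypothetical helpers, not part of debin. Assumes GNU binutils' readelf.

# Succeeds if ELF file $1 has a section named exactly $2.
has_section() {
    readelf -S -W "$1" | awk -v name="$2" '
        { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }'
}

# Succeeds if $1 keeps .symtab but has no .debug_info section.
check_stripped() {
    has_section "$1" .symtab && ! has_section "$1" .debug_info
}
```

For example, `check_stripped examples/stripped/a && echo "symbol table kept"` would be expected to fail for a file produced with strip -s, since that command removes .symtab.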

urialon commented 5 years ago

Thanks for your quick reply. Regarding file names: yes, sure, I did it as in the example, as you said.

  1. Regarding symbol table - so what are the correct command lines?

  2. Can you elaborate on function names? Why are they not used in evaluation and how is it relevant for the scope?

Thanks!

LostBenjamin commented 5 years ago

  1. You should run strip -g a.out -o a.stripped.

  2. Sorry, what I wrote about this part was misleading. What I meant is that, when training or evaluating accuracy with py/evaluate.py, we only assume that function scope is known; the golden function names in the symbol table are not used as input for prediction. We do, of course, compare the predicted function names with the golden function names to calculate accuracy. (I also edited my earlier comment so that it is no longer misleading.)
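
Putting the commands from this thread together, preparing one training pair might look like the sketch below. It assumes GNU binutils (objcopy, strip) and the examples/ layout described above. The function name prepare_pair is hypothetical, and using the objcopy --only-keep-debug output as the file in examples/debug/ follows the earlier comment in this thread rather than any confirmed README instruction.

```shell
# Sketch only: prepare_pair is a hypothetical helper, not part of debin.
# Usage: prepare_pair <binary built with debug symbols> <name>
prepare_pair() {
    src=$1
    name=$2
    mkdir -p examples/debug examples/stripped
    # Debug information file, as in the objcopy command from this thread.
    objcopy --only-keep-debug "$src" "examples/debug/$name"
    # Remove debug sections but keep the symbol table (.symtab).
    strip -g "$src" -o "examples/stripped/$name"
    # One line per pair in the list file; skip names already listed.
    grep -qxF "$name" examples/bin_list.txt 2>/dev/null ||
        echo "$name" >> examples/bin_list.txt
}
```

Calling `prepare_pair a.out a` from the repository root would then produce examples/debug/a, examples/stripped/a, and a line a in examples/bin_list.txt.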

urialon commented 5 years ago

Thanks!

urialon commented 5 years ago

Maybe you guys would want to add the objcopy and strip -g instructions to the README, for future reference. Thanks again

LostBenjamin commented 5 years ago

Yes, we will add those. Thanks for the suggestion.

mingfure commented 5 years ago

Hi, following the content of Linux Symbol Packages, I obtained non-stripped binaries in the correct format. However, the corresponding stripped binaries from /usr/bin or /bin have no .symtab. How can I extract the .symtab and add it to the stripped binaries?

LostBenjamin commented 5 years ago

Hi,

You can use the ELFIO library to read those sections from the debug information and add them to the stripped binaries.

Best, Jingxuan
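
As a possible command-line alternative to writing ELFIO code, the sketch below swaps in a different tool: eu-unstrip from elfutils, which recombines a stripped binary with its separate debug file. Running strip -g on the result then drops the debug sections again while keeping .symtab. This is not the approach suggested above and has not been confirmed to work with debin. The function name restore_symtab is hypothetical, and the sketch assumes the debug file really belongs to the given stripped binary.

```shell
# Sketch only: restore_symtab is a hypothetical helper, not part of debin.
# Assumes elfutils (eu-unstrip) and GNU binutils (strip) are installed.
# Usage: restore_symtab <stripped binary> <separate debug file> <output>
restore_symtab() {
    stripped=$1
    debug=$2
    out=$3
    tmpdir=$(mktemp -d) || return 1
    # Recombine the stripped binary with its separate debug file.
    if eu-unstrip -o "$tmpdir/merged" "$stripped" "$debug"; then
        # Drop the debug sections again, keeping .symtab.
        strip -g "$tmpdir/merged" -o "$out"
        status=$?
    else
        status=1
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

The output file could then be placed in examples/stripped/ under the same name as its debug counterpart.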