CUMLSec / XDA


How to get instruction boundaries and How to evaluate function boundaries #5

Open 01FreshMan opened 3 years ago

01FreshMan commented 3 years ago

Hi,

Thanks for your excellent work. It is quite inspiring.

However, I ran into two questions while playing with XDA.

peikexin9 commented 3 years ago

Hi @01FreshMan, thanks for your kind words :-)

(1) I have not uploaded the finetuned model for instruction boundary yet, but I just uploaded the raw instruction boundary dataset (data-raw/inst_bound). It can be preprocessed very similarly to the function boundary data and finetuned the same way. I will try to upload the finetuned model for instruction boundary and prepare a script to query it directly soon.
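
To make that concrete, per-byte instruction boundary labels can be derived from a reference disassembly. A minimal sketch of the idea (my own illustration using capstone, not the repo's actual preprocessing script):

# Minimal sketch (illustrative, not XDA's preprocessing): mark each byte
# of a code region with 1 if it begins an instruction, 0 otherwise,
# using a linear disassembly from capstone as the reference.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def inst_boundary_labels(code, base=0):
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    labels = [0] * len(code)
    for insn in md.disasm(code, base):
        labels[insn.address - base] = 1  # first byte of each instruction
    return labels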

(2) Similar to the bidirectional RNN paper, we evaluate function boundary prediction at the byte level: we predict a label for each byte, compare it against the ground truth, and report the F1 score as the metric. You can think of each byte within a function as a single sample, with F1 computed over boundary vs. non-boundary predictions. After predicting the F/R labels, we run a simple matching script that linearly scans the bytes of .text and pairs consecutive starts and ends to form the actual function boundaries.
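
For illustration, that matching step can be as simple as the following sketch (hypothetical code, assuming per-byte labels F for a function start byte, R for a function end byte, and anything else for neither):

# Hypothetical matching script: linearly scan the predicted per-byte
# labels and pair each function start (F) with the next end (R).
def pair_boundaries(labels):
    funcs, start = [], None
    for i, lab in enumerate(labels):
        if lab == 'F' and start is None:
            start = i
        elif lab == 'R' and start is not None:
            funcs.append((start, i))  # inclusive byte offsets of one function
            start = None
    return funcs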

01FreshMan commented 3 years ago

Hi @peikexin9,

Thanks for your prompt reply! I understand how to analyze the instruction boundaries now. It sounds reasonable to me.

However, I am still kinda confused about the function boundaries.

Specifically, I am not sure how you mark the R label. Consider the following code:

.type main, @function
main:
    sub rsp, 0x80
    jmp .L1

.LX:
    <this basic block belongs to another function>

.L1:
    add rsp, 0x80
    ret

Will you mark jmp .L1 as a function end? It looks unreasonable whether we mark it or not.

Note that although this is not a common case for GCC/Clang, it is still possible in real-world binaries. Additionally, the case of cold blocks (the much more common case I mentioned before) has the same issue.

jonasguan commented 3 years ago

Hey @01FreshMan, I'm a co-author of the paper. This is a great question. The situation you described also occurs quite frequently in MSVC binaries with the use of thunks/trampolines. The straightforward answer is that XDA, like all other static analysis tools, cannot accurately detect the boundaries of functions that are not contiguous (imagine the pathological case of a basic block ending with an indirect jump: the function could continue anywhere; it's impossible to know for sure without executing the code).

The other part of your question is how we evaluate our model on such cases. There is unfortunately no agreed-upon rule among tools on how to label these cases; some tools, like IDA Pro, would count that jmp .L1 as a function end and .L1 as a separate function; others, like ByteWeight, would not count the first basic block as a function. We follow the latter approach in our evaluation. The binaries in our dataset are not compiled with function splitting, but when thunks/trampolines are used, we do not count them as functions in our ground truth.

ZhangZhuoSJTU commented 3 years ago

@01FreshMan Great question, and thanks for setting up the discussion! Another problem in detecting function boundaries is that some functions may share code.

@jonasguan @peikexin9 I am more curious about why people care about function starts/ends. Detecting function starts/ends seems not that helpful for either reverse engineering or vulnerability detection. As far as I am aware, few high-level representations care about the memory a given function occupies.

For me, detecting function entrypoints would feel more meaningful. Given the entrypoint, people can recover the CFG of the given function (it may miss some blocks/edges due to the undecidability issues above, but simply detecting function boundaries would fail there too). Additionally, the entrypoint helps a lot when constructing IR. Would you mind sharing some ideas about this?

01FreshMan commented 3 years ago

@jonasguan Thanks for your answer. However, it leads me to more questions.

Q1: Is your evaluation fair to IDA Pro/Ghidra?

> There is unfortunately no agreed-upon rule among tools on how to label these cases; some tools, like IDA Pro, would count that jmp .L1 as a function end and .L1 as a separate function; others, like ByteWeight, would not count the first basic block as a function. We follow the latter approach in our evaluation. The binaries in our dataset are not compiled with function splitting, but when thunks/trampolines are used, we do not count them as functions in our ground truth.

As you said, you do not count those separated blocks (e.g., cold blocks, thunks, trampolines) as functions in the ground truth, but IDA Pro will. Does that mean those separated blocks will always be marked as false positives for IDA Pro?

Or will you post-process IDA Pro's results and remove its F/R labels on separated blocks? To tell the truth, I do not think there is a safe way to eliminate such an inconsistency.

Hence, I am quite worried about your evaluation results for IDA Pro and Ghidra.

Q2: Why is your dataset not compiled with function splitting?

> We follow the latter approach in our evaluation.

In the paper, you mention that the dataset is compiled by GCC with -O0 to -O3. As far as I know, -fpartial-inlining has been enabled since -O2, and it introduces function splitting (see this paper).

So I am confused about why you said your dataset is not compiled with function splitting.

Besides, clang/LLVM is known to perform powerful inter-procedural analyses and optimizations, which introduce aggressive inter-procedural block reordering. Is that the major reason why you do not compile the dataset with clang?

Q3: How do you get the ground truth of function boundaries?

> The straightforward answer is that XDA, like all other static analysis tools, cannot accurately detect the boundaries of functions that are not contiguous.

As you said, there is no accurate approach to detecting function boundaries (due to the undecidable nature of the problem), even with debug information. That makes me much more curious about the accuracy of your ground truth.

So I am wondering: how do you get the ground truth?

Did you simply read the symbol table and collect the size of each function symbol? That is a very inaccurate approach. For example, in GCC-compiled binaries, each cold block has a fake function symbol, incurring tons of false positives in the ground truth. Once the ground truth is inaccurate, the evaluation results become hard to interpret.

Or do you have some other, more advanced technique for collecting the ground truth? Would you mind providing scripts that we can run and test? To me, an accurate function boundary analyzer (built from debug information) would be a great contribution to the community.

01FreshMan commented 3 years ago

@ZhangZhuoSJTU You are right. Identifying function entry points sounds more reasonable to me.

jonasguan commented 3 years ago

> @jonasguan @peikexin9 I am more curious about why people care about function starts/ends. [...] Would you mind sharing some ideas about this?

Sure! I used to work as a malware reverse engineer, so I'll share a few use cases for function boundaries from personal experience. First, having the boundaries helps reverse engineers logically structure the messy assembly for human consumption, and a lot of malware reversing is still done manually. Similarly, this is helpful if you're trying to decompile parts of the binary. Second, once you have the boundaries, you can use the bytes or instructions of the function to create function-level signatures for malware detection. Malware authors often update their malware to avoid detection, but many core functions remain unchanged, so function-level signatures offer a more robust pattern for detecting malware.

But of course, as you've mentioned, we don't necessarily have to explicitly recover and store the function ends for these use cases; given the starts, we can also opt to follow the control flow and find the ends on the fly. The major benefit of our approach here is speed, since XDA can break large functions into fixed-size chunks for parallelization and take advantage of GPUs. This is particularly useful when disassembling large numbers of binaries to search for function signatures (e.g., scanning for known malware on a corporate network; constructing the CFGs of a few terabytes of binaries would take a long time!).

Finally, I also want to highlight that the main purpose of XDA is to propose using pre-trained language models to solve binary analysis tasks; function and instruction boundary recovery are just two sample downstream tasks we use to demonstrate the language model's usefulness. To this end, part of our decision to recover function/instruction boundaries was also for easier comparison against other tools, as these are common benchmarking tasks. Fortunately, our tool can easily be tweaked to only detect function entrypoints if that better suits your needs :)

jonasguan commented 3 years ago

> Q1: Is your evaluation fair to IDA Pro/Ghidra? [...]
>
> Q2: Why is your dataset not compiled with function splitting? [...]
>
> Q3: How do you get the ground truth of function boundaries? [...]

Question 1: We did consider this, and they are not marked as false positives for any of the tools we compared against (see Section 6, subsection Label collection: “We remove thunks and trampolines from the outputs of function recovery tools that count them as functions, to ensure that they are not inadvertently seen as false positives.”). However, we do rely on information from the symbol table; I'll elaborate below.

Question 2 & 3: Thank you for bringing this up. Frankly, I did not know that cold blocks generate fake function symbols. I had assumed that -fpartial-inlining would simply inline a partial copy of the callee into the caller, making it part of the caller function; I didn't realize that it splits part of the callee into a separate outlined function with its own symbol (could you confirm that this understanding is correct?).

As much as I'd like to contribute a new advanced technique for a function boundary analyzer, I'm afraid I don't have anything of the sort. We followed the standard practice of [4], [5], [7] and [48], and used compiler-generated debugging symbols as the ground truth for our function boundaries. More specifically, we used the labels/scripts of [4] and [7] where possible (since their datasets are commonly used to benchmark function boundary and entrypoint detection), and extracted our own symbols using Dia2Dump and pyelftools where not (both tools are freely available online). However, if cold blocks create a problem as prevalent as you suggest, our results may all have been affected. Could you provide some resources we can use to learn more about this?
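
For ELF binaries, that extraction is essentially a walk over the symbol table; roughly something like this minimal pyelftools sketch (illustrative only, not our exact script):

# Minimal sketch (illustrative): collect (name, start, size) for every
# function symbol in an ELF symbol table.
from elftools.elf.elffile import ELFFile

with open('example_binary', 'rb') as f:  # hypothetical input path
    elf = ELFFile(f)
    symtab = elf.get_section_by_name('.symtab')
    funcs = [(sym.name, sym['st_value'], sym['st_size'])
             for sym in symtab.iter_symbols()
             if sym['st_info']['type'] == 'STT_FUNC' and sym['st_size'] > 0]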

As for not using clang, the reason is a bit more mundane: we knew from the start that we wanted to compare our results directly with [48], which shares the most similarity with our approach (i.e., they also use neural networks, albeit a different architecture and without pretraining), so we chose GCC, ICC and MSVC, the compilers [48]'s dataset uses. As the project moved forward, we decided to also include other tools for comparison (Nucleus, IDA, Ghidra), but we continued using the same dataset given time constraints.

01FreshMan commented 3 years ago

> However, if cold blocks create a problem as prevalent as you suggest, our results may all have been affected. Could you provide some resources we can use to learn more about this?

xgcc.tar.gz

It is GCC, compiled by myself (I do not have access to SPEC 2006/17, but I believe both SPEC 2006 and 2017 contain GCC). Some information about the attached binary:

All symbols whose names end with .cold are in fact cold blocks (not functions). The readelf output is attached below.

$ readelf -s xgcc  | grep cold
    94: 0000000000403984    20 FUNC    LOCAL  DEFAULT   13 _ZL11save_stringPKci.cold
   157: 0000000000403a71    95 FUNC    LOCAL  DEFAULT   13 _ZL7executev.cold
   209: 0000000000403c4c    10 FUNC    LOCAL  DEFAULT   13 _ZL9store_argPKcii.cold
   219: 0000000000403c56     5 FUNC    LOCAL  DEFAULT   13 _ZL13close_at_filev.cold
   243: 0000000000403c5b    60 FUNC    LOCAL  DEFAULT   13 _ZL9do_spec_1PKciS0_.cold
   341: 0000000000403eb4    20 FUNC    LOCAL  DEFAULT   13 _Z15debug_set_namesj.cold
   639: 00000000004048b8    21 FUNC    LOCAL  DEFAULT   13 _Z10num_digitsi.cold
   925: 00000000004058a3     8 FUNC    LOCAL  DEFAULT   13 elf_zlib_inflate.cold
   946: 00000000004058b5     5 FUNC    LOCAL  DEFAULT   13 xregerror.cold
   983: 00000000004058ba   101 FUNC    LOCAL  DEFAULT   13 cplus_demangle_type.cold
  1002: 000000000040591f     5 FUNC    LOCAL  DEFAULT   13 htab_expand.cold
  1003: 0000000000405924     5 FUNC    LOCAL  DEFAULT   13 htab_clear_slot.cold
  1012: 000000000040592e    22 FUNC    LOCAL  DEFAULT   13 _obstack_newchunk.cold
  1013: 0000000000405944     5 FUNC    LOCAL  DEFAULT   13 _obstack_free.cold
  1015: 0000000000405949    43 FUNC    LOCAL  DEFAULT   13 __cxa_guard_acquire.cold
  1017: 0000000000405974    46 FUNC    LOCAL  DEFAULT   13 _Znwm.cold
  1035: 0000000000405a2d    10 FUNC    LOCAL  DEFAULT   13 __gxx_personality_v0.cold
  1070: 0000000000405b7b   101 FUNC    LOCAL  DEFAULT   13 d_type.cold
  1097: 0000000000405be0     5 FUNC    LOCAL  DEFAULT   13 uw_install_context_1.cold
  1099: 0000000000405be5     5 FUNC    LOCAL  DEFAULT   13 read_encoded_value.cold
  1101: 0000000000405bea    10 FUNC    LOCAL  DEFAULT   13 execute_stack_op.cold
  1103: 0000000000405bf4    10 FUNC    LOCAL  DEFAULT   13 uw_update_context_1.cold
  1105: 0000000000405bfe     5 FUNC    LOCAL  DEFAULT   13 execute_cfa_program.cold
  1107: 0000000000405c03     5 FUNC    LOCAL  DEFAULT   13 uw_frame_state_for.cold
  1110: 0000000000405c08     5 FUNC    LOCAL  DEFAULT   13 uw_init_context_1.cold
  1115: 0000000000405c17     6 FUNC    LOCAL  DEFAULT   13 _Unwind_GetGR.cold
  1116: 0000000000405c1d     6 FUNC    LOCAL  DEFAULT   13 _Unwind_SetGR.cold
  1119: 0000000000405c28     5 FUNC    LOCAL  DEFAULT   13 _Unwind_Resume.cold
  1121: 0000000000405c32     5 FUNC    LOCAL  DEFAULT   13 _Unwind_Backtrace.cold
  1136: 0000000000405c4c     5 FUNC    LOCAL  DEFAULT   13 add_fdes.cold
  1138: 0000000000405c51     5 FUNC    LOCAL  DEFAULT   13 linear_search_fdes.cold
  1148: 0000000000405c60     5 FUNC    LOCAL  DEFAULT   13 search_object.cold
  1153: 0000000000405c65     5 FUNC    LOCAL  DEFAULT   13 _Unwind_Find_FDE.cold

Simply extracting function boundaries from readelf does not work here.
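
If one insists on symbol-table-based ground truth, the .cold entries would at least need to be filtered out first, e.g. something like this sketch (hypothetical; funcs is assumed to be the (name, start, size) list read from the symbol table, as in the pyelftools sketch above):

# Hypothetical post-processing: GCC emits each split-out cold region as
# a LOCAL FUNC symbol named <parent>.cold, so such entries must be
# dropped (or merged back into their parents) before the symbol table
# can serve as function boundary ground truth.
funcs = [(name, start, size) for (name, start, size) in funcs
         if not name.endswith('.cold')]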

I haven't tested Dia2Dump or pyelftools, so it is hard for me to tell whether your results are affected here.

jonasguan commented 3 years ago

Thanks for the example @01FreshMan, I have a better understanding of your concern now. I think this boils down to what you consider a function, and this can get a bit philosophical. I'd actually argue that for most practical purposes of disassembly, it's correct to label these cold blocks as functions.

Here's what I believe to be the crux of the problem: are the functions as defined in the source code the ground truth, or the functions as used in the compiled binary? If we define the goal of function recovery as recovering the functions exactly as they are defined in the source code, then clearly these cold blocks should not be labeled as functions. We also cannot hope to accurately recover all function boundaries, because the structure of functions is often altered, or removed entirely, during optimization.

But that might not be the most useful definition for our purposes. The goal of recovering function boundaries (and of most disassembly) is almost always to serve some other task (e.g., reverse engineering, rewriting, instrumentation, malware detection…), usually when source code is unavailable. In many of these scenarios, it doesn't really matter how the function was originally defined in the source code; what I want to know is how the compiled binary behaves at runtime, e.g., which addresses it can call. In this view, since the cold blocks are used as functions, they should be recovered as functions. While the controversy over whether cold blocks are “fake” functions is new to me, I believe this is the main reason my more knowledgeable colleagues in this field have not raised this issue regarding these datasets, and why it is accepted practice to use symbol tables to generate ground truth.

In fact, IDA also labels the cold blocks in the gcc binary you provided as functions, and judging from your readelf output, it seems that readelf does too. Although I cannot confirm whether this was all fully intentional, I hope it at least shows that this is a reasonable view widely used in practice, and that we are not biased against other tools in our definition or evaluation.

01FreshMan commented 3 years ago

Hi @jonasguan,

Thanks for your patience.

There is one thing I should clarify.

> In fact, IDA also labels the cold blocks in the gcc binary you provided as functions, and judging from your readelf output, it seems that readelf does too. Although I cannot confirm whether this was all fully intentional, I hope it at least shows that this is a reasonable view widely used in practice, and that we are not biased against other tools in our definition or evaluation.

IDA Pro would not label these cold blocks as functions if you stripped the binary. The reason IDA Pro and readelf do so is that there is a fake function symbol associated with each cold block; it is more about compilers than disassemblers. (I may be wrong here, since I simply checked whether the addresses of the cold blocks appear in the Functions window, and I do not know exactly what you do.) If I am wrong, would you mind providing the IDA script that extracts the function boundaries? ;P
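
For reference, here is what I would expect such an extraction to look like in IDAPython (a minimal sketch using the standard idautils/idc API; I am of course guessing at your actual script):

# Minimal IDAPython sketch (my guess, not the authors' script): dump
# (start, end, name) for every function IDA recognizes.
import idautils
import idc

for start in idautils.Functions():
    end = idc.get_func_attr(start, idc.FUNCATTR_END)
    print(hex(start), hex(end), idc.get_func_name(start))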

To tell the truth, I really think XDA is great work. All my concern is about the evaluation, where I felt somewhat let down after getting excited about the technique itself. I apologize if I put this bluntly, but I think the evaluation is not carefully designed, or at least not well articulated (otherwise we would not be having this discussion about cold blocks).

For example, given that around 10% of the function symbols in the provided gcc binary are actually cold blocks, and since your ground truth (maybe, but likely) counts them as functions while IDA Pro will not(?), IDA's recall would be heavily underestimated. Just as thunks and trampolines were removed from the tools' outputs, these functions (cold blocks) should have been put back for a fair comparison.

> I think this boils down to what you consider a function, and this can get a bit philosophical. I'd actually argue that for most practical purposes of disassembly, it's correct to label these cold blocks as functions.
>
> Here's what I believe to be the crux of the problem: are the functions as defined in the source code the ground truth, or the functions as used in the compiled binary? If we define the goal of function recovery as recovering the functions exactly as they are defined in the source code, then clearly these cold blocks should not be labeled as functions. We also cannot hope to accurately recover all function boundaries, because the structure of functions is often altered, or removed entirely, during optimization.

Nevertheless, if such a statement had been placed in the paper, most of the confusion and misunderstanding (like mine) would be gone. Anyway, I think you have really answered all my questions, and I am good now.

Thanks!