NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Reduced Function Start Identification in Ghidra9.1 compared to Ghidra 9.0.4 for ARM binaries #1532

Closed gsrishaila closed 1 year ago

gsrishaila commented 4 years ago

Describe the bug

The number of correctly detected function starts found by Ghidra 9.1 is lower than that found by Ghidra 9.0.4 for stripped ELF32 little-endian ARM binaries. We noticed this by comparing the function starts identified by both versions of Ghidra against the function start addresses given by the DWARF debugging information that GCC emits during compilation. The following link points to a recent paper that uses this technique to find the addresses in a binary that correspond to actual function starts. https://www.usenix.org/conference/cset19/speaker-or-organizer/sri-shaila-g-university-california-riverside

We compared function start identification in both Ghidra versions on a test set of 10 benign binaries from the SPEC 2017 benchmark. For ARM binaries, Ghidra 9.0.4 correctly identified more function starts than Ghidra 9.1.

To Reproduce

We take the 544.nab_r binary from the SPEC 2017 benchmark as an example. Other binaries can also be used.

  1. Compile the source code into an ELF32 little-endian ARM binary using the GCC ARM compiler. Use the -g3 flag to attach debugging information to the dynamically linked, unstripped binary.

  2. The debugging information attached to the unstripped binary gives the addresses in the binary that correspond to actual function starts. We refer to this list of function starts as the ground truth. [A reasonable but less accurate alternative is to run the disassembler on the unstripped binary to find the function starts, but a few functions might be falsely identified.]

  3. Compile the source code into a stripped ELF32 little-endian ARM binary using the GCC compiler. Use the -s flag to produce a dynamically linked, stripped binary.

  4. Disassemble the stripped binary from step 3 with Ghidra 9.0.4 and record the function starts found by Ghidra 9.0.4 that also appear in the ground truth.

  5. Disassemble the stripped binary from step 3 with Ghidra 9.1 and record the function starts found by Ghidra 9.1 that also appear in the ground truth.

  6. Compare the number of functions correctly identified by the two versions. In our observation, Ghidra 9.0.4 identified the functions at the following locations, but Ghidra 9.1 did not. These functions are also present in the ground truth.

0x22654 0x226a8 0x22704 0x22758 0x2279c 0x2847c 0x28904 0x27c3c 0x291ac
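The comparison in the final two steps amounts to set arithmetic over addresses. The following is a minimal Python sketch, assuming each tool's function starts have already been exported as integers. The helper names `correct_starts` and `regressions` are hypothetical, and the extra addresses `0x10000` and `0x30000` are placeholders added only to make the example self-contained; they do not come from the nab binary.

```python
def correct_starts(found, ground_truth):
    """Function starts reported by a tool that also appear in the ground truth."""
    return set(found) & set(ground_truth)


def regressions(old_found, new_found, ground_truth):
    """Ground-truth starts that the old version found but the new version missed."""
    old_ok = correct_starts(old_found, ground_truth)
    new_ok = correct_starts(new_found, ground_truth)
    return sorted(old_ok - new_ok)


# The nine addresses listed in this report.
REPORTED = [0x22654, 0x226A8, 0x22704, 0x22758, 0x2279C,
            0x2847C, 0x28904, 0x27C3C, 0x291AC]

if __name__ == "__main__":
    # Placeholder inputs: 0x10000 stands for a start both versions find,
    # 0x30000 for a false positive that is absent from the ground truth.
    ground_truth = set(REPORTED) | {0x10000}
    ghidra_904 = set(REPORTED) | {0x10000, 0x30000}
    ghidra_91 = {0x10000, 0x30000}

    for addr in regressions(ghidra_904, ghidra_91, ground_truth):
        print(hex(addr))
```

Intersecting with the ground truth first keeps false positives from either version out of the comparison, so only correctly identified starts count as regressions.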

Expected behavior

We expected Ghidra 9.1 to identify at least as many correct function starts as Ghidra 9.0.4.

Screenshots

Ghidra90_0x226a8: Ghidra 9.0.4 can identify the function start at address 0x226a8

Ghidra91_0x226a8: Ghidra 9.1 cannot identify the function start at address 0x226a8

Ghidra90_0x22654: Ghidra 9.0.4 can identify the function start at address 0x22654

Ghidra91_0x22654: Ghidra 9.1 cannot identify the function start at address 0x22654

Attachments

I have attached the compiled ARM binaries, both the stripped and the unstripped versions, for the nab program from the SPEC 2017 benchmark.

Nab_Binaries.zip

I have attached the original ARM_LE_patterns.xml file from both Ghidra versions. I have also attached the patched version (according to the description below) of ARM_LE_patterns.xml for Ghidra 9.1.

GhidraFunctionBytePatterns.zip

Environment

OS: Ubuntu 18.04 LTS and Windows 10 Home Edition
Java Version: OpenJDK 11.0.6
Ghidra Version: Ghidra 9.0.4 and Ghidra 9.1

Additional context

The Fix/Patch

On further analysis, we found that the following line was added to the ARM_LE_patterns.xml file in Ghidra 9.1:

<align mark="0" bits="3"/>

This line prevents the function byte pattern rule associated with it from being applied to the ARM ELF binary under analysis, which decreases the number of function starts that Ghidra 9.1 identifies correctly. Removing the line increases that number. With the line removed, both versions correctly identify a similar number of function starts.
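The patch described here can be applied mechanically. The following is a minimal Python sketch, assuming the constraint always sits on a line of its own in the patterns file. The helper names `strip_align` and `patch_file` are hypothetical, and the XML fragment in the demo is an illustrative stand-in, not a copy of Ghidra's actual pattern file.

```python
import re
from pathlib import Path

# Matches a line that holds only the align element quoted above
# (whitespace between attributes is allowed to vary).
ALIGN_LINE = re.compile(r'^\s*<align\s+mark="0"\s+bits="3"\s*/>\s*$')


def strip_align(xml_text):
    """Return xml_text without lines consisting solely of <align mark="0" bits="3"/>."""
    kept = [line for line in xml_text.splitlines(keepends=True)
            if not ALIGN_LINE.match(line)]
    return "".join(kept)


def patch_file(src, dst):
    """Write a patched copy of the patterns file at src to dst."""
    Path(dst).write_text(strip_align(Path(src).read_text()))


if __name__ == "__main__":
    # Illustrative fragment only; the real file has more structure.
    sample = (
        "<pattern>\n"
        '  <align mark="0" bits="3"/>\n'
        "  <funcstart/>\n"
        "</pattern>\n"
    )
    print(strip_align(sample), end="")
```

This removes every occurrence of the line. If only some rules should lose the constraint, the file would need to be edited by hand or the filter narrowed. Writing to a separate destination keeps the original file available for comparison.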

We also found that Ghidra 9.1.1 could not identify the function starts listed above for the 544.nab_r binary. It recovers more function starts when the patched ARM_LE_patterns.xml attached to this report is used in place of the original version.
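As a side note on what the removed constraint might express: if `bits="3"` is read as requiring the three low-order bits of an address to be zero (8-byte alignment), the check is a single mask. This reading is an assumption, not a statement of how Ghidra evaluates the rule, and which address `mark="0"` refers to within a pattern is not established here. The Python sketch below applies that assumed check to the nine addresses listed earlier; `is_aligned` is a hypothetical helper.

```python
def is_aligned(addr, bits):
    """True if the low `bits` bits of addr are zero (assumed meaning of the constraint)."""
    return addr & ((1 << bits) - 1) == 0


# The nine addresses listed in this report.
REPORTED = [0x22654, 0x226A8, 0x22704, 0x22758, 0x2279C,
            0x2847C, 0x28904, 0x27C3C, 0x291AC]

if __name__ == "__main__":
    for addr in REPORTED:
        status = "aligned" if is_aligned(addr, 3) else "not aligned"
        print(f"{addr:#x}: {status} to 8 bytes")
```

Under this reading, only 0x226a8 and 0x22758 are 8-byte aligned, so alignment of the function start by itself would not account for all nine misses.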

emteere commented 1 year ago

The patterns appear to be fixed in the latest 10.2.x. There are a few tweaks that could be done for the instruction stmdb sp!,{r4, lr}

A bug involving the bits constraint, fixed in a prior version, caused many patterns to fail to match and apply.

Other function/code discovery methods are in development that will recover even more functions, which would be better than adding more patterns.

This is a great example, and we'll use it to continue to improve. Of the addresses above, only one, 0x291ac, was not recovered with the current version. This location isn't a good pattern start, but it could be recovered via other methods.

Closing this because the pattern issue has been resolved.