NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Incorrect Arm Variant Detection #1521

Open astrelsky opened 4 years ago

astrelsky commented 4 years ago

Describe the bug: ARM processor variant detection isn't correct.

To reproduce:

  1. Compile an ARM binary with the following CFLAGS: -marm -mcpu=arm1176jzf-s
  2. Import the binary into Ghidra.
  3. See that the variant defaults to "ARM:LE:32:v8:default".
  4. See that the only recommended specs are "ARM:LE:32:v8:default" and "ARM:LE:32:v8:windows".

Expected behavior: Proper variant detection. The output of readelf -h binary is as follows.

ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x8050
  Start of program headers:          52 (bytes into file)
  Start of section headers:          448160 (bytes into file)
  Flags:                             0x5000400, Version5 EABI, hard-float ABI
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         2
  Size of section headers:           40 (bytes)
  Number of section headers:         22
  Section header string table index: 21


mumbel commented 4 years ago

ELF opinions work off of the machine and flags fields. ARM's flags don't encode -mcpu; that is found in the .ARM.attributes section, which can't be parsed until the loader already knows it's some language variant of ARM.

astrelsky commented 4 years ago

> ELF opinions work off of the machine and flags fields. ARM's flags don't encode -mcpu; that is found in the .ARM.attributes section, which can't be parsed until the loader already knows it's some language variant of ARM.

I only specified the -mcpu to make it easier to reproduce. I'm not even sure what variant I should use as I get base opcodes from everything I select.

mumbel commented 4 years ago

Besides the CPU name, the architecture variant itself is there in Tag_CPU_arch (0x6, I think).

Name | Value | Meaning
PT_ARM_ARCHEXT_ARCH_UNKN | 0x00 | The needed architecture is unknown or specified in some other way
PT_ARM_ARCHEXT_ARCHv4 | 0x01 | Architecture v4
PT_ARM_ARCHEXT_ARCHv4T | 0x02 | Architecture v4T
PT_ARM_ARCHEXT_ARCHv5T | 0x03 | Architecture v5T
PT_ARM_ARCHEXT_ARCHv5TE | 0x04 | Architecture v5TE
PT_ARM_ARCHEXT_ARCHv5TEJ | 0x05 | Architecture v5TEJ
PT_ARM_ARCHEXT_ARCHv6 | 0x06 | Architecture v6
PT_ARM_ARCHEXT_ARCHv6KZ | 0x07 | Architecture v6KZ
PT_ARM_ARCHEXT_ARCHv6T2 | 0x08 | Architecture v6T2
PT_ARM_ARCHEXT_ARCHv6K | 0x09 | Architecture v6K
PT_ARM_ARCHEXT_ARCHv7 | 0x0A | Architecture v7 (in this case the architecture profile may also be required to fully specify the needed execution environment)
PT_ARM_ARCHEXT_ARCHv6M | 0x0B | Architecture v6M (e.g. Cortex-M0)
PT_ARM_ARCHEXT_ARCHv6SM | 0x0C | Architecture v6S-M (e.g. Cortex-M0)
PT_ARM_ARCHEXT_ARCHv7EM | 0x0D | Architecture v7E-M
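
To make the Tag_CPU_arch lookup concrete, here is a minimal Python sketch that walks the raw bytes of a .ARM.attributes section and returns the architecture code. The layout it assumes (a format byte 'A', length-prefixed vendor subsections, a file-scope block of ULEB128 tags) is my reading of the ARM build attributes format, and it also assumes a little-endian file. The helper names (find_cpu_arch, read_uleb128, and so on) are made up for illustration, and malformed input may simply raise.

```python
import struct

# Names keyed by architecture code; assumed to share the numbering of the
# PT_ARM_ARCHEXT table quoted above.
CPU_ARCH_NAMES = {
    0x00: "unknown", 0x01: "v4", 0x02: "v4T", 0x03: "v5T", 0x04: "v5TE",
    0x05: "v5TEJ", 0x06: "v6", 0x07: "v6KZ", 0x08: "v6T2", 0x09: "v6K",
    0x0A: "v7", 0x0B: "v6M", 0x0C: "v6S-M", 0x0D: "v7E-M",
}

TAG_FILE = 1            # scope tag for whole-file attributes
TAG_CPU_ARCH = 6        # Tag_CPU_arch
TAG_COMPATIBILITY = 32  # assumed encoding: ULEB128 flag followed by a string


def read_uleb128(data, pos):
    """Decode one ULEB128 value; return (value, next_position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def skip_cstring(data, pos):
    """Return the position just past the next NUL byte."""
    return data.index(b"\x00", pos) + 1


def tag_is_string(tag):
    # Assumed aeabi convention: tags 4 and 5, and odd tags above 32, are strings.
    return tag in (4, 5) or (tag > 32 and tag % 2 == 1)


def find_cpu_arch(section):
    """Return the Tag_CPU_arch value from raw .ARM.attributes bytes, or None."""
    if not section or section[0] != ord("A"):
        return None
    pos = 1
    while pos + 4 <= len(section):
        (sub_len,) = struct.unpack_from("<I", section, pos)
        if sub_len < 5:
            break  # malformed length; stop instead of looping forever
        sub_end = pos + sub_len
        name_end = skip_cstring(section, pos + 4)
        vendor = section[pos + 4:name_end - 1]
        cur = name_end
        while vendor == b"aeabi" and cur + 5 <= sub_end:
            scope = section[cur]
            (size,) = struct.unpack_from("<I", section, cur + 1)
            if size < 5:
                break
            end = cur + size
            attr = cur + 5
            while scope == TAG_FILE and attr < end:
                tag, attr = read_uleb128(section, attr)
                if tag == TAG_COMPATIBILITY:
                    _, attr = read_uleb128(section, attr)
                    attr = skip_cstring(section, attr)
                elif tag_is_string(tag):
                    attr = skip_cstring(section, attr)
                else:
                    value, attr = read_uleb128(section, attr)
                    if tag == TAG_CPU_ARCH:
                        return value
            cur = end
        pos = sub_end
    return None
```

Note that this needs the section contents, so it can only run after the section headers have been read, which is exactly the ordering problem described above.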

I think there would have to be some change to how ElfLoader interacts with the ElfExtension factories: probably parse based solely on the primary key first, and then make a second pass with the secondary key if needed.

ghidra1 commented 4 years ago

This architecture information is contained within each Program Header p_type field (there can be more than one such header) which is not accessible during the opinion phase. The ELF opinion mechanism only has access to the e_machine (primary) and e_flags (secondary) fields.
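
For reference, a rough Python sketch of what those two fields look like when read straight from the file. It assumes a little-endian ELF32 header like the one in the readelf output above. The function names are hypothetical and this is not how Ghidra's opinion code is written; it only shows how little information e_machine and e_flags carry.

```python
import struct

EM_ARM = 40                     # e_machine value for ARM
EF_ARM_EABIMASK = 0xFF000000    # top byte of e_flags holds the EABI version
EF_ARM_ABI_FLOAT_HARD = 0x400   # hard-float ABI bit

# e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
# e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")  # 52 bytes


def opinion_inputs(header_bytes):
    """Return (e_machine, e_flags) from a little-endian ELF32 header."""
    fields = ELF32_HEADER.unpack_from(header_bytes)
    e_ident, e_machine, e_flags = fields[0], fields[2], fields[7]
    if e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return e_machine, e_flags


def describe_arm_flags(e_flags):
    """Decode the parts of e_flags that readelf printed for this binary."""
    return {
        "eabi_version": (e_flags & EF_ARM_EABIMASK) >> 24,
        "hard_float": bool(e_flags & EF_ARM_ABI_FLOAT_HARD),
    }
```

For the header in this issue that gives machine 40 and flags 0x5000400, i.e. EABI version 5 with the hard-float bit set, and nothing that distinguishes v6 from v8.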

What issues do you encounter when using the v8 variant instead of the variant indicated within the header?

astrelsky commented 4 years ago

> This architecture information is contained within each Program Header p_type field (there can be more than one such header) which is not accessible during the opinion phase. The ELF opinion mechanism only has access to the e_machine (primary) and e_flags (secondary) fields.
>
> What issues do you encounter when using the v8 variant instead of the variant indicated within the header?

I was initially seeing a bunch of bad instruction markers; however, they appear to clear up after letting the analysis run, since it's actually data. Is there a quick way I could compare the disassembly of the same binary using two different variants?

nsajko commented 4 years ago

Maybe this is the wrong place to say it, but the options for selection of ARM variants are very confusing, and seem like a wrong way to organize things.

Firstly, v8 is the only one that is not hidden by default, and it is not clear to me (the user) why that is so.

Secondly, it is not clear what the different options even mean. The worst offender is the Cortex option: is it for Cortex-M (microcontrollers), Cortex-A (application processors, like in servers or phones), or Cortex-R (Cortex-A with real-time features)? Even the "v8", "v7", "v6" options are unclear in that regard. Apart from the whole "profiles" (M, A, R) thing, ARM has multiple general-purpose instruction sets which can be switched at run time, all still in use today: A32 (old ARM), T32 (Thumb), and the new A64. This further complicates things, and the instruction sets available depend on the core.

To be more specific, do I choose v7 for both Cortex-A8 (v7-A, runs A32 and T32) and Cortex-M7 (v7E-M, runs just T32)? Or should I choose Cortex?

Also note that for example ARM1176JZF-S (ARMv6) runs only A32, while an ARMv6-M (Cortex-M0, for example) core runs only T32.

ghidra1 commented 4 years ago

> Is there a quick way I could compare the disassembly

@astrelsky : you can use the "Diff" (side-by-side) view from the Listing toolbar. Hopefully it will work for this situation, although mileage may vary when used with different language implementations. I think it should work since the memory architecture should match.

ghidra1 commented 4 years ago

@nsajko : ARM language variants are generally simplified to include all extensions supported by the architecture version.

Unfortunately, the "Cortex" variant name is rather ambiguous. The "Cortex" variant employs the ARMv7 language spec (sla) with a different pspec (ARMCortex vs ARMt). It would seem that the "Cortex" variant uses Thumb mode by default and defines some different address vector symbols. I believe it is intended to model ARMv7-M / Cortex-M3. I am uncertain if all ARMv7 Cortex-M extensions have been implemented (e.g., Cortex-M4 and M7).

All ARMv7 Ghidra language specs (sla) support both A32 and T32, although the intent for the "Cortex" variant (i.e., ARMv7-M) is for it to remain in Thumb mode. It is intended that T32-only and A32-only code will not encounter a context switch to the other mode, although bad disassembly could cause this condition to occur.

NOTE: The term Thumb is used to refer to all thumb-based instruction sets.

astrelsky commented 4 years ago

> > Is there a quick way I could compare the disassembly
>
> @astrelsky : you can use the "Diff" (side-by-side) view from the Listing toolbar. Hopefully it will work for this situation, although mileage may vary when used with different language implementations. I think it should work since the memory architecture should match.

I'll give it a shot later today.

What are your thoughts on having a class extend ElfLoader specifically for ARM that can auto detect the variant? Even if there is no difference in disassembly it may help the user if they don't know what variant the binary is. Knowing what the correct variant is would help the user locate appropriate information on the processor. Of course such a thing would be optional.
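
To sketch what I mean by auto-detection (in Python rather than Java, just to keep it short): once an architecture code is available, the loader could suggest a variant and fall back to the current default otherwise. Everything here is hypothetical. Only "ARM:LE:32:v8:default" appears verbatim in this thread; the other strings assume the same naming pattern for the v6, v7 and Cortex options mentioned above, and the choice of which code maps to which variant is my guess, not Ghidra's behavior.

```python
# Current behavior reported in this issue: every ARM ELF gets this spec.
DEFAULT_SPEC = "ARM:LE:32:v8:default"

# Hypothetical mapping from architecture code to a variant name. The codes
# follow the table earlier in the thread; the variant choices are guesses.
SUGGESTED_VARIANT = {
    0x06: "v6",      # v6
    0x07: "v6",      # v6KZ (e.g. the arm1176jzf-s build in this issue)
    0x08: "v6",      # v6T2
    0x09: "v6",      # v6K
    0x0A: "v7",      # v7
    0x0D: "Cortex",  # v7E-M, assuming "Cortex" models the v7-M family
}


def suggest_spec(arch_code, default=DEFAULT_SPEC):
    """Return a suggested language/compiler spec string for an arch code."""
    variant = SUGGESTED_VARIANT.get(arch_code)
    if variant is None:
        return default  # unknown or unmapped: keep today's behavior
    return f"ARM:LE:32:{variant}:default"
```

Since the result is only a suggestion, the user could still override it in the import dialog, which keeps the feature optional.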