angr / pypcode

Python bindings to Ghidra's SLEIGH library for disassembly and lifting to P-Code IR
https://api.angr.io/projects/pypcode/en/latest/

ARM Thumb and MIPS16e Disassembly Failure #115

Closed MustBastani closed 3 weeks ago

MustBastani commented 4 weeks ago

Description

I need to perform symbolic execution on a 32-bit MIPS binary that contains some MIPS16e instructions, but pypcode fails to disassemble/translate it. It likewise fails to disassemble a binary with ARM Thumb instructions (both ISAs have 2-byte instruction alignment). In contrast, UberEngine with VEX IR disassembles the ARM binary successfully.

Steps to reproduce the bug

  1. ARM angr example w/ UberEngine: --> Successful
In [1]: proj = angr.load_shellcode(b'\x2d\xe9\xf0\x41\x82\xb0\xdd\xf8\x20\x80\x06\x46\x1c\x46\x17\x46\x0d\x46\x40\x46\xa0\xf1\x3a\xf5', arch=archinfo.arch_from_id("ARM32", "LE"), load_address=0x40d59374, start_offset=1)
INFO     | 2024-08-20 11:56:40,685 | angr.project   | Loading binary from stream

In [2]: cfg = proj.analyses.CFG(normalize=True)
INFO     | 2024-08-20 11:56:44,340 | angr.analyses.cfg.cfg_base | Loaded 2 indirect jump resolvers (0 timeless, 2 generic).
INFO     | 2024-08-20 11:56:44,341 | angr.analyses.cfg.cfg_fast | Loaded 0 exception handlings from 0 binaries.
INFO     | 2024-08-20 11:56:44,341 | angr.analyses.cfg.cfg_fast | Found 1 functions with prologue scanning.

In [3]: for node in sorted(cfg.model.nodes(), key=lambda n: n.addr):
    ...:     if not node.is_simprocedure:
    ...:         node.block.pp()

40d59374  push.w  {r4, r5, r6, r7, r8, lr}
40d59378  sub     sp, #0x8
40d5937a  ldr.w   r8, [sp,#0x20]
40d5937e  mov     r6, r0
40d59380  mov     r4, r3
40d59382  mov     r7, r2
40d59384  mov     r5, r1
40d59386  mov     r0, r8
40d59388  bl      #0x412f9e00
  2. ARM angr example w/ UberEnginePcode:
In [1]: proj = angr.load_shellcode(b'\x2d\xe9\xf0\x41\x82\xb0\xdd\xf8\x20\x80\x06\x46\x1c\x46\x17\x46\x0d\x46\x40\x46\xa0\xf1\x3a\xf5', arch=archinfo.ArchPcode("ARM:LE:32:v8"), load_address=0x40d59375, start_offset=0)
INFO     | 2024-08-20 12:00:30,011 | angr.project   | Loading binary from stream
WARNING  | 2024-08-20 12:00:30,012 | angr.factory   | Creating project with the experimental 'UberEnginePcode' engine
DEBUG    | 2024-08-20 12:00:30,012 | angr.project   | hooking 0x40e00000 with <SimProcedure CallReturn>
DEBUG    | 2024-08-20 12:00:30,012 | angr.project   | hooking 0x40e00008 with <SimProcedure UnresolvableJumpTarget>
DEBUG    | 2024-08-20 12:00:30,012 | angr.project   | hooking 0x40e00010 with <SimProcedure UnresolvableCallTarget>

In [2]: cfg = proj.analyses.CFG(normalize=True)
INFO     | 2024-08-20 12:00:34,261 | angr.analyses.cfg.cfg_base | Loaded 2 indirect jump resolvers (0 timeless, 2 generic).
DEBUG    | 2024-08-20 12:00:34,262 | angr.project   | hooking 0x40e00014 with <SimProcedure UnresolvableJumpTarget>
DEBUG    | 2024-08-20 12:00:34,262 | angr.project   | hooking 0x40e00018 with <SimProcedure UnresolvableCallTarget>
DEBUG    | 2024-08-20 12:00:34,262 | angr.analyses.cfg.cfg_base | CFG recovery covers 1 regions:
DEBUG    | 2024-08-20 12:00:34,262 | angr.analyses.cfg.cfg_base | ... 0x40d59375 - 0x40d5938d
INFO     | 2024-08-20 12:00:34,262 | angr.analyses.cfg.cfg_fast | Loaded 0 exception handlings from 0 binaries.
INFO     | 2024-08-20 12:00:34,262 | angr.analyses.cfg.cfg_fast | Found 0 functions with prologue scanning.
DEBUG    | 2024-08-20 12:00:34,263 | angr.analyses.cfg.cfg_fast | Returning a new recon address: 0x40d59375
DEBUG    | 2024-08-20 12:00:34,263 | angr.analyses.cfg.cfg_fast | Searching address 40d59375
DEBUG    | 2024-08-20 12:00:34,263 | angr.analyses.cfg.cfg_fast | Searching address 40d59376
DEBUG    | 2024-08-20 12:00:34,263 | angr.analyses.cfg.cfg_fast | Searching address 40d59375
DEBUG    | 2024-08-20 12:00:34,263 | angr.analyses.cfg.cfg_fast | Force-scanning to 0x40d59375
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 1
----> 1 cfg = proj.analyses.CFG(normalize=True)

File github.com/angr/angr-dev/angr/angr/analyses/analysis.py:216, in AnalysisFactory.__call__(self, *args, **kwargs)
    210 show_progressbar = kwargs.pop("show_progressbar", False)
    212 w = self.prep(
    213     fail_fast=fail_fast, kb=kb, progress_callback=progress_callback, show_progressbar=show_progressbar
    214 )
--> 216 r = w(*args, **kwargs)
    217 # clean up so that it's always pickleable
    218 r._progressbar = None

File github.com/angr/angr-dev/angr/angr/analyses/analysis.py:201, in AnalysisFactory.prep.<locals>.wrapper(*args, **kwargs)
    198 oself._progress_callback = progress_callback
    200 oself._show_progressbar = show_progressbar
--> 201 oself.__init__(*args, **kwargs)
    202 return oself

File github.com/angr/angr-dev/angr/angr/analyses/cfg/cfg.py:68, in CFG.__init__(self, **kwargs)
     65         raise OutdatedError(outdated_exception)
     67 # Now initializes CFGFast :-)
---> 68 CFGFast.__init__(self, **kwargs)

File github.com/angr/angr-dev/angr/angr/analyses/cfg/cfg_fast.py:843, in CFGFast.__init__(self, binary, objects, regions, pickle_intermediate_results, symbols, function_prologues, resolve_indirect_jumps, force_segment, force_smart_scan, force_complete_scan, indirect_jump_target_limit, data_references, cross_references, normalize, start_at_entry, function_starts, extra_memory_regions, data_type_guessing_handlers, arch_options, indirect_jump_resolvers, base_state, exclude_sparse_regions, skip_specific_regions, heuristic_plt_resolving, detect_tail_calls, low_priority, cfb, model, elf_eh_frame, exceptions, skip_unmapped_addrs, nodecode_window_size, nodecode_threshold, nodecode_step, indirect_calls_always_return, jumptable_resolver_resolves_calls, start, end, collect_data_references, extra_cross_references, **extra_arch_options)
    834 self._decoding_assumption_relations = None
    836 # A mapping between address and the actual data in memory
    837 # self._memory_data = { }
    838 # A mapping between address of the instruction that's referencing the memory data and the memory data itself
   (...)
    841
    842 # Start working!
--> 843 self._analyze()

File github.com/angr/angr-dev/angr/angr/analyses/forward_analysis/forward_analysis.py:247, in ForwardAnalysis._analyze(self)
    240 self._pre_analysis()
    242 if self._graph_visitor is None:
    243     # There is no base graph that we can rely on. The analysis itself should generate successors for the
    244     # current job.
    245     # An example is the CFG recovery.
--> 247     self._analysis_core_baremetal()
    249 else:
    250     # We have a base graph to follow. Just handle the current job.
    252     self._analysis_core_graph()

File github.com/angr/angr-dev/angr/angr/analyses/forward_analysis/forward_analysis.py:356, in ForwardAnalysis._analysis_core_baremetal(self)
    354 def _analysis_core_baremetal(self) -> None:
    355     if not self._job_info_queue:
--> 356         self._job_queue_empty()
    358     while not self.should_abort:
    359         if self._status_callback is not None:

File github.com/angr/angr-dev/angr/angr/analyses/cfg/cfg_fast.py:1488, in CFGFast._job_queue_empty(self)
   1485     if bytes_prefix is None:
   1486         # we are out of the mapped memory range - just return
   1487         return
-> 1488     if any(re.match(prolog, bytes_prefix) for prolog in self.project.arch.thumb_prologs):
   1489         addr |= 1
   1491 if addr % 2 == 0:
   1492     # another heuristics: take a look at the closest function. if it's THUMB mode, this address
   1493     # should be THUMB, too.

AttributeError: 'ArchPcode' object has no attribute 'thumb_prologs'
  3. ARM pypcode example:
In [1]: from pypcode import Context, PcodePrettyPrinter
In [2]: ctx = Context("ARM:LE:32:v8")
In [3]: shellcode = b'\x2d\xe9\xf0\x41\x82\xb0\xdd\xf8\x20\x80\x06\x46\x1c\x46\x17\x46\x0d\x46\x40\x46\xa0\xf1\x3a\xf5'
In [4]: dx = ctx.disassemble(shellcode, 0x40d59375, 0, len(shellcode), 99999)
In [5]: for ins in dx.instructions:
    ...:     print(f"{ins.addr.offset:#x}/{ins.length}: {ins.mnem} {ins.body}")
    ...:
0x40d59375/4: mvnmis lr,sp, lsr #0x12

The first four bytes are decoded as a single ARM instruction, and disassembly fails at the next four bytes. I couldn’t find any API in either pypcode or its C++ backend to set the instruction alignment or change the ISA_MODE register. Also, I tried hacking pypcode by manually

but it still failed to disassemble, throwing BadDataError. https://github.com/angr/pypcode/blob/da0cff97026092759fda72113f83807f230b13b1/pypcode/sleigh/slghsymbol.cc#L2293-L2294

  4. MIPS pypcode example:
In [1]: from pypcode import Context, PcodePrettyPrinter
In [2]: ctx = Context("MIPS:LE:32:default")
In [3]: shellcode = b'\xfa\x64\xc1\x18\xc6\x28\x08\x04\x01\x72\xfb\x61\x5d\x67\x40\x1a\x19\x25\x90\xaa\x02\x67\x5d\x67\x00\xf0\x1b\x05'
In [4]: dx = ctx.disassemble(shellcode, 0x90489475, 0, len(shellcode), 99999)
---------------------------------------------------------------------------
BadDataError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 dx = ctx.disassemble(shellcode, 0x90489475, 0, len(shellcode), 99999)

BadDataError: r0x90489475: Unable to resolve constructor

Environment

I used angr 9.2.80.dev0, archinfo 9.2.80.dev0, and pypcode 3.0.3.dev0 for all the experiments.

Modifications: I used a modified version of Ghidra 11.1.1 to disassemble the above MIPS shellcode/binary (example 4), and here is the result (I also modified the pypcode MIPS processor accordingly):

     **************************************************************
     *                          FUNCTION                          *
     **************************************************************
                 undefined FUN_90489474()
                   assume ISA_MODE = 0x1
                   assume PAIR_INSTRUCTION_FLAG = 0x0
                 FUN_90489474

        90489474 fa 64           save       0x50,ra,s0-s1
        90489476 c1 18 c6 28     jal        FUN_9098a318
        9048947a 08 04           _addiu     a0,sp,0x20
        9048947c 01 72           cmpi       v0,0x1
        9048947e fb 61           btnez      LAB_90489476
        90489480 5d 67           move       v0,sp
        90489482 40 1a 19 25     jal        FUN_90489464
        90489486 90 aa           _lhu       a0,0x20(v0)
        90489488 02 67           move       s0,v0
        9048948a 5d 67           move       v0,sp
        9048948c 00 f0 1b 05     addiu      a1,sp,0x1b

You may receive a different error when trying example 2. angr has some minor bugs when using UberEnginePcode, which I fixed as follows:

diff --git a/angr/engines/pcode/cc.py b/angr/engines/pcode/cc.py
index 787e48150..2e6f15e25 100644
--- a/angr/engines/pcode/cc.py
+++ b/angr/engines/pcode/cc.py
@@ -38,6 +38,26 @@ class SimCCRISCV(SimCC):
     RETURN_VAL = SimRegArg("a0", 8)

+class SimCCMips32(SimCC):
+    """
+    Default CC for MIPS32
+    """
+
+    ARG_REGS = ["a0", "a1", "a2", "a3"]
+    RETURN_ADDR = SimRegArg("ra", 4)
+    RETURN_VAL = SimRegArg("v0", 4)
+
+
+class SimCCArm32(SimCC):
+    """
+    Default CC for ARM32
+    """
+
+    ARG_REGS = ["r0", "r1", "r2", "r3"]
+    RETURN_ADDR = SimRegArg("lr", 4)
+    RETURN_VAL = SimRegArg("r0", 4)
+
+
 class SimCCSPARC(SimCC):
     """
     Default CC for SPARC
@@ -104,6 +124,8 @@ def register_pcode_arch_default_cc(arch: ArchPcode):
             "PowerPC:BE:32:e200": SimCCPowerPC,
             "PowerPC:BE:32:MPC8270": SimCCPowerPC,
             "Xtensa:LE:32:default": SimCCXtensa,
+            "MIPS:LE:32:default": SimCCMips32,
+            "ARM:LE:32:v8": SimCCArm32,
         }
         if arch.name in manual_cc_mapping:
             # first attempt: manually specified mappings
diff --git a/angr/engines/pcode/lifter.py b/angr/engines/pcode/lifter.py
index 20d0b9e23..ca752e7cc 100644
--- a/angr/engines/pcode/lifter.py
+++ b/angr/engines/pcode/lifter.py
@@ -861,7 +861,7 @@ class PcodeBasicBlockLifter:

         # Translate
         addr = baseaddr + bytes_offset
-        result = self.context.translate(data[bytes_offset : bytes_offset + max_bytes], addr, max_inst, max_bytes, True)
+        result = self.context.translate(bytes(data[bytes_offset : bytes_offset + max_bytes]), addr, bytes_offset, max_bytes, max_inst)
         irsb._instructions = result.instructions

         # Post-process block to mark exits and next block

Additional context

I am not sure if it was a good idea to put all this information in one issue 😕. Also, I'm willing to work on this issue. I just don't know what the actual problem is yet. I can provide additional information about the binary via email.

mborgerson commented 3 weeks ago

@MustBastani

Thanks for using angr+pypcode and filing this detailed bug report.

In [4]: ctx = Context("ARM:LE:32:v8T")
In [5]: shellcode = b'\x2d\xe9\xf0\x41\x82\xb0\xdd\xf8\x20\x80\x06\x46\x1c\x46\x17\x46\x0d\x46\x40\x46\xa0\xf1\x3a\xf5'
In [6]: dx = ctx.disassemble(shellcode, 0x40d59375, 0, len(shellcode), 99999)
In [8]: for ins in dx.instructions:
   ...:     print(f"{ins.addr.offset:#x}/{ins.length}: {ins.mnem} {ins.body}")
   ...: 
0x40d59375/4: push {r4,r5,r6,r7,r8,lr}
0x40d59379/2: sub sp,#0x8
0x40d5937b/4: ldr.w r8,[sp,#0x20]
0x40d5937f/2: mov r6,r0
0x40d59381/2: mov r4,r3
0x40d59383/2: mov r7,r2
0x40d59385/2: mov r5,r1
0x40d59387/2: mov r0,r8
0x40d59389/4: bl 0x412f9e01

angr's VEX lifter supports automatic Thumb mode decoding, but the pcode lifter currently does not. Issue filed here: https://github.com/angr/angr/issues/4778
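
To see why the choice of decoder mode matters, it can help to look at how a Thumb stream is split into instructions. In the Thumb-2 encoding, a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit instruction; any other halfword is a complete 16-bit instruction. The sketch below is plain Python with no pypcode dependency (the helper name `thumb_lengths` is made up for illustration) and applies that rule to the ARM shellcode from this issue. It also reads the first four bytes as one ARM word to show where the `mi` condition in the bogus `mvnmis` decode comes from.

```python
import struct

# The ARM shellcode from this issue, written as a hex string.
SHELLCODE = bytes.fromhex("2de9f04182b0ddf8208006461c4617460d464046a0f13af5")


def thumb_lengths(code: bytes) -> list[int]:
    """Return the byte length of each instruction in a little-endian Thumb stream.

    Assumes the stream starts on an instruction boundary and is not truncated.
    """
    lengths = []
    offset = 0
    while offset + 2 <= len(code):
        (halfword,) = struct.unpack_from("<H", code, offset)
        # Top five bits 0b11101, 0b11110 or 0b11111 mark the first half
        # of a 32-bit Thumb-2 instruction; everything else is 16-bit.
        size = 4 if (halfword >> 11) in (0b11101, 0b11110, 0b11111) else 2
        lengths.append(size)
        offset += size
    return lengths


# Thumb view: instruction sizes in bytes.
lengths = thumb_lengths(SHELLCODE)
print(lengths)

# ARM view: the same first four bytes read as one 32-bit word.
(word,) = struct.unpack_from("<I", SHELLCODE, 0)
cond = word >> 28  # ARM condition field, bits 31:28; 0b0100 is MI
print(hex(word), hex(cond))
```

The sizes computed this way line up with the instruction lengths in the Thumb listing above, while the ARM view consumes all four bytes as a single conditional instruction.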

The CFG analysis crash you ran into is caused by something else which I've filed an issue for here: https://github.com/angr/angr/issues/4779
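
For context, the traceback ends in an AttributeError because the prologue heuristic reads `arch.thumb_prologs`, which the `ArchPcode` object does not define. The following is only a sketch of a defensive lookup, not the fix tracked in the linked issue. The two classes are stand-ins rather than angr or archinfo types, the function name is hypothetical, and the regex is a placeholder pattern, not one of archinfo's real prologue patterns.

```python
import re


class ArchWithThumb:
    """Stand-in for an arch object that defines Thumb prologue patterns."""

    # Placeholder pattern for illustration only (matches the bytes 2d e9).
    thumb_prologs = {rb"\x2d\xe9"}


class ArchWithoutThumb:
    """Stand-in for an arch object that lacks `thumb_prologs`."""


def looks_like_thumb_prologue(arch, bytes_prefix: bytes) -> bool:
    """Check the prefix against Thumb prologue patterns, if the arch has any."""
    # getattr with a default avoids the AttributeError when the attribute
    # is missing; an empty tuple simply means "no patterns to try".
    prologs = getattr(arch, "thumb_prologs", ())
    return any(re.match(prolog, bytes_prefix) for prolog in prologs)


prefix = b"\x2d\xe9\xf0\x41"
print(looks_like_thumb_prologue(ArchWithThumb(), prefix))
print(looks_like_thumb_prologue(ArchWithoutThumb(), prefix))
```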

You can set the ISA_MODE value directly using Context::setVariableDefault, like so:

In [2]: ctx = Context("MIPS:LE:32:default")
In [3]: ctx.setVariableDefault('ISA_MODE', 1)
In [4]: shellcode = b'\xfa\x64\xc1\x18\xc6\x28\x08\x04\x01\x72\xfb\x61\x5d\x67\x40\x1a\x19\x25\x90\xaa\x02\x67\x5d\x67\x00\xf0\x1b\x05'
In [5]: dx = ctx.disassemble(shellcode, 0x90489475, 0, len(shellcode), 99999)
In [7]: for ins in dx.instructions:
   ...:     print(f"{ins.addr.offset:#x}/{ins.length}: {ins.mnem} {ins.body}")
   ...: 
0x90489475/2: save 0x50,ra,s0-s1
0x90489477/4: jal 0x9098a318
0x9048947b/2: addiu a0, sp, 0x20
0x9048947d/2: cmpi v0, 0x1
0x9048947f/2: btnez 0x90489477
0x90489481/2: move v0, sp
0x90489483/4: jal 0x90489464
0x90489487/2: lhu a0, 0x20(v0)
0x90489489/2: move s0, v0
0x9048948b/2: move v0, sp
0x9048948d/4: addiu a1, sp, 0x1b

Unfortunately, pypcode doesn't have an intelligent way to automatically determine what mode things should be in, as pypcode itself is a thin wrapper around SLEIGH. As you know, Ghidra handles these mode switches in its Java-based architecture extensions. Likewise, we try to handle some of this in angr, but it is not feature complete.
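
Until that is automated, the mode can be chosen by hand. Both ARM and MIPS use bit 0 of a code address to signal the compressed ISA (Thumb or MIPS16e), which is presumably why the pypcode listings in this thread start at odd addresses while the Ghidra listing starts at 0x90489474. The sketch below shows one way a caller might split that bit off and use it to pick the settings. All helper names are hypothetical. The language IDs and the `setVariableDefault('ISA_MODE', 1)` call are the ones used earlier in this thread, and `ctx` is duck-typed so the block runs without pypcode installed. Passing 0 for regular MIPS32 is an assumption.

```python
def split_mode_bit(addr: int) -> tuple[int, int]:
    """Split an address into (aligned address, mode bit taken from bit 0)."""
    return addr & ~1, addr & 1


# Language IDs as used in this thread: plain ARM vs. the Thumb variant.
ARM_LANGUAGE_BY_MODE = {0: "ARM:LE:32:v8", 1: "ARM:LE:32:v8T"}


def arm_language_for(addr: int) -> tuple[str, int]:
    """Pick a language ID from bit 0 and return it with the aligned address."""
    aligned, mode = split_mode_bit(addr)
    return ARM_LANGUAGE_BY_MODE[mode], aligned


def configure_mips_context(ctx, addr: int) -> int:
    """Set ISA_MODE on a pypcode-style context and return the aligned address.

    `ctx` only needs a setVariableDefault(name, value) method. Mode 1 selects
    MIPS16e as shown in this thread; mode 0 for MIPS32 is assumed here.
    """
    aligned, mode = split_mode_bit(addr)
    ctx.setVariableDefault("ISA_MODE", mode)
    return aligned


print(arm_language_for(0x40D59375))
print([hex(part) for part in split_mode_bit(0x90489475)])
```

With a real pypcode `Context`, the aligned address returned here would be passed as the base address to `ctx.disassemble`, so that instruction and branch-target addresses are reported without the mode bit.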

With the remaining issues now filed individually, I'll close this issue. If you have more problems, feel free to file another issue. Thanks again for taking the time to file this detailed bug report.