albertan017 / LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models
https://arxiv.org/abs/2403.05286
MIT License
3.14k stars, 231 forks

Marry this with a reverse engineering framework like Rizin #5

Open Rot127 opened 7 months ago

Rot127 commented 7 months ago

Thanks for the very cool project! We folks from Rizin were wondering whether the results and usability would be much better if the decompiler were built on top of a proper reverse engineering framework.

The most immediate advantage would be that the model can be trained on an IL (intermediate language) instead of assembly. This would allow it to decompile any function built for the OS it was trained on, independent of the architecture, since the model reasons only about the IL and not about machine-specific assembly.
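To make the idea of feeding IL to a model more concrete, here is a minimal sketch that turns one line of S-expression style IL text into a nested Python structure and lists its operators. The sample string is copied from the ARM `plf` output in this thread; the function names (`tokenize`, `parse`, `operators`) are hypothetical helpers, not part of Rizin, and the parser assumes balanced parentheses and an atom as the head of every term.

```python
def tokenize(text):
    """Split IL text into '(', ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume one term from the front of `tokens` and return it.

    Atoms stay strings; parenthesised terms become Python lists.
    Assumes the input has balanced parentheses.
    """
    token = tokens.pop(0)
    if token != "(":
        return token
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # drop the closing ')'
    return node


def operators(term):
    """Yield the head symbol of every nested term, in pre-order."""
    if isinstance(term, list) and term:
        yield term[0]
        for child in term[1:]:
            yield from operators(child)


# IL text for the instruction at 0x10310 in the ARM example of this thread.
ARM_IL = "(set r2 (loadw 0 32 (+ (var r3) (var r2))))"

tree = parse(tokenize(ARM_IL))
print(tree)
print(list(operators(tree)))  # ['set', 'loadw', '+', 'var', 'var']
```

The operator sequence contains no ARM mnemonics, which is the property the architecture-independence argument relies on; register names such as `r2` still appear as operands and would need separate handling.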

Additionally, in the framework we can implement several algorithms that reduce noise in the IL fed to the model. E.g., if obfuscation patterns are present, the framework can first resolve them and then pass the cleaned-up IL to the model.
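As a rough illustration of what "reducing noise" could mean, the sketch below applies two small rewrite rules to an IL term held as nested Python lists. Both rules are assumptions made for this example based on how the IL is printed, not Rizin's actual passes: a `cast` to a width the operand was already cast to is treated as redundant, and `empty` entries inside a `seq` are treated as no-ops. Real deobfuscation is far more involved.

```python
def simplify(term):
    """Return a copy of `term` with two kinds of redundancy removed."""
    if not isinstance(term, list):
        return term
    # Simplify children first so nested patterns collapse bottom-up.
    term = [simplify(child) for child in term]

    # Assumed rule 1: (cast N f (cast N g x)) -> (cast N g x).
    # The inner cast already yields N bits, so the outer one changes nothing.
    if len(term) == 4 and term[0] == "cast":
        inner = term[3]
        if isinstance(inner, list) and inner[:1] == ["cast"] and inner[1] == term[1]:
            return inner

    # Assumed rule 2: drop `empty` effects from a sequence.
    if term[:1] == ["seq"]:
        effects = [child for child in term[1:] if child != "empty"]
        return ["seq"] + effects if effects else "empty"

    return term


# Shortened form of the Hexagon IL at 0x5114 from this thread.
noisy = [
    "seq", "empty",
    ["set", "R2_tmp",
     ["cast", "32", "false",
      ["cast", "32", "false", ["+", ["var", "R30"], ["var", "s"]]]]],
    "empty",
    ["set", "R2", ["var", "R2_tmp"]],
]

print(simplify(noisy))
```

Running the rules inside the framework, before the IL reaches the model, means the model never has to learn that these patterns carry no information.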

The same goes for type inference. Algorithms better suited for the job could first determine as many types as possible and then pass them to the model as additional input (assuming the model is also trained on inferred types). The same applies to flow graphs and whatever else such a model can be trained on. Additionally, you don't need to implement parsing and loading of binary types (see https://github.com/albertan017/LLM4Decompile/issues/1); you only need to get the functions and their details via the API, which in turn gives you more time to enhance the model.
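A sketch of what "get the functions via the API" might look like from Python, written against any object that offers `cmd` and `cmdj` methods in the style of the `rzpipe` bindings. The commands `aaa`, `aflj` and `plf` are Rizin shell commands; the JSON field names `name` and `offset` are an assumption about the `aflj` output and should be checked against your Rizin version. `FakePipe` is a hypothetical stand-in so the example runs without Rizin installed.

```python
def collect_function_il(pipe):
    """Return {function name: IL text} for every function the pipe reports."""
    pipe.cmd("aaa")  # let the framework analyse the binary
    il_by_name = {}
    for fn in pipe.cmdj("aflj") or []:  # function list as JSON
        # Assumed field names: "name" and "offset".
        il_by_name[fn["name"]] = pipe.cmd(f"plf @ {fn['offset']}")
    return il_by_name


class FakePipe:
    """Hypothetical stand-in that answers like a Rizin pipe for one function."""

    def __init__(self):
        self._il = {0x10304: "0x10304 (set r3 (loadw 0 32 (bv 32 0x10320)))"}

    def cmd(self, command):
        if command.startswith("plf @ "):
            return self._il[int(command.split("@")[1])]
        return ""

    def cmdj(self, command):
        if command == "aflj":
            return [{"name": "sym.call_weak_fn", "offset": 0x10304}]
        return None


print(collect_function_il(FakePipe()))
```

With a real installation, the fake would be replaced by something like `rzpipe.open("/path/to/binary")`; loading, format parsing and function discovery then happen inside the framework rather than in the model's preprocessing code.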

If you are interested, feel free to take a look at Rizin!

Rot127 commented 7 months ago

Here is an example of our IL (RzIL) and CFG for a binary of the Hexagon architecture:

┌ int main(int argc, char **argv, char **envp);
│           0x00005110      ?   allocframe(SP,#0x8):raw
│           0x00005114      [   R2 = add(FP,##-0x4)
│           0x00005118      [   memw(R2+#0x0) = ##0x0
│           0x0000511c      [   R2 = add(FP,##-0x8)
│           0x00005120      [   memw(R2+#0x0) = ##0x0
│       ┌─< 0x00005124      [   jump 0x5128
│       │   ; CODE XREFS from main @ 0x5124, 0x5150
│      ┌└─> 0x00005128      [   R2 = memw(FP+##-0x8)
│      ╎    0x0000512c      [   P0 = cmp.gt(R2,##0x2)
│      ╎┌─< 0x00005130      [   if (P0) jump:nt 0x5154
│     ┌───< 0x00005134      [   jump 0x5138
│     │╎│   ; CODE XREF from main @ 0x5134
│     └───> 0x00005138      [   call sym.pHello
│      ╎│   0x0000513c      [   call sym.pWorld
│     ┌───< 0x00005140      [   jump 0x5144
│     │╎│   ; CODE XREF from main @ 0x5140
│     └───> 0x00005144      [   R2 = memw(FP+##-0x8)
│      ╎│   0x00005148      [   R2 = add(R2,##0x1)
│      ╎│   0x0000514c      [   memw(FP+##-0x8) = R2
│      └──< 0x00005150      [   jump 0x5128
│       └─> 0x00005154      [   R0 = ##0x0
└           0x00005158      [   LR:FP = dealloc_return(FP):raw
:> plf
0x5110 (seq empty (set jump_flag false) (set u (bv 32 0x8)) (set EA (cast 32 false (+ (var R29) (bv 32 0xfffffff8)))) (storew 0 (var EA) (cast 64 false (^ (| (<< (cast 64 false (var R31)) (bv 32 0x20) false) (cast 64 false (var R30_tmp))) (<< (cast 64 false (var C17)) (bv 32 0x20) false)))) (set R30_tmp (cast 32 false (cast 32 false (var EA)))) (set R29_tmp (cast 32 false (cast 32 false (cast 32 false (- (var EA) (var u)))))) empty (set R29 (var R29_tmp)) (set R30 (var R30_tmp)) (branch (var jump_flag) (jmp (var jump_target)) (jmp (bv 32 0x5114))))
0x5114 (seq empty (set jump_flag false) (set s (bv 32 0xfffffffc)) (set R2_tmp (cast 32 false (cast 32 false (+ (var R30) (var s))))) empty (set R2 (var R2_tmp)) (branch (var jump_flag) (jmp (var jump_target)) (jmp (bv 32 0x5118))))
0x5118 (seq empty (set jump_flag false) (set u (bv 32 0x0)) (set S (bv 32 0x0)) (set EA (+ (cast 32 false (var R2)) (var u))) (storew 0 (var EA) (cast 32 false (var S))) empty (branch (var jump_flag) (jmp (var jump_target)) (jmp (bv 32 0x511c))))
0x511c (seq empty (set jump_flag false) (set s (bv 32 0xfffffff8)) (set R2_tmp (cast 32 false (cast 32 false (+ (var R30) (var s))))) empty (set R2 (var R2_tmp)) (branch (var jump_flag) (jmp (var jump_target)) (jmp (bv 32 0x5120))))
0x5120 (seq empty (set jump_flag false) (set u (bv 32 0x0)) (set S (bv 32 0x0)) (set EA (+ (cast 32 false (var R2)) (var u))) (storew 0 (var EA) (cast 32 false (var S))) empty (branch (var jump_flag) (jmp (var jump_target)) (jmp (bv 32 0x5124))))
0x5124 (seq empty (set jump_flag false) (set r (bv 32 0x4)) (set r (& (var r) (bv 32 0xfffffffc))) (set jump_flag true) (set jump_target (+ (bv 32 0x5124) (cast 32 false (var r)))) empty (branch (var jump_flag) (jmp (var jump_target)) (jmp (bv 32 0x5128))))
...
:> agF
          ┌──────────────┐
          │  0x5110 ↓    │
          └──────────────┘
              v
              │
              │
          ┌──────────────┐
          │  0x5114 ○    │
          └──────────────┘
              v
              │
              │
          ┌──────────────┐
          │  0x5118 ○    │
          └──────────────┘
              v
              │
              │
          ┌──────────────┐
          │  0x511c ○    │
          └──────────────┘
              v
              │
              │
          ┌──────────────┐
          │  0x5120 ○    │
          └──────────────┘
              v
              │
              │
          ┌──────────────┐
          │  0x5124 ○    │
          └──────────────┘
              v
              │
              └─┐ ┌──────────────────┐
                │ │                  │
          ┌──────────────┐           │
          │  0x5128 ○    │           │
          └──────────────┘           │
              v                      │
              │                      │
              │                      │
          ┌──────────────┐           │
          │  0x512c ○    │           │
          └──────────────┘           │
              v                      │
              │                      │
              │                      │
          ┌──────────────┐           │
          │  0x5130 ⤹    │           │
          └──────────────┘           │
                t f                  │
                │ │                  │
    ┌───────────┘ │                  │
    │             └─────┐            │
    │                   │            │
┌──────────────┐    ┌──────────────┐ │
│  0x5154 ○    │    │  0x5134 ○    │ │
└──────────────┘    └──────────────┘ │
    v                   v            │
    │                   │            │
    │                   │            │
┌──────────────┐    ┌──────────────┐ │
│  0x5158 ↑    │    │  0x5138 ○    │ │
└──────────────┘    └──────────────┘ │
                        v            │
                        │            │
                        │            │
                    ┌──────────────┐ │
                    │  0x513c ○    │ │
                    └──────────────┘ │
                        v            │
                        │            │
                        │            │
                    ┌──────────────┐ │
                    │  0x5140 ○    │ │
                    └──────────────┘ │
                        v            │
                        │            │
                        │            │
                    ┌──────────────┐ │
                    │  0x5144 ○    │ │
                    └──────────────┘ │
                        v            │
                        │            │
                        │            │
                    ┌──────────────┐ │
                    │  0x5148 ○    │ │
                    └──────────────┘ │
                        v            │
                        │            │
                        │            │
                    ┌──────────────┐ │
                    │  0x514c ○    │ │
                    └──────────────┘ │
                        v            │
                        │            │
                        │            │
                    ┌──────────────┐ │
                    │  0x5150 ○    │ │
                    └──────────────┘ │
                        v            │
                        │            │
                        └────────────┘
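For readers who want the graph above in a machine-usable form, the following sketch writes its edges down as a Python adjacency map and finds the loop's back edge with a depth-first search. The edge list is transcribed by hand from the `agF` output; `back_edges` is a hypothetical helper, not a Rizin API.

```python
# Per-instruction CFG of `main`, transcribed from the graph above.
CFG = {
    0x5110: [0x5114], 0x5114: [0x5118], 0x5118: [0x511C],
    0x511C: [0x5120], 0x5120: [0x5124], 0x5124: [0x5128],
    0x5128: [0x512C], 0x512C: [0x5130],
    0x5130: [0x5154, 0x5134],          # taken / not taken
    0x5154: [0x5158],
    0x5134: [0x5138], 0x5138: [0x513C], 0x513C: [0x5140],
    0x5140: [0x5144], 0x5144: [0x5148], 0x5148: [0x514C],
    0x514C: [0x5150], 0x5150: [0x5128],
}


def back_edges(graph, entry):
    """Return edges (u, v) where v is still on the DFS stack when u is visited."""
    ON_STACK, DONE = 1, 2
    state = {}
    found = []

    def visit(node):
        state[node] = ON_STACK
        for succ in graph.get(node, []):
            if state.get(succ) == ON_STACK:
                found.append((node, succ))
            elif succ not in state:
                visit(succ)
        state[node] = DONE

    visit(entry)
    return found


for src, dst in back_edges(CFG, 0x5110):
    print(f"back edge: {src:#x} -> {dst:#x}")  # back edge: 0x5150 -> 0x5128
```

The single back edge corresponds to the `jump 0x5128` at `0x5150` in the disassembly, i.e. the loop that calls `pHello` and `pWorld`. Structural facts like this could be handed to a model alongside the IL.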

And for ARM with full analysis:

:> pdf
        ╎   ; CALL XREF from sym._init @ 0x1027c
┌ sym.call_weak_fn();
│       ╎   0x00010304      ldr   r3, [aav.aav.0x00010294]             ; [data.00010320:4]=0x10294 aav.0x00010294
│       ╎   0x00010308      ldr   r2, [data.00010324]                  ; [0x10324:4]=28
│       ╎   0x0001030c      add   r3, pc, r3                           ; 0x205a8
│       ╎                                                              ; obj._GLOBAL_OFFSET_TABLE
│       ╎   0x00010310      ldr   r2, [r3, r2]                         ; 0x205c4
│       ╎                                                              ; reloc.__gmon_start.205c4
│       ╎   ; DATA XREF from sym..plt @ 0x10288
│       ╎   ; UNKNOWN XREF from aav.0x00010294 @ 
│       ╎   ;-- aav.0x00010314:
│       ╎   0x00010314      cmp   r2, 0
│       ╎   0x00010318      bxeq  lr
└       └─< 0x0001031c      b     loc.imp.__gmon_start
> plf
0x10304 (set r3 (loadw 0 32 (bv 32 0x10320)))
0x10308 (set r2 (loadw 0 32 (bv 32 0x10324)))
0x1030c (set r3 (+ (bv 32 0x10314) (var r3)))
0x10310 (set r2 (loadw 0 32 (+ (var r3) (var r2))))
0x10314 (seq (set a (var r2)) (set b (bv 32 0x0)) (set res (- (var a) (var b))) (set cf (ule (var b) (var a))) (set vf (&& (^^ (msb (var a)) (msb (var b))) (^^ (msb (var a)) (msb (var res))))) (set zf (is_zero (var res))) (set nf (msb (var res))))
0x10318 (branch (var zf) (jmp (var lr)) nop)
0x1031c (jmp (bv 32 0x102b0))

VelocityRa commented 7 months ago

Rizin/Cutter in my experience is, to put it bluntly, a toy compared to Ghidra (especially the decompilers).

So it's a good idea to train on IL, but perhaps a better choice would be to train on Ghidra's IR, P-code. It supports lifting from 30+ architectures and the decompiler is far more advanced.

XVilka commented 7 months ago

The P-code is very old and outdated. RzIL is based on state-of-the-art research conducted by the CMU team in their BAP framework. It is not a toy. And, unlike Ghidra, it's written in plain C and designed to be used as a library. Embedding Ghidra into something is always problematic. Moreover, there exists rz-ghidra that integrates the Ghidra decompiler into the Rizin/Cutter.

VelocityRa commented 7 months ago

> The P-code is very old and outdated.

No explanation here - outdated how? Most real-world target architectures used today are 'outdated' too; why would IL formats need to fundamentally change in recent years? If it's mostly just to support new architectures (that no one uses), I would not place much importance on this.

> RzIL is based on state-of-the-art research conducted by the CMU team in their BAP framework.

Who? Why are they more qualified than a large team of paid National Security Agency engineers/researchers working on this problem for decades?

Also the doc you linked explains almost nothing besides containing some high-level/abstracted academic self-m*sturbation. And of course contains no comparisons to existing research/tooling.

I've been active in research-engineering spaces for years (including decompilation projects - a target audience) and I can count the number of people with something positive to say about cutter on one hand (especially post-Ghidra). I trust that anecdotal experience alone more than the comment of someone with "Rizin and Cutter evangelist" in their bio.

> And, unlike Ghidra, it's written in plain C

I would consider this a disadvantage.

> and designed to be used as a library.

Ghidra can run headless for future tooling integration, and there are ways to use C/C++ code or external native tools with it. Either way I see your point, this is your strongest argument I think, but in my opinion the quality of the output matters much more than any possible annoyances here.
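For context on the headless mode mentioned here, this is a small sketch of how a tool might assemble an `analyzeHeadless` invocation. The launcher location under `support/` and the `-import`, `-postScript` and `-deleteProject` options come from Ghidra's headless analyzer; every path, the project name and the script name `ExportDecompiled.py` are hypothetical placeholders. The command is only built, not executed.

```python
from pathlib import Path


def headless_command(ghidra_home, project_dir, project_name, binary, script):
    """Build the argument list for one headless import-and-script run."""
    return [
        str(Path(ghidra_home) / "support" / "analyzeHeadless"),
        str(project_dir),          # where the temporary project lives
        project_name,
        "-import", str(binary),    # binary to load and analyse
        "-postScript", script,     # script to run after analysis
        "-deleteProject",          # do not keep the project afterwards
    ]


cmd = headless_command(
    "/opt/ghidra",            # hypothetical install location
    "/tmp/ghidra-projects",   # hypothetical project directory
    "llm4decompile",          # hypothetical project name
    "/tmp/sample.bin",        # hypothetical input binary
    "ExportDecompiled.py",    # hypothetical post-analysis script
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`. Each run starts a JVM, which is the kind of integration overhead the library argument refers to.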

> Moreover, there exists rz-ghidra that integrates the Ghidra decompiler into the Rizin/Cutter.

Indeed, it's even shown in the only screenshot on the cutter readme, which actually makes me less confident about how good rizin/cutter itself is.


Edit: I would be glad to be proven wrong btw, and sorry if I got personal there. I'm always looking for better RE tooling. Just not really convinced it would be good to lock future such projects into a rizin/cutter workflow, which I consider to not be an 'industry standard', or even a 'hobbyist/open-source standard' post-Ghidra.

albertan017 commented 7 months ago

Thanks! We're working on Ghidra now, as it's widely employed in RE. Rizin also looks very interesting, and we will study it!