GrammaTech / ddisasm

A fast and accurate disassembler
https://grammatech.github.io/ddisasm/
GNU Affero General Public License v3.0
646 stars 60 forks source link

[?] outline functions in the disassembly #52

Open Trass3r opened 2 years ago

Trass3r commented 2 years ago

These facts are analyzed to identify code location, symbolization, and function boundaries.

Yet in the resulting (edit: MASM) disassembly most functions are simple labels, indistinguishable from function-internal block labels. Even an exported function is just an empty PROC EXPORT ENDP block in front of the function. Is there a way to improve this or is it simply not implemented?

Also, are non-trivial differences in the assembly to be expected? When comparing the original executable and the re-assembled one I saw NFCs like alignment or lea [x] vs lea [x+0], but also completely different registers and instructions sometimes.

aeflores commented 2 years ago

The GTIRB representation has more specific details on the function boundaries. E.g. the aux data table "functionBlocks" has a list of basic blocks contained in each function. This information is currently ignored by the MASM pretty printer. Using PROC definitions was causing problems for us in certain cases, so we avoided using them, but it's probably time to revisit that decision.

Regarding the non-trivial differences, if you can give more details or example on instructions using different registers and different instructions, we could look into it further. I don't see a reason why that should be the case.

Trass3r commented 2 years ago

This information is currently ignored by the MASM pretty printer. Using PROC definitions was causing problems for us in certain cases, so we avoided using them, but it's probably time to revisit that decision.

Ah I already suspected something like this. +1

I don't see a reason why that should be the case.

Ok thanks good to know. Not sure what the best way to compare would be so I reassembled the file and compared the objdump disassembly with TortoiseMerge. It allows you to set up regex filters to ignore hex numbers ([0-9a-fA-F]+) so you mostly get the actual differences.

I just noticed that the first big difference is actually a jump table, so the disasm is just misleading and it's ok:

jmp    DWORD PTR [eax*4+0x1000e263]

https://gist.github.com/Trass3r/db3a655ef9e5a5e32bd704786f0d52f6#file-org-asm-L15473

There are also subtle cases where an int3 is missing after a call, seems like it's a noreturn function call:

1000ccf1:   e8 0b 1a 00 00          call   0x1000e701
1000ccf6:   cc                      int3   
1000ccf7:   8b ff                   mov    edi,edi
1000ccf9:   55                      push   ebp

Is there a better approach? I guess the first step would be to fix the alignment to match. This binary actually works re-assembled but in another test case there were noticeable behavior changes.

Trass3r commented 2 years ago

Ok seems like alignment is fine, msvc link inserts a bunch of CC bytes at the start of .text for whatever reason. Using lld-link fixes that but the offsets will still get out of sync by different encodings being used for some instructions. Regarding the jump tables, maybe they can be annotated somehow to mark them as data, fwiw.

            leave 
            ret 
          BYTE 08bH
          BYTE 0ffH
$L_1000e263           DWORD $L_1000da4d
          DWORD $L_1000d84c
          DWORD $L_1000d87e
          DWORD $L_1000d8da
          DWORD $L_1000d926
          DWORD $L_1000d932
          DWORD $L_1000d978
          DWORD $L_1000daa8