EiNSTeiN- / decompiler

A decompiler with multiple backend support, written in Python. Works with IDA and Capstone.

Please make "ir-parser" "disassembler" a first-class component #9

Open pfalcon opened 9 years ago

pfalcon commented 9 years ago

IR is the most important part of this project. Converting assembly to IR is straightforward grunt work. Please let interested parties skip that boring part and start straight at experimenting with decompilation. Thanks.

pfalcon commented 9 years ago

Ah, and yes, IR needs to be documented ;-).

EiNSTeiN- commented 9 years ago

I'm not sure what you mean by that. Converting assembly into IR is done by language-specific modules (see src/ir and src/host/*/dis); it's not intended to be done by hand. The "ir parser" is meant for testing the decompiler steps without coupling the tests to a specific disassembler. Outside of that specific use case, it's much too limited to drive a full-fledged decompiler, because there would be no way to express things like operand size, operator type (floating point additions vs. integer, etc).

pfalcon commented 9 years ago

> I'm not sure what you mean by that. Converting assembly into IR is done by language-specific modules (see src/ir and src/host/*/dis)

Then you strongly couple your decompiler to particular mundane disassemblers there. If I don't own IDA and have an architecture not supported by Capstone (and it supports only the pop-up-to-boringness ones ;-) ), then I'm hosed, in the sense that I need to dig very deep into many aspects of your decompiler to interface it to something else, instead of just making that "something else" output a standard IR textual form and feeding it into your decompiler.

> it's much too limited to drive a full-fledged decompiler, because there would be no way to express things like operand size, operator type (floating point additions vs. integer, etc).

Ok, if your IR supports all those features, can you please consider extending the syntax, and adding parsing support for that (I assume dump support already works)?

And I assume you made your own IR syntax for a reason, and I can give only +1 on that, because when you look into some existing solution, you immediately get an impression that it's over-engineered, but well, a subset of LLVM syntax might work ;-).
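To make the request concrete, here is a minimal sketch of what a textual IR line carrying operand size and operator type could look like, together with a parser for it. The syntax (`dest = op.type arg, arg`), the `Instruction` class and `parse_line` are all assumptions made up for illustration; none of them come from ir_parser.py or the project's actual IR.

```python
import re
from dataclasses import dataclass

# Hypothetical syntax, not the project's: "<dest> = <op>.<type> <arg>, <arg>"
# where <type> is i<bits> for integer or f<bits> for floating point.
LINE_RE = re.compile(
    r"^\s*(?P<dest>\w+)\s*=\s*(?P<op>[a-z]+)\.(?P<kind>[if])(?P<bits>\d+)"
    r"\s+(?P<args>.+?)\s*$"
)


@dataclass(frozen=True)
class Instruction:
    dest: str
    op: str
    kind: str    # "int" or "float"
    bits: int    # operand size in bits
    args: tuple  # operand names, in order


def parse_line(line):
    """Parse one line of the hypothetical typed IR into an Instruction."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"cannot parse IR line: {line!r}")
    kind = "int" if m.group("kind") == "i" else "float"
    args = tuple(a.strip() for a in m.group("args").split(","))
    return Instruction(m.group("dest"), m.group("op"), kind,
                       int(m.group("bits")), args)
```

With this, `parse_line("eax = add.i32 eax, ebx")` yields an integer 32-bit `add`, while `add.f64` on the same operands would yield a floating-point one, which is exactly the distinction said to be missing from the test-only parser.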

EiNSTeiN- commented 9 years ago

> Then you strongly couple your decompiler to particular mundane disassemblers there.

src/ir is the generic disassembler-to-IR code. src/host is strongly coupled to the underlying disassembler (IDA, Capstone), but it can be ported over to other disassemblers fairly easily, with the generic part (in src/ir) staying the same.

> If I don't own IDA and have an architecture not supported by Capstone [...] I need to dig very deep into many aspects of your decompiler to interface it to something else

Currently this decompiler only supports Intel assembly (and not all instructions either), so if your goal is to decompile anything else, you will need to write the disassembler-to-IR code for whichever combination of host software and assembly language you wish to decompile.

> Ok, if your IR supports all those features, can you please consider extending the syntax, and adding parsing support for that (I assume dump support already works)?

The IR is not meant to be parsed from text with ir_parser.py; that is only for testing purposes. If you want to parse assembly into IR, you would not go through an intermediary "text" that can be dumped/parsed out. What you would do is write support for your target assembly language in src/ir and then write a host-specific module for the disassembler you want to use in src/host.
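The split described here (a generic, per-language lifter on one side and a thin host-specific adapter on the other) can be sketched roughly as follows. This is only an illustration of the layering: `Host`, `ListHost`, `lift_intel` and the tuple-based IR are hypothetical names and shapes, not the actual interfaces in src/ir or src/host.

```python
from abc import ABC, abstractmethod


class Host(ABC):
    """Hypothetical host interface: what a disassembler backend must provide."""

    @abstractmethod
    def instructions(self, start):
        """Yield (mnemonic, operands) tuples starting at index `start`."""


class ListHost(Host):
    """Toy host backed by an in-memory listing instead of IDA or Capstone."""

    def __init__(self, listing):
        self.listing = listing

    def instructions(self, start):
        yield from self.listing[start:]


def lift_intel(host, start=0):
    """Toy language-specific lifter: a few Intel-style mnemonics to IR tuples.

    It only talks to the Host interface, so swapping the disassembler
    does not require touching this function.
    """
    ir = []
    for mnemonic, operands in host.instructions(start):
        if mnemonic == "mov":
            dst, src = operands
            ir.append(("assign", dst, src))
        elif mnemonic == "add":
            dst, src = operands
            ir.append(("assign", dst, ("add", dst, src)))
        else:
            raise NotImplementedError(mnemonic)
    return ir
```

Supporting another disassembler would then mean writing another `Host` subclass, and supporting another assembly language would mean writing another lifter function.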

pfalcon commented 9 years ago

> If you want to parse assembly into IR, you would not go through an intermediary "text" that can be dumped/parsed out.

Sorry, that's exactly what I will do, and that's the basic requirement. It's complex stuff, so having a good (human-friendly) representation for intermediate steps is vital. Also, nobody will be able to write a "decompiler for everything", so that reduces people to writing a "decompiler for X". That immediately and drastically prunes the target user base, and it is the reason decompilation is where it is: unmaintainable C crapware like Boomerang has been in ashes for a decade, and a bunch of folks are writing new crippled toy-likes. E.g. this dude https://github.com/electrojustin/triad-decompiler has a segfaulting thing which can decompile (simple) loops, but can't eliminate superfluous assignments because it doesn't do SSA; yours can do well with contracting expressions in acyclic code, but doesn't do loops; etc., etc.

The only solution to that problem is to completely decouple "decompiler" from "convert machine-specific asm to a generic IR" part. Then maybe there will be critical mass to work on "decompiler" part. It's oh so sad that people don't see this obvious solution ;-).

EiNSTeiN- commented 9 years ago

> The only solution to that problem is to completely decouple "decompiler" from "convert machine-specific asm to a generic IR" part. Then maybe there will be critical mass to work on "decompiler" part. It's oh so sad that people don't see this obvious solution ;-).

These 2 parts are well decoupled in my code already. I guess you could write a dumper and parser for IR as it is now, but currently ir_parser.py is just a toy for testing purposes; it was never meant to parse text for decompilation purposes. I use it mostly for testing the SSA form.

Right now the only way to dump out IR (or any other intermediate decompilation step) is to use the C output class (src/output/c.py), but that is a lossy translation as it's meant to look like readable C. As I mentioned, it will lose tons of information about operands.

You could very well write a new output module that is more verbose just for IR. I would merge a PR for this without problem, but it's not on my roadmap to write one.
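As a rough idea of what such a verbose output module might do, the sketch below prints every operand with its size and every operator with its type, so nothing is dropped the way a C-like rendering drops it. The node classes (`Reg`, `Const`, `BinOp`, `Assign`) and the output format are invented for this example and are not the decompiler's real expression classes or output API.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    name: str
    bits: int


@dataclass
class Const:
    value: int
    bits: int


@dataclass
class BinOp:
    op: str      # e.g. "add"
    kind: str    # "int" or "float"
    left: object
    right: object


@dataclass
class Assign:
    dest: Reg
    value: object


def dump(node):
    """Render a node as text, keeping sizes and operator types explicit."""
    if isinstance(node, Reg):
        return f"{node.name}:{node.bits}"
    if isinstance(node, Const):
        return f"{node.value:#x}:{node.bits}"
    if isinstance(node, BinOp):
        return f"({node.kind}.{node.op} {dump(node.left)}, {dump(node.right)})"
    if isinstance(node, Assign):
        return f"{dump(node.dest)} = {dump(node.value)}"
    raise TypeError(f"unknown IR node: {node!r}")
```

For example, `dump(Assign(Reg("eax", 32), BinOp("add", "int", Reg("eax", 32), Const(4, 8))))` gives `eax:32 = (int.add eax:32, 0x4:8)`, where a C-style output would show only `eax = eax + 4`.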

pfalcon commented 9 years ago

Thanks for the explanation. I'll think about it, but lack of loop support prioritizes getting back to looking at other folks' stuff. For reference, proper conversion out of SSA in the presence of loops is where I got stuck with my crippled toy, a compiler-in-python https://github.com/pfalcon/llvm-codegen-py . I smartly left the boring parts to things like clang, and thought that using LLVM IR, which is already in SSA, would make my task much easier. Turns out, conversion out of SSA requires about the same effort and similar algos as converting to SSA, and actually one algo is similar to register allocation, so it itches to combine them, but then it only gets more complex... end result: project stuck.

Ah, and also for reference, next in my queue is https://github.com/pfalcon-mirrors/decomp-6502-arm . That does conversion out of SSA, but bugs were reported for loops, surprise. Funnily, the guy eventually just deleted the repo, prompting me to mirror this GPL code, as a tribute to vain community efforts ;-).
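For readers unfamiliar with the problem, the naive version of conversion out of SSA is sketched below: every phi is replaced by copies placed at the end of each predecessor block. The block layout, the instruction tuples and the `eliminate_phis` name are assumptions for this example and are not taken from any of the projects mentioned. Note that this naive scheme is known to be wrong in general (the "lost copy" and "swap" problems, which show up precisely with loops), so it illustrates why the task is harder than it looks rather than how to do it properly.

```python
# Opcodes that must stay last in a block (hypothetical instruction set).
TERMINATORS = {"jmp", "br", "ret"}


def eliminate_phis(blocks):
    """Naively replace each phi with copies in its predecessor blocks.

    `blocks` maps a block name to a list of instruction tuples; a phi is
    ("phi", dest, {predecessor_name: source_name}).
    Returns a new mapping; the input is not modified.
    """
    # Start from every block with its phis stripped out.
    result = {name: [ins for ins in body if ins[0] != "phi"]
              for name, body in blocks.items()}
    for body in blocks.values():
        for ins in body:
            if ins[0] != "phi":
                continue
            _, dest, sources = ins
            for pred, src in sources.items():
                pred_body = result[pred]
                # Insert before the terminator, if the block ends with one.
                at = len(pred_body)
                if pred_body and pred_body[-1][0] in TERMINATORS:
                    at -= 1
                pred_body.insert(at, ("copy", dest, src))
    return result
```

On a simple counting loop, where block `loop` starts with `("phi", "i1", {"entry": "i0", "loop": "i2"})`, this places `("copy", "i1", "i0")` at the end of `entry` and `("copy", "i1", "i2")` just before the branch at the end of `loop`.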