arizvisa / ida-minsc

A plugin based on IDAPython for a functional DWIM interface. Current development against most recent IDA is in the "persistence-refactor" branch, ancient (but stable) work is in "master", so... create an issue if you want/need something backported. Use "Wiki" or "Discussions" for examples, and smash that "Star" button if you like this.
BSD 3-Clause "New" or "Revised" License

Feature: More support for the Hex-Rays decompiler #14

Open arizvisa opened 5 years ago

arizvisa commented 5 years ago

There's some basic support for interacting with the decompiler within the function module, but it's very minimal and is only used to match instructions to lines of code. I've had a number of issues with instability (crashy-crashy) when calling the different functions that are exposed by IDAPython and so I'm always very hesitant to script it.

Another reason why I'm hesitant to use the decompiler is that although I have a license for it, it's never used because I (straight-up) prefer assembly listings due to the implicit information that is exposed to the user such as function locality (boundaries for object files before the linker gets ahold of them), or the hot-paths as determined by the compiler.

Nonetheless, the decompiler is a significant part of IDA and thus it deserves its place within ida-minsc. Unfortunately, I'm not sure what's useful to expose to a user because of the previously mentioned reasons. If anybody has any suggestions or wants to contribute, please let me know in this thread.

arizvisa commented 3 years ago

The ctree API that Hex-Rays exposes is very clumsy with the current Python interface. In general, the pattern matching that's available in various functional programming languages would be ideal for parsing the Hex-Rays tree. In my head I'd probably choose something similar to ML's syntax for pattern matching, but Mathematica definitely has the most powerful and flexible interface. Although the minsc plugin does include some support for basic pattern matching, it doesn't support wildcards for expressions, which is the major reason why one would want to pattern match to begin with.

As mentioned in patois/HexRaysToolbox#1, PEP-0636 (https://www.python.org/dev/peps/pep-0636/) and friends expose a pattern-matching syntax to Python which might allow a more reasonable way to query things in Hex-Rays' expression tree. SymPy was suggested, but I couldn't get SymPy to match more than once (and it was stupid memory-intensive and non-performant anyways).

Anyways, if PEP-0636 becomes a thing and we want to wrap a layer around Hex-Rays, then this means that minsc would need to be ported to Python 3, and I'm really not ready for that fight at the moment.

arizvisa commented 1 year ago

As per #158, this next refactor is working towards proper support of the Hex-Rays decompiler. As opposed to using PEP-0636, I've instead decided upon using something based on "tree-sitter" (https://tree-sitter.github.io/tree-sitter/) to actually match against the CTree API. Tree-sitter's fuzzy matching is actually super-fucking awesome and is a great way to isolate specific code snippets that have been generated by the decompiler.

The idea is that you'd use a Tree-sitter query (you'd write your query in C) to do a higher-level match and then from those results you extract the tokens you want. These tokens can then be converted to the Hex-Rays microcode so you can examine the details if you need them. Currently it's possible to get the references for a microcode instruction using the use-def and def-use chains (to get them by basic-block), but I plan to eventually bring these down from basic-block to the actual microinstruction.
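
To make the granularity point a bit more concrete, here's a rough sketch of def-use chains tracked per instruction inside a single basic block. Everything in it is an assumption for illustration: the instructions are plain `(defined, used)` tuples and `def_use_chains` is a hypothetical helper, so none of this is the Hex-Rays microcode API or the way minsc actually does it.

```python
from collections import defaultdict

# Each toy instruction is (defined_register_or_None, [used_registers]).
# This is a hypothetical model, not the Hex-Rays microcode classes.
def def_use_chains(block):
    """Map each defining instruction index to the indices that use its value."""
    last_def = {}               # register -> index of its most recent definition
    chains = defaultdict(list)  # defining index -> list of using indices
    for index, (defined, used) in enumerate(block):
        # Link every use back to the definition that currently reaches it.
        for reg in used:
            if reg in last_def:
                chains[last_def[reg]].append(index)
        # Record the new definition after processing the uses, so that
        # "eax = eax * ebx" uses the previous value of eax.
        if defined is not None:
            last_def[defined] = index
            chains.setdefault(index, [])
    return dict(chains)

block = [
    ("eax", []),              # 0: eax = 1
    ("ebx", ["eax"]),         # 1: ebx = eax + 2
    ("eax", ["eax", "ebx"]),  # 2: eax = eax * ebx
    (None, ["eax"]),          # 3: return eax
]
chains = def_use_chains(block)
```

Here `chains` maps instruction 0 to its users 1 and 2, instruction 1 to its user 2, and instruction 2 to its user 3. Block-level chains would only say that the block as a whole defines and uses `eax` and `ebx`.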

Eventually if I find a use for it, I might bring them into SSA form..but at the moment I haven't found a a real need for it since typically I only care about the inputs that feed into an expression between functions rather than trying to do full-blown program analysis.

arizvisa commented 8 months ago

Currently, I've switched away from using the lvar_t in favor of the lvar_locator_t. This is because if a function or its mba gets re-generated, the lvar_t goes out of scope. Essentially, lvar_t is weakly-referenced and doesn't have any real meaning if the decompiler isn't being used.
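
A toy sketch of why holding onto a locator is safer than holding onto the variable object. The classes and functions here (`Locator`, `LocalVar`, `decompile`, `find`) are hypothetical, and the `location`/`defea` fields are only loosely modeled on lvar_locator_t; the real types carry more than this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Locator:
    """Hypothetical stable key: where the variable lives and where it is defined."""
    location: str  # a register name or stack slot, kept as a plain string here
    defea: int     # address of the defining instruction

class LocalVar:
    """Toy variable object that only belongs to one decompilation result."""
    def __init__(self, name, location, defea):
        self.name = name
        self.locator = Locator(location, defea)

def decompile():
    """Pretend to decompile: returns brand new variable objects on every call."""
    return [LocalVar("v1", "rax", 0x401000), LocalVar("v2", "stack:-8", 0x401010)]

def find(variables, locator):
    """Re-resolve a locator against a fresh decompilation result."""
    for var in variables:
        if var.locator == locator:
            return var
    return None

first = decompile()
saved = first[0].locator       # keep the locator, not the object
second = decompile()           # the function got re-generated
resolved = find(second, saved)
```

After the second `decompile()`, `first[0]` belongs to a stale result, but `saved` still resolves to the matching variable in the new one because it compares by value.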

Still interested in using tree-sitter for parsing, though. I'm hesitant about it because I don't want to have to maintain a tree-sitter grammar specifically for Hex-Rays. Personally, I'd rather have something similar to Prolog for performing queries. It straight-up feels more elegant. However, normies will probably not understand why it's better, so tree-sitter is probably the best bet. There's also something to be said about using a syntax-highlighter's tokens for performing actual queries.
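
To give a feel for what a Prolog-ish query could buy you, here's a minimal sketch of one-way matching with logic variables over nested tuples. It is not full unification and not a proposal for the actual interface: the `?name` convention, the tuple encoding of the tree, and the `is_var` and `unify` helpers are all assumptions made up for this example.

```python
def is_var(term):
    """Logic variables are strings starting with '?' (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")

def unify(pattern, tree, bindings=None):
    """Return variable bindings if pattern matches tree, otherwise None."""
    bindings = dict(bindings or {})
    if is_var(pattern):
        # A variable seen before must match the same subtree again.
        if pattern in bindings:
            return bindings if bindings[pattern] == tree else None
        bindings[pattern] = tree
        return bindings
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        if len(pattern) != len(tree):
            return None
        # Match children left to right, threading the bindings through.
        for sub_pattern, sub_tree in zip(pattern, tree):
            bindings = unify(sub_pattern, sub_tree, bindings)
            if bindings is None:
                return None
        return bindings
    # Anything else is a literal and must compare equal.
    return bindings if pattern == tree else None

# "?x = ?x + ?n": an assignment that updates a variable from itself.
pattern = ("asg", "?x", ("add", "?x", "?n"))
increment = ("asg", ("var", "i"), ("add", ("var", "i"), ("num", 1)))
other = ("asg", ("var", "i"), ("add", ("var", "j"), ("num", 1)))
```

`unify(pattern, increment)` binds `?x` to `("var", "i")` and `?n` to `("num", 1)`, while `unify(pattern, other)` returns `None` because `?x` can't be both `i` and `j`. Using the same variable twice in a pattern is the part that falls out naturally here.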