NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Caching/Saving Decompilation & Access via API #6231

Open BlackMagicCoding opened 7 months ago

BlackMagicCoding commented 7 months ago

Ahoi!

Context

I have been reversing a fairly complex program for a few months now. It includes library functions for interfacing with the Lua scripting language. I have already named and typed a number of key Lua functions that are referenced a lot, so some parameters of the functions calling them should be inferable as well. That is not the case in my repository, though. It seems the parameter types have somehow been hard-set to int instead of undefined4; with undefined4, the decompilation view would automatically show a more specific type whenever one can be inferred, e.g. from the function calls inside. This might have been caused by me running the Auto Analyze action with the Decompiler Parameter ID option once - oops... I could be wrong, though. My goal now is to fix those parameter types (and some names) in bulk via a script, wherever I am certain they can be inferred correctly.

Approach & Example

Here is a function I have not yet addressed manually, FUN_0048c5b0, which calls two Lua functions: lua_pushnumber and lua_pushnil.

[image: decompilation of FUN_0048c5b0]

As you can see, the decompilation shows param_1 being cast to lua_State *, since both functions take that as an argument. From this I can infer that param_1 is of type lua_State * and should be named L (that is simply the convention in those Lua functions). For completeness and context, here are the signatures of both:

void __cdecl lua_pushnumber(lua_State *L,lua_Number n)
void __cdecl lua_pushnil(lua_State *L)
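The inference described above can be sketched outside of Ghidra as a small fixed-point propagation. This is only a toy model under stated assumptions: the `CallSite` class, the `name#index` key format, and the caller name `FUN_caller` are hypothetical and not part of any Ghidra API. It assumes a call site has already been reduced to "caller passes its own parameter i as argument j of the callee".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ParamTypePropagation {
    /** Hypothetical record of one call: caller forwards its parameter to a callee argument. */
    static final class CallSite {
        final String caller; final int callerParam;
        final String callee; final int calleeArg;
        CallSite(String caller, int callerParam, String callee, int calleeArg) {
            this.caller = caller; this.callerParam = callerParam;
            this.callee = callee; this.calleeArg = calleeArg;
        }
    }

    /** Keys are "functionName#paramIndex"; values are type names. */
    static Map<String, String> propagate(Map<String, String> known, List<CallSite> sites) {
        Map<String, String> types = new HashMap<>(known);
        boolean changed = true;
        // Repeat until a full pass adds nothing: a newly typed caller
        // can itself become evidence for the functions that call it.
        while (changed) {
            changed = false;
            for (CallSite s : sites) {
                String calleeType = types.get(s.callee + "#" + s.calleeArg);
                String callerKey = s.caller + "#" + s.callerParam;
                if (calleeType != null && !types.containsKey(callerKey)) {
                    types.put(callerKey, calleeType);
                    changed = true;
                }
            }
        }
        return types;
    }
}
```

The sketch never overwrites a type that is already set, which mirrors the goal of only changing parameters when the inference is certain.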

I already whipped up a very simplistic Ghidra script to gather and print info. It runs over all 29564 functions, gets their decompiled C code, does a string search for (lua_State *), and extracts the variable name that follows (if it really is a cast of a variable). Here is the code for that script (please excuse the very crude code, it's just a work in progress lol):

import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileResults;
import ghidra.app.decompiler.DecompiledFunction;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.FunctionIterator;

public class RenameRetypeCastsBulk extends GhidraScript {

    @Override
    public void run() throws Exception {
        int startIndex;
        int endIndex;
        int foundIndex;
        char c;
        String var;
        int cnt = 0;
        int total = getCurrentProgram().getFunctionManager().getFunctionCount();
        String searchForCast = "lua_State *"; // hardcoded value right now, will be dynamic user input later
        int decompileTimeout = 10; // seconds

        DecompInterface decompiler = new DecompInterface();
        decompiler.openProgram(currentProgram);

        FunctionIterator iterator = getCurrentProgram().getFunctionManager().getFunctions(true);
        while (iterator.hasNext() && !getMonitor().isCancelled()) {
            cnt++;
            setToolStatusMessage("Progress: " + cnt + " / " + total, false);
            startIndex = 0;

            Function function = iterator.next();
            // pass the script's monitor so a long-running decompile can be cancelled
            DecompileResults decompileResults = decompiler.decompileFunction(function, decompileTimeout, getMonitor());
            DecompiledFunction decompiledFunction =  decompileResults.getDecompiledFunction();
            if(decompiledFunction == null) {
                println("Unable to get decompiled function for " + function.getName());
                continue;
            }
            String decompiledCode = decompiledFunction.getC();
            do {
                foundIndex = decompiledCode.indexOf("(" + searchForCast + ")", startIndex);
                if(foundIndex != -1) {
                    c = decompiledCode.charAt(foundIndex + searchForCast.length() + 2);
                    // very simplistic check if cast is followed by a variable
                    // currently a valid variable is considered starting with a letter or underscore, and only consisting of letters/numbers/underscores thereafter
                    if(!Character.isLetter(c) && c != '_') {
                        startIndex = foundIndex + searchForCast.length() + 3; // setting up startIndex to try another search for a variable cast
                        continue; // not a valid variable name - probably casting a scalar value
                    }
                    for(endIndex = foundIndex + searchForCast.length() + 3; endIndex < decompiledCode.length(); endIndex++) {
                        c = decompiledCode.charAt(endIndex);
                        if(!Character.isLetterOrDigit(c) && c != '_') break;
                    }
                    var = decompiledCode.substring(foundIndex + searchForCast.length() + 2, endIndex);
                    println(function.getName() + " found cast of " + var);
                    break; // found cast variable, no need to search further after first hit
                }
            } while(foundIndex != -1);
        }
        decompiler.dispose(); // shut down the native decompiler process once all functions are processed
    }

}
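The manual character scanning in the script could also be expressed as a regular expression. The following is a minimal standalone sketch, not a drop-in replacement: the class and method names are hypothetical, it only accepts ASCII identifiers, and unlike the script it tolerates whitespace between the cast and the variable.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CastFinder {
    /**
     * Returns the first identifier that directly follows "(castType)" in the
     * decompiled text. Casts of scalars such as "(lua_State *)0x0" are skipped,
     * because an identifier must start with a letter or underscore.
     */
    static Optional<String> firstCastVariable(String decompiledCode, String castType) {
        Pattern p = Pattern.compile(
            "\\(" + Pattern.quote(castType) + "\\)\\s*([A-Za-z_][A-Za-z0-9_]*)");
        Matcher m = p.matcher(decompiledCode);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "lua_pushnumber((lua_State *)param_1,1.0);";
        // prints "param_1"
        System.out.println(firstCastVariable(sample, "lua_State *").orElse("<none>"));
    }
}
```

`Pattern.quote` keeps the `*` in `lua_State *` from being interpreted as a regex quantifier, which matters once the cast type becomes user input.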

Improvements

As you might guess, this is rather slooow. It seems to do a fresh decompile of each and every function when calling decompileFunction(), instead of loading anything from a cache, even when running the same script again and again without altering anything. My goal is to adjust the script later so that it does not just print the casts it finds but also alters the function parameters, and then to run it multiple times, since those alterations will make lua_State * casts pop up in new places because the altered functions now require them. If I am not completely mistaken, IDA saves its decompilation output persistently in its database/repository and loads it when the same function is viewed again. Something like that would be a huge performance boost when dealing with chonky functions or doing bulk edits like mine.
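Until something like this exists in Ghidra itself, a script could at least avoid repeated work within a single run by memoizing decompiler output and dropping entries for functions it has just altered. The sketch below is a plain-Java illustration of that idea only. `DecompileCache` and its methods are hypothetical names, the `decompile` function is a stand-in for whatever wraps the real decompiler call, and nothing here persists across script runs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DecompileCache {
    private final Map<String, String> cache = new HashMap<>();
    // Hypothetical stand-in: maps a function key (e.g. its name or entry
    // address as text) to decompiled C code.
    private final Function<String, String> decompile;
    private int misses = 0;

    DecompileCache(Function<String, String> decompile) {
        this.decompile = decompile;
    }

    /** Returns cached output, decompiling only on the first request per key. */
    String get(String functionKey) {
        return cache.computeIfAbsent(functionKey, k -> {
            misses++;
            return decompile.apply(k);
        });
    }

    /** Call after changing a function's signature so it is decompiled again. */
    void invalidate(String functionKey) {
        cache.remove(functionKey);
    }

    /** Number of times the underlying decompile step actually ran. */
    int misses() {
        return misses;
    }
}
```

When a parameter type changes, the callers of that function would need to be invalidated as well, since their decompilation depends on the callee's signature.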

Implementation

I am absolutely aware that this is the exact opposite of a trivial matter, and that there certainly will be some further questions popping up when starting to implement. Here is what came to my mind when thinking about the implementation:

What are your thoughts on this?

Best regards, BMC

BlackMagicCoding commented 3 months ago

Heyo @ryanmkurtz and others, just wanted to poke a little, asking if this is something you still might consider, or if you think it is very unlikely. I am absolutely not in a hurry at all, just wanted to reel in some form of update on it ^^ Best regards, BMC

ryanmkurtz commented 3 months ago

I directed the ticket to the right team member.