CERTCC / kaiju

CERT Kaiju is a binary analysis framework extension for the Ghidra software reverse engineering suite. This repository is the primary, canonical repository for this project -- file bug reports and wishes here!

Best way to use Kaiju/FN2Hash to match between two programs? #31

Open MattMills opened 1 year ago

MattMills commented 1 year ago

Is your feature request related to a problem? Please describe.

I currently use fn2hash/Kaiju to reconcile functions and symbols between a version of an application that doesn't have any symbols and a version that has debug symbols. Because these are slightly different versions compiled on the same OS with the same compiler, it is highly successful.

However, what I don't see within Kaiju is how to "import" or resolve the existing list of fn2hashes against a second application in a useful way. Currently I use a self-built postgres database to match fn2hashes across multiple versions, as my primary use case was resolving symbols from an unlabeled stack trace (it has export symbols + an offset, and my web app resolves the offset into the actual address and then the relevant symbol via fn2hash or some other custom code).
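For illustration, the stack-trace step described above (export symbol plus offset, resolved to an address and then to a function) could look roughly like the sketch below. It is not the actual web app code. The `exports` and `functions` structures are hypothetical, and it assumes each function runs from its start address up to the start of the next one.

```python
import bisect

def resolve_frame(export_name, offset, exports, functions):
    """Resolve 'export + offset' to (function_name, offset_into_function).

    exports:   dict of export name -> address (hypothetical layout)
    functions: list of (start_address, name), sorted by start_address
    Returns None when the address falls before the first known function.
    """
    target = exports[export_name] + offset
    starts = [start for start, _ in functions]
    # Index of the last function whose start address is <= target.
    i = bisect.bisect_right(starts, target) - 1
    if i < 0:
        return None
    start, name = functions[i]
    return name, target - start
```

The function name found this way can then be looked up by its fn2hash in the database to find the same function in the version that has symbols.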

Describe the solution you'd like

fn2hash identifies many of the useful functions. It seems like all it needs is a mechanism to apply fn2hash.csv from one program, along with all its symbols, onto a second program (other Ghidra actions would probably be useful too, like creating functions where they don't exist). Alternatively, it could keep a database of multiple fn2hash sets from different programs and match against them during the analysis process.

Describe alternatives you've considered

I've considered writing a Python script to apply the symbol data naively, using just the addresses of known function matches, but I figured there may be better solutions, so I thought I'd bring it up here and see if it sparked any interest or suggestions.

Additional context

sei-gwassermann commented 1 year ago

Thank you @MattMills for reaching out! I am glad to hear Kaiju and fn2hash have been useful to you. I think there are two potential answers to your question that I can give, and hopefully they will help us decide on a good path forward!

First, if you're comparing against a small known set of binaries, there is an experimental (read: still kinda buggy & simple) tool in Kaiju called FSE (Function Set Extractor), which runs fn2hash and compares the hashes across all the binaries imported into the currently open Ghidra project. Right now it only displays a simple grid representing which hashes are shared among which files. We've had some internal discussion about improving this tool and I have some notes that I'm working from, but I'd appreciate any feedback you have on how to make the tool more useful and user-friendly.
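To make the grid idea concrete, here is a minimal sketch of a "which hashes are shared among which files" table. It is not FSE's implementation. The input layout (a dict of file name to a set of hash strings) is an assumption for illustration.

```python
from collections import defaultdict

def shared_hash_grid(hashes_by_file):
    """Return (files, grid) where grid maps each shared hash to a row of
    booleans, one per file, marking which files contain that hash.

    hashes_by_file: dict of file name -> set of hash strings (hypothetical)
    Hashes that appear in only one file are left out.
    """
    files_by_hash = defaultdict(set)
    for filename, hashes in hashes_by_file.items():
        for h in hashes:
            files_by_hash[h].add(filename)
    files = sorted(hashes_by_file)
    grid = {
        h: [f in owners for f in files]
        for h, owners in files_by_hash.items()
        if len(owners) > 1
    }
    return files, grid
```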

Second, we've also discussed a cross-project database of hashes that could be loaded. Ghidra supports sharing this type of data between clients when run in server mode, so that is an idea I've explored a little, but I haven't released anything yet. If that is along the lines of what you're thinking, then that might confirm it as a good direction to keep working in!

MattMills commented 1 year ago

Thanks @sei-gwassermann

I've messed around a bit in previous versions with the xref viewer but I couldn't get it to work; looking at the function intersection viewer on https://insights.sei.cmu.edu/blog/introducing-cert-kaiju-malware-analysis-tools-for-ghidra/ I don't think it'd be particularly useful for my use case only because I'm looking at relatively large binaries (about 90k functions).

A cross-project database sounds useful, but I haven't tried using Ghidra in server mode since it's just me. What would be ideal is a public central database server that the analyzer could submit its results to, getting back any (exact) matching hashes with some info on the source executable or symbols if available. But I'd guess you wouldn't be particularly interested in building something with that kind of open access (for malware purposes).

(I've also just realized that I have some export scripts of my own that export the name and address of every function.) For me, it would be enough to take the export CSV from the function hash viewer, add the function name, and store it somewhere so it can be matched against future analysis, with the known symbol applied for exact matches and functions defined and disassembled where Ghidra's auto-analysis didn't identify them.

Currently I iterate through the hashes looking for a match from most to least specific, and if there is only 1 result then I assume it is likely correct, since all the disassembly I'm looking at is the same thing but slightly different versions or built for a different OS, or built with slightly different included libraries. Since I have a PDB for one version, that version has a ton more useful info and is where I do most of my work.
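A minimal sketch of that "most to least specific, accept only unique matches" loop is below. The column names (`exact_hash`, `pic_hash`, `address`, `name`) are hypothetical and would need adjusting to whatever the real fn2hash export and symbol export contain.

```python
from collections import defaultdict

# Hypothetical hash column names, ordered from most to least specific.
HASH_LEVELS = ("exact_hash", "pic_hash")

def build_index(rows, level):
    """Group the known (symbolized) rows by their hash at one level."""
    index = defaultdict(list)
    for row in rows:
        h = row.get(level)
        if h:
            index[h].append(row)
    return index

def match_functions(known_rows, unknown_rows, levels=HASH_LEVELS):
    """Map each unknown function address to (known_name, level_used).

    A match is accepted only when exactly one known function shares the
    hash at that level; otherwise the next, less specific level is tried.
    """
    indexes = {level: build_index(known_rows, level) for level in levels}
    matches = {}
    for row in unknown_rows:
        for level in levels:
            candidates = indexes[level].get(row.get(level), [])
            if len(candidates) == 1:
                matches[row["address"]] = (candidates[0]["name"], level)
                break
    return matches
```

Recording which level produced each match makes it easy to review the less specific matches separately before trusting them.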

Hope that is useful context. If you guys aren't interested in pursuing this, I will take a simple approach: resolve addresses between two versions using hashes in a postgres database, export a function symbol / address mapping, and write a Python script to import those symbols at the matching addresses in Ghidra. But there is clearly a lot more information that could be extracted and imported after correlating the code using fn2hashes.
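The import step could be sketched roughly as follows. This is an assumption-laden outline, not a tested Ghidra script: the `address,name` CSV layout is hypothetical, and `api` is expected to be an object with `toAddr`, `getFunctionAt`, and `createFunction` methods in the style of Ghidra's `FlatProgramAPI`.

```python
import csv

def load_mapping(lines):
    """Parse 'address,name' CSV lines (hypothetical layout) into pairs."""
    reader = csv.DictReader(lines)
    return [(row["address"], row["name"]) for row in reader]

def apply_symbols(api, mapping, source_type):
    """Rename existing functions, or create them where none exists.

    api:         FlatProgramAPI-like object (assumed interface)
    mapping:     iterable of (address_string, function_name)
    source_type: value passed through to setName, e.g. a Ghidra SourceType
    Returns a (renamed, created, failed) tuple of counts.
    """
    renamed = created = failed = 0
    for address, name in mapping:
        addr = api.toAddr(address)
        func = api.getFunctionAt(addr)
        if func is not None:
            func.setName(name, source_type)
            renamed += 1
        elif api.createFunction(addr, name) is not None:
            created += 1
        else:
            failed += 1
    return renamed, created, failed
```

Inside a Ghidra script, `api` would presumably be `FlatProgramAPI(currentProgram)` and `source_type` would be `SourceType.USER_DEFINED`, with the mapping loaded from an open file handle. Whether the address strings parse as-is depends on how they were exported.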