clearbluejar / ghidriff

Python Command-Line Ghidra Binary Diffing Engine
https://clearbluejar.github.io/ghidriff/
GNU General Public License v3.0

Add a way to export matches #93

Closed: m417z closed this issue 4 months ago

m417z commented 4 months ago

The currently generated JSON file contains lots of useful information about changed functions, but it lacks information about unchanged matched functions. Such information can be useful for tasks such as migrating function offsets to other versions of an executable.

I played around with the current version of ghidriff, and adding the following snippet:

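        # Group matches by source address:
        # {src_addr: [[dst_addr, match_types], ...]}
        # Assumes `json` is imported at the top of the module.
        matches_serializable = {}
        for match_addrs, match_types in matches.items():
            matches_serializable.setdefault(str(match_addrs[0]), []).append([
                str(match_addrs[1]),
                match_types,
            ])
        with open('matches.json', 'w') as f:
            json.dump(matches_serializable, f, indent=2)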

After this loop: https://github.com/clearbluejar/ghidriff/blob/4979ec0678f026ea60880796b5156745e23504e6/ghidriff/version_tracking_diff.py#L238-L243

Worked great for me.

The match types are important to have to assess the confidence of each match.
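As a rough illustration of how the match types could feed a confidence check, here is a minimal sketch that takes data in the `matches.json` layout from the snippet above and keeps only the destinations found by at least one trusted match type. The match type names and the addresses are made up for the example; they are assumptions, not necessarily what ghidriff emits.

```python
# Hypothetical match type names, for illustration only.
TRUSTED_TYPES = {"ExactBytes", "ExactInstructions"}


def confident_matches(matches, trusted=TRUSTED_TYPES):
    """Keep only destinations matched by at least one trusted match type.

    `matches` follows the layout written by the snippet above:
    {src_addr: [[dst_addr, match_types], ...]}
    `match_types` may be a list of names or a dict keyed by name;
    iterating either one yields the names.
    """
    result = {}
    for src, candidates in matches.items():
        kept = [dst for dst, types in candidates if trusted.intersection(types)]
        if kept:
            result[src] = kept
    return result


# Made-up sample data in the same shape as matches.json.
sample = {
    "00401000": [["00402000", ["ExactBytes"]]],
    "00401100": [["00402200", ["StructuralGuess"]]],
    "00401200": [
        ["00402400", ["StructuralGuess", "ExactInstructions"]],
        ["00402480", ["StructuralGuess"]],
    ],
}

print(confident_matches(sample))
```

Which match types count as trusted is a judgment call that depends on the binaries being compared, so the set is passed in as a parameter rather than fixed.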

The data can be written to a separate JSON file, or just added to the JSON that's already being created. In my test, the size of the matches JSON is around 0.6% of the size of the generated JSON, so the report size will stay nearly unchanged.
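To make the offset migration use case concrete, here is a minimal sketch of how such a file could be consumed, assuming the `matches.json` layout from the snippet above. The function `migrate_offsets`, the names, and the addresses are hypothetical and not part of ghidriff.

```python
def migrate_offsets(known, matches):
    """Map names known in the old binary to addresses in the new one.

    `known` is {name: old_addr}; `matches` uses the matches.json layout
    {old_addr: [[new_addr, match_types], ...]}.
    Old addresses with zero or several candidates are reported as
    unresolved instead of being guessed.
    """
    migrated, unresolved = {}, []
    for name, old_addr in known.items():
        candidates = matches.get(old_addr, [])
        if len(candidates) == 1:
            migrated[name] = candidates[0][0]
        else:
            unresolved.append(name)
    return migrated, unresolved


# Made-up sample data; with a real report this would come from
# json.load() on the generated file.
known = {
    "CreateWidget": "00401000",
    "DestroyWidget": "00401100",
    "ResizeWidget": "00401200",
}
matches = {
    "00401000": [["00402000", ["ExactBytes"]]],
    "00401200": [
        ["00402400", ["StructuralGuess"]],
        ["00402480", ["StructuralGuess"]],
    ],
}

migrated, unresolved = migrate_offsets(known, matches)
print(migrated)
print(unresolved)
```

Ambiguous or missing matches are left for manual review here; a stricter or looser policy could use the match types to pick a candidate.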

clearbluejar commented 4 months ago

That matches.json creation looks good. A few clarifications.

The proposed implementation keys the matches by address (match_addrs[0]) rather than by function name (getName(True)). Would it be useful to have both matches-addresses.json and matches-names.json?

m417z commented 4 months ago

When we discussed this, you said that given one executable with symbols, and one without, it's best to analyze the diff without providing the symbols for the first executable. It worked well for me. In this case, there aren't many function names. So at least for me for this use case, there would be no use for matches-names.json.

clearbluejar commented 4 months ago

Agreed. For the case of missing symbols, addresses are the better choice.

I guess in your case though you had a list of known addresses you wanted to match? Is that right?

There might be a scenario in which only the names of the functions to be matched are known. I will play around with it a bit and see what works. At minimum I will provide the address matches, and I will consider adding name matches.

m417z commented 4 months ago

> I guess in your case though you had a list of known addresses you wanted to match? Is that right?

That's right, I already had the mapping I needed for the first binary in a convenient format, so both symbols and addresses. But addresses are really more reliable in this case, especially with C++ symbols with function overloads, mangling, etc.
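A small sketch of the overload problem, using made-up function names and addresses: when two overloads share one demangled name, a name-keyed lookup becomes ambiguous, while each address still identifies exactly one function.

```python
from collections import defaultdict

# Made-up functions: two overloads share one demangled name.
functions = [
    ("00401000", "Widget::resize"),
    ("00401040", "Widget::resize"),
    ("00401100", "Widget::draw"),
]


def ambiguous_names(funcs):
    """Return the names that map to more than one address."""
    by_name = defaultdict(list)
    for addr, name in funcs:
        by_name[name].append(addr)
    return {name: addrs for name, addrs in by_name.items() if len(addrs) > 1}


print(ambiguous_names(functions))
```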

clearbluejar commented 4 months ago

@m417z, let me know what you think of #94.

I could make it optional in the future, but for now I just added another JSON output with both address and name versions. I will review later whether to make it on by default or enabled by a flag.

m417z commented 4 months ago

Looks good. I won't be using name_matches, but address_matches is exactly what I need and I'll be able to stop using a fork, so that's good enough for me.