clearbluejar / ghidriff

Python Command-Line Ghidra Binary Diffing Engine
https://clearbluejar.github.io/ghidriff/
GNU General Public License v3.0

Add option to use PDB MSDIA instead of PDB Universal #65

Open justanotheranonymoususer opened 9 months ago

justanotheranonymoususer commented 9 months ago

For large binaries, PDB Universal fails with an out-of-memory (OOM) error. See: https://github.com/NationalSecurityAgency/ghidra/issues/2485

For this reason I couldn't try this tool with my binary.

Please add a command line option to switch to MSDIA.

image

clearbluejar commented 9 months ago

This request speaks to a larger requirement: being able to provide custom analyzer options to ghidriff. I have been meaning to do that, and it shouldn't be too hard, as I already set some custom ones.

For example, if you save the options shown in the screenshot above, it generates a custom options file like:

{
  "SAVE_STATE_NAME": "File_Options",
  "VALUES": {
    "WindowsPE x86 Propagate External Parameters": true,
    "Aggressive Instruction Finder": true,
    "PDB Universal.Search remote symbol servers": true,
    "Condense Filler Bytes": true,
    "Decompiler Parameter ID": true,
    "Variadic Function Signature Override": true,
    "PDB MSDIA": true
  },
  "TYPES": {
    "WindowsPE x86 Propagate External Parameters": "boolean",
    "Aggressive Instruction Finder": "boolean",
    "PDB Universal.Search remote symbol servers": "boolean",
    "Condense Filler Bytes": "boolean",
    "Decompiler Parameter ID": "boolean",
    "Variadic Function Signature Override": "boolean",
    "PDB MSDIA": "boolean"
  },
  "ENUM_CLASSES": {}
}

I think in short order I could support that in ghidriff, as a command line option to supply custom analysis. What do you think?
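As a rough illustration of what consuming such an options file could look like, here is a minimal sketch in plain Python. It is not ghidriff's implementation: the function name `load_analysis_options` and the converter table are hypothetical, and only the `"boolean"` type is taken from the sample above (the other type names are assumptions).

```python
import json
from typing import Any, Callable, Dict

# Converters keyed by the names found in the "TYPES" map.
# Only "boolean" appears in the sample file; "int" and "string" are assumed.
_CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "boolean": lambda v: v if isinstance(v, bool) else str(v).lower() == "true",
    "int": int,
    "string": str,
}


def load_analysis_options(text: str) -> Dict[str, Any]:
    """Parse a saved options file and return {option name: typed value}."""
    doc = json.loads(text)
    values = doc.get("VALUES", {})
    types = doc.get("TYPES", {})
    options: Dict[str, Any] = {}
    for name, raw in values.items():
        # Fall back to the raw JSON value when the type is unknown or missing.
        convert = _CONVERTERS.get(types.get(name), lambda v: v)
        options[name] = convert(raw)
    return options


sample = """
{
  "SAVE_STATE_NAME": "File_Options",
  "VALUES": {"PDB MSDIA": true, "Condense Filler Bytes": true},
  "TYPES": {"PDB MSDIA": "boolean", "Condense Filler Bytes": "boolean"},
  "ENUM_CLASSES": {}
}
"""

if __name__ == "__main__":
    for name, value in load_analysis_options(sample).items():
        print(f"{name} = {value}")
```

The resulting dictionary would then be applied to each program's analysis options before analysis starts; that step depends on Ghidra and is left out here.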

Alternatively, at the moment, if you want to try your already analyzed file in Ghidra, just export each binary to the Ghidra Zipped format. See the latest release picture. You can export the binaries to my_large_bin1.gzf and my_large_bin2.gzf, then pass the already analyzed binaries to ghidriff for diffing.

ghidriff my_large_bin1.gzf my_large_bin2.gzf

I just put this out though, so I am curious about the results. Let me know if you try it and whether it works for you. Based on your feedback, I'll likely create a ticket to support custom analysis options generally.

justanotheranonymoususer commented 9 months ago

"I think in short order I could support that in ghidriff" - sounds good, maybe something like:

--analysis-option="PDB MSDIA=true"

Or a JSON file that will be used to override options.
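A repeatable NAME=VALUE flag like the one proposed above could be parsed roughly as follows. This is only a sketch: `--analysis-option` is the flag suggested in this comment, not an existing ghidriff option, and `parse_analysis_options` is a hypothetical helper.

```python
import argparse
from typing import Dict, List, Optional, Union


def parse_analysis_options(argv: Optional[List[str]] = None) -> Dict[str, Union[bool, str]]:
    """Collect repeated --analysis-option NAME=VALUE flags into a dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--analysis-option",
        action="append",          # allow the flag to be given several times
        default=[],
        metavar="NAME=VALUE",
    )
    args = parser.parse_args(argv)

    options: Dict[str, Union[bool, str]] = {}
    for item in args.analysis_option:
        # Split on the first "=" only, so values may contain "=" themselves.
        name, sep, value = item.partition("=")
        if not sep or not name.strip():
            parser.error(f"expected NAME=VALUE, got {item!r}")
        value = value.strip()
        if value.lower() in ("true", "false"):
            options[name.strip()] = value.lower() == "true"
        else:
            options[name.strip()] = value
    return options


if __name__ == "__main__":
    print(parse_analysis_options(["--analysis-option=PDB MSDIA=true"]))
```

Option names that contain spaces or dots, such as "PDB Universal.Search remote symbol servers", pass through unchanged because only the first "=" is treated as a separator.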

"if you want to try your already analyzed file in Ghidra" - frankly, I already used BinDiff, but I'll try that later.

justanotheranonymoususer commented 9 months ago

Downloading PDBs always fails for me; I had to use another tool to download them:

INFO | ghidriff | Setting up Symbol Server for symbols...
INFO | ghidriff | path: ghidriffs\symbols level: 1
INFO | ghidriff | Symbol Server Configured path: SymbolServerService:
        symbolStore: LocalSymbolStore: [ rootDir: C:\Users\User\Desktop\diff2\ghidriffs\symbols, storageLevel: -1],
        symbolServers:
                HttpSymbolServer: [ url: https://msdl.microsoft.com/download/symbols/, storageLevel: -1]
                HttpSymbolServer: [ url: https://chromium-browser-symsrv.commondatastorage.googleapis.com/, storageLevel: -1]
                HttpSymbolServer: [ url: https://symbols.mozilla.org/, storageLevel: -1]
                HttpSymbolServer: [ url: https://software.intel.com/sites/downloads/symbols/, storageLevel: -1]
                HttpSymbolServer: [ url: https://driver-symbols.nvidia.com/, storageLevel: -1]
                HttpSymbolServer: [ url: https://download.amd.com/dir/bin/, storageLevel: -1]
INFO  Connecting to https://msdl.microsoft.com/download/symbols/ (ConsoleTaskMonitor)
INFO  Success (ConsoleTaskMonitor)
INFO  Storing <XXX>.pdb in local symbol store (338.91MB) (ConsoleTaskMonitor)
WARN  SymbolServerService: error copying file https://msdl.microsoft.com/download/symbols/<XXX>.pdb/<YYY>/<XXX>.pdb to C:\Users\User\Desktop\diff2\ghidriffs\symbols: closed (SymbolServerService)
INFO  Connecting to https://msdl.microsoft.com/download/symbols/ (ConsoleTaskMonitor)
INFO  Success (ConsoleTaskMonitor)
INFO  Storing <XXX>.pdb in local symbol store (338.91MB) (ConsoleTaskMonitor)
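For reference, the manual download mentioned above can be sketched with the standard library alone. The URL layout (server / file name / key / file name) is taken from the log lines; the helper names `symbol_url` and `download` are hypothetical, and the key (the `<YYY>` part in the log) has to be supplied by the caller.

```python
import shutil
import urllib.request
from pathlib import Path


def symbol_url(server: str, file_name: str, key: str) -> str:
    """Build a symbol-server URL of the form <server>/<file>/<key>/<file>."""
    return f"{server.rstrip('/')}/{file_name}/{key}/{file_name}"


def download(url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Stream a (possibly large) file to disk without loading it into memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out, length=1024 * 1024)  # copy in 1 MiB chunks
    tmp.replace(dest)  # only expose the file once the download has finished
    return dest
```

Writing to a temporary ".part" file first means an interrupted transfer does not leave a truncated PDB under the final name.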

Then I got this assert:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Scripts\ghidriff.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\__main__.py", line 82, in main
    pdiff = d.diff_bins(diff[0], diff[1])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\ghidra_diff_engine.py", line 1170, in diff_bins
    assert sym_count_diff < 4000, f'Symbols counts between programs ({p1.name} and {p2.name}) are too high {sym_count_diff}! Likely bad analyiss or only one binary has symbols! Check Ghidra analysis or pdb! Add --force-diff to ignore this assert'
           ^^^^^^^^^^^^^^^^^^^^^
AssertionError: Symbols counts between programs (<XXX>_1.dll and <XXX>-2.dll) are too high 82149! Likely bad analyiss or only one binary has symbols! Check Ghidra analysis or pdb! Add --force-diff to ignore this assert

BTW typo: analyiss

I added --force-diff, now it seems to be working, I'm waiting for it to complete.

clearbluejar commented 9 months ago

Symbols counts between programs (<XXX>_1.dll and <XXX>-2.dll) are too high 82149!

If one version has symbols and the other doesn't, it becomes difficult to match the functions because Ghidra will have a different set of functions for each binary. So sometimes functions won't be aligned. That assertion is there to let you know you are stepping into a diff that might not work.

That being said, I have seen even partial diffs be useful. There is also an option to run without symbols, which can sometimes be the better choice when the analysis with and without symbols differs this much. It all depends on the binaries.
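The guard described here can be sketched as a small standalone function. The 4000 threshold and the --force-diff override come from the traceback above; computing the difference as an absolute value and the name `check_symbol_counts` are assumptions, not ghidriff's actual code.

```python
def check_symbol_counts(count_old: int, count_new: int,
                        limit: int = 4000, force_diff: bool = False) -> int:
    """Return the symbol-count difference, refusing large gaps unless forced."""
    # Assumption: the difference is the absolute gap between the two counts.
    diff = abs(count_old - count_new)
    if diff >= limit and not force_diff:
        raise ValueError(
            f"Symbol counts differ by {diff} (limit {limit}); "
            "one binary may be missing symbols. Pass force_diff=True to continue."
        )
    return diff


if __name__ == "__main__":
    # A gap as large as the one reported in this thread only passes when forced.
    print(check_symbol_counts(100_000, 17_851, force_diff=True))
```

Raising an explicit exception instead of using `assert` keeps the check active even when Python runs with optimizations enabled.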

clearbluejar commented 9 months ago

Did the diff finish?

justanotheranonymoususer commented 9 months ago

If one version has symbols and the other doesn't

I don't think that's the case, file size is similar. Here are both files:

old: https://msdl.microsoft.com/download/symbols/windows.ui.xaml.dll/9C04CA1E1226000/windows.ui.xaml.dll

new: https://msdl.microsoft.com/download/symbols/windows.ui.xaml.dll/A6D203221226000/windows.ui.xaml.dll

Did the diff finish?

It failed with:

...
INFO | ghidriff | Completed 5111 at 95%
WARNING| ghidriff | Code diff type not appended for ?close_reset@?$close_invoke_helper@$00P6AXPEAX@_E$1?ReleaseMutex@details@wil@@YAX0@ZPEAX@details@wil@@SAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?close_reset@?$close_invoke_helper@$00P6AXPEAX@_E$1?CloseHandle@details@wil@@YAX0@ZPEAX@details@wil@@SAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?OSMemoryFree@XcpAllocation@@YAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?OSMemoryFree@XcpAllocation@@YAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?OSMemoryFree@XcpAllocation@@YAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?ReleaseWeak@control_block@details@xref@@QEAAIXZ due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?_Tidy@?$vector@Vxstring_ptr@@V?$allocator@Vxstring_ptr@@@std@@@std@@AEAAXXZ due to jumptable decomp issue
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Scripts\ghidriff.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\__main__.py", line 82, in main
    pdiff = d.diff_bins(diff[0], diff[1])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\ghidra_diff_engine.py", line 1446, in diff_bins
    pdiff['old_pe_url'] = self.get_pe_download_url(old, pdiff['old_meta'][pe_key])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\ghidra_diff_engine.py", line 820, in get_pe_download_url
    pe_info = get_pe_extra_data(path)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\utils.py", line 41, in get_pe_extra_data
    machine = unpack('<H', word)[0]
              ^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 2 bytes

clearbluejar commented 9 months ago

Ah, it seems like the pe_url generation is failing for that binary.

That isn't a critical function. It just gives you a nice wget command line for fetching the original binary. Like this:

image

Which seems like another issue to resolve. :)
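The `struct.error` in the traceback means `unpack('<H', word)` was handed a buffer that was not exactly 2 bytes long, which is what a short read produces. Below is a minimal defensive sketch of reading the PE header fields that make up the symbol-server key, returning `None` instead of raising when the data is not a complete PE image. The function name `pe_symbol_server_key` is hypothetical, and the key format (8 uppercase hex digits of TimeDateStamp followed by SizeOfImage in hex) is assumed from the URLs posted earlier in this thread.

```python
import struct
from typing import Optional


def pe_symbol_server_key(data: bytes) -> Optional[str]:
    """Return '<TimeDateStamp><SizeOfImage>' for a PE image, or None if invalid."""
    # DOS header: 'MZ' magic, with e_lfanew (offset of the PE header) at 0x3C.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)

    # PE signature (4 bytes) + COFF header (20 bytes) + optional header up to
    # and including SizeOfImage (offset 56, 4 bytes).
    needed = e_lfanew + 4 + 20 + 60
    if len(data) < needed or data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None

    (timestamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    (size_of_image,) = struct.unpack_from("<I", data, e_lfanew + 24 + 56)
    return f"{timestamp:08X}{size_of_image:x}"
```

Using `struct.unpack_from` with explicit length checks avoids depending on how many bytes a file read happened to return. A caller could skip the download-URL step whenever the result is `None`, since that output is optional.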

Storing windows.ui.xaml.pdb in local symbol store (338.91MB) (ConsoleTaskMonitor)

The PDB for the binary is about 339 MB! Wow.

And the binary is 18MB...

I just kicked off a local test. I will see if it survives it.

justanotheranonymoususer commented 9 months ago

That's not so large; Chromium PDBs are several GB.

clearbluejar commented 9 months ago

This is how analysis is going:

image

I ran out of heap and actually crashed the JVM. This happened during Ghidra analysis (before ghidriff does any work of its own). I can bump up the JVM heap, but how much will I need? How much RAM are you working with? I can also turn off threading with --no-threaded so it only analyzes one binary at a time. Trying again.
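For scripted runs, the flags mentioned in this thread can be assembled like this. It is a small sketch: `build_ghidriff_cmd` is a hypothetical helper, and only `--no-threaded` and `--force-diff` (both named above) are used; no heap-size flag is shown because none is given in this discussion.

```python
import subprocess
from typing import List


def build_ghidriff_cmd(old: str, new: str,
                       threaded: bool = True, force_diff: bool = False) -> List[str]:
    """Assemble a ghidriff command line from the options discussed here."""
    cmd = ["ghidriff", old, new]
    if not threaded:
        cmd.append("--no-threaded")   # analyze one binary at a time
    if force_diff:
        cmd.append("--force-diff")    # ignore the symbol-count assertion
    return cmd


if __name__ == "__main__":
    cmd = build_ghidriff_cmd("my_large_bin1.gzf", "my_large_bin2.gzf", threaded=False)
    print(" ".join(cmd))
    # To actually run it (requires ghidriff to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.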

justanotheranonymoususer commented 9 months ago

Did you use MSDIA? As for RAM, I think I used 16 GB.

clearbluejar commented 9 months ago

Ah no, I was just using the command line on Linux with the regular PDB Universal. Maybe it can't handle it...

justanotheranonymoususer commented 9 months ago

Yeah, that's the issue I linked at the beginning

clearbluejar commented 9 months ago

Full circle. 🤦‍♂️ Sorry.

I have yet to use MSDIA with Ghidra. Besides setting the analysis option and having to run it on Windows (that is a requirement for MSDIA, right?), is there anything else you need to run on the PDB to make it work? Or is MSDIA just another parser for the PDB that handles large ones better, so there is no preprocessing needed and it can just run with the original PDB?

justanotheranonymoususer commented 9 months ago

I think MSDIA is just another parser for the PDB that handles large ones better, so there is no preprocessing needed. And probably Windows only indeed, but I'm not sure.

clearbluejar commented 9 months ago

I will need to get back to you once I can test on Windows. I will try to add the options JSON import so that all the Ghidra analysis settings can be set.

justanotheranonymoususer commented 9 months ago

Now that Ghidra 11 has been released with some PDB improvements, maybe it won't OOM anymore. Worth trying.