NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Ghidra does not find C strings #3508

Closed patric-r closed 2 years ago

patric-r commented 2 years ago

Describe the bug

In order to provide a binary that reproduces bug #3498, I started creating 'fuzzed' binaries. When I load such a binary into Ghidra, it takes a long time to process what is a rather simple binary (12 MB): it consumes a lot of CPU and the UI becomes unresponsive several times, but in the end analysis finishes.

The binary contains and uses many C strings (just hex numbers with no more than 3 digits) as arguments to printf(). However, I cannot see them (except the first string, "Hi there!") in the "Defined Strings" view.

Why?

Environment (please complete the following information):

patric-r commented 2 years ago

ghidra_bug_3508_binary_1.zip Attached binary which shows the issue.

dragonmacher commented 2 years ago

The binary contains and uses many C strings (just hex numbers with no more than 3 digits) as arguments to printf(). However, I cannot see them (except the first string, "Hi there!") in the "Defined Strings" view. Why?

The Defined Strings view only shows strings that have had string data types applied. If there are more strings in the binary that are not appearing in that view, then Ghidra analysis has not found them and created string data types.

You can try using the Search -> For Strings... to find strings in the binary that Ghidra has not identified.

patric-r commented 2 years ago

All strings are used as arguments to printf() and are null-terminated. I assumed that Ghidra "knows" each of them must be a string because it is passed to printf(), which takes a string.

Why does Ghidra automatically identify the first string, "Hi there!", which is used in the same way as all the other strings? Is it because the other strings are shorter (fewer than 4 characters)?

dragonmacher commented 2 years ago

Why does Ghidra automatically identify the first string, "Hi there!", which is used in the same way as all the other strings? Is it because the other strings are shorter (fewer than 4 characters)?

I haven't looked into it, but the length is likely the issue. The Search For Strings widget lets you change the size threshold, and from that widget you can tell Ghidra to create any strings it finds. As for analysis, it looks like the ASCII Strings analyzer has its default minimum length set to 5, with a lower bound of 4.
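To illustrate why a length threshold hides the short strings, here is a minimal sketch in plain Python. It is not Ghidra's actual analyzer, which may apply further heuristics; it is just a naive scan for NUL-terminated runs of printable ASCII. The function name `find_c_strings` and the sample bytes are hypothetical; the threshold of 5 is the default mentioned above.

```python
def find_c_strings(data: bytes, min_len: int):
    """Return (offset, text) for each NUL-terminated run of printable
    ASCII that is at least min_len characters long.

    Simplified illustration only, not Ghidra's algorithm.
    """
    found = []
    start = None  # offset where the current printable run began
    for i, byte in enumerate(data):
        if 0x20 <= byte <= 0x7E:  # printable ASCII, space through '~'
            if start is None:
                start = i
        else:
            # A run counts only if it ends in NUL and is long enough.
            if byte == 0 and start is not None and i - start >= min_len:
                found.append((start, data[start:i].decode("ascii")))
            start = None
    return found


# Hypothetical data: one long string followed by short hex-number strings.
blob = b"Hi there!\x001a3\x00ff\x00c0de5\x00"

default_hits = find_c_strings(blob, min_len=5)  # threshold of 5, as above
short_hits = find_c_strings(blob, min_len=2)    # lowered threshold

print(default_hits)
print(short_hits)
```

With the threshold at 5, only "Hi there!" and the five-character string are reported; the two- and three-digit hex strings show up only once the threshold is lowered, which matches the behavior described in this issue.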

patric-r commented 2 years ago

You're right, it was the string length.

I adjusted my synthetic binary (now using strings longer than 4 characters), which might be interesting for optimizing Ghidra. It is now a win64 binary of 31 MB: ghidra_bug_3508_binary_2.zip

Observations when analyzing it in Ghidra:

dragonmacher commented 2 years ago

Thanks for the info.

(unfortunately, it does not reproduce bug #3498, "'Defined Strings' view causes Ghidra to hang / 100% CPU for one thread forever")

That is a shame. My hunch is that there must be really large data types in the original binary.

The memory consumption seems a bit odd. I wonder if that is just the garbage collector's behavior during intense CPU processing: not reclaiming memory because it had plenty to spare?

patric-r commented 2 years ago

The memory consumption seems a bit odd. I wonder if that is just the garbage collector's behavior during intense CPU processing: not reclaiming memory because it had plenty to spare?

Nope. I had to increase the max heap size from the initial 8 GB to 16 GB, which still wasn't sufficient; with 28 GB it finally worked. Ghidra really needs that amount of memory for that binary.

dragonmacher commented 2 years ago

Interesting. Seems that we have an opportunity for improvement.

patric-r commented 2 years ago

Indeed. Because we are seeing two spikes in the heap usage diagram, it looks like two analyzers might have room for improvement. BTW, I used the default analyzer selection.

astrelsky commented 2 years ago

One of the culprits is most likely Non-Returning Functions - Discovered. It's recursive and incorrectly marks everything as no-return. Everyone already knows they have to disable it, though.

:p

astrelsky commented 2 years ago
2021-10-14 20:59:51 INFO  (AutoAnalysisManager) -----------------------------------------------------
    ASCII Strings                              3.688 secs
    Apply Data Archives                        0.971 secs
    Call Convention ID                         0.282 secs
    Call-Fixup Installer                       0.005 secs
    Create Address Tables                      5.153 secs
    Create Function                           10.007 secs
    DWARF                                      1.037 secs
    Data Reference                            28.863 secs
    Decompiler Switch Analysis                 6.675 secs
    Decompiler Switch Analysis - One Time      1.700 secs
    Demangler Microsoft                        0.039 secs
    Disassemble Entry Points                  62.149 secs
    Disassemble Entry Points - One Time        0.030 secs
    Embedded Media                             0.343 secs
    External Entry References                  0.001 secs
    Function ID                               10.125 secs
    Function Start Search                      0.169 secs
    Non-Returning Functions - Discovered      11.621 secs
    Non-Returning Functions - Known            0.048 secs
    PDB Universal                              0.003 secs
    Reference                                 84.862 secs
    Scalar Operand References                 42.975 secs
    Shared Return Calls                        2.136 secs
    Stack                                    134.548 secs
    Subroutine References                      5.758 secs
    Subroutine References - One Time           0.003 secs
    Windows x86 PE Exception Handling          1.224 secs
    Windows x86 PE RTTI Analyzer               0.077 secs
    WindowsResourceReference                   0.272 secs
    x86 Constant Reference Analyzer          356.919 secs
-----------------------------------------------------
     Total Time   771 secs
-----------------------------------------------------

2021-10-14 20:59:51 DEBUG (ToolTaskManager) Thu Oct 14 20:59:51 EDT 2021 Auto Analysis task finish (771.829 secs)  

I actually tried to profile it for several hours. I thought that since I no longer had a potato, I could just profile all the Ghidra classes and be fine. Well, yeah, no, it wasn't. Just disassembling took a few hours with that profiling setup, so I figured I would at least provide this for now.

So yeah, my above assumption, totally wrong.
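For anyone who wants to rank the analyzers from a timing table like the one above, here is a small sketch in plain Python. The line format is assumed to be exactly what the log shows (name, seconds, the word "secs"); the `LOG` sample is a subset of the lines above, and the helper names `parse_timings` and `rank` are hypothetical.

```python
import re

# A subset of the timing lines from the log above.
LOG = """\
    ASCII Strings                              3.688 secs
    Data Reference                            28.863 secs
    Disassemble Entry Points                  62.149 secs
    Non-Returning Functions - Discovered      11.621 secs
    Reference                                 84.862 secs
    Scalar Operand References                 42.975 secs
    Stack                                    134.548 secs
    x86 Constant Reference Analyzer          356.919 secs
"""

# Analyzer name, whitespace, a decimal number, then the literal "secs".
LINE = re.compile(r"^\s*(?P<name>.+?)\s+(?P<secs>\d+\.\d+) secs\s*$")


def parse_timings(log: str) -> dict:
    """Map analyzer name to seconds for every line matching the format."""
    timings = {}
    for line in log.splitlines():
        match = LINE.match(line)
        if match:
            timings[match.group("name")] = float(match.group("secs"))
    return timings


def rank(timings: dict) -> list:
    """Return (name, seconds) pairs, slowest analyzer first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


timings = parse_timings(LOG)
ranked = rank(timings)
for name, secs in ranked[:3]:
    print(f"{name:<40}{secs:>10.3f} secs")
```

On the full log, the x86 Constant Reference Analyzer accounts for 356.919 of the roughly 771 seconds (about 46%), while Non-Returning Functions - Discovered takes 11.621 seconds, which is consistent with the correction above.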

ryanmkurtz commented 2 years ago

Fixed by 0d676540a000493855e3ecc5b57d706d2456e70e