NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0
51.93k stars 5.89k forks source link

Array indices unrecognised, producing over complex/poor C output #5720

Open Wall-AF opened 1 year ago

Wall-AF commented 1 year ago

Describe the bug In a loop looking at elements of an array, Ghidra isn't realising that a local variable added to an address is actually equivalent to looking at the same field in the next(/previous) array element. (It doesn't seem to be checking the local variables increment (or decrement) against the size of the array's element!) This is producing expressions like (*(char *)((int)&DgnPhoneme_0x4_t_ARRAY_1210_9b34[0].b2 + local_4) == '-') instead of (DgnPhoneme_0x4_t_ARRAY_1210_9b34[nItem].b2 == '-') where nItem is the loop index.

With nested arrays, it gets worse!

To Reproduce Steps to reproduce the behaviour:

  1. Load in the enclosed function (from the Decompile:Panels Debug Function Decompilation menu)
  2. Search down the decompiled code to find the output above
  3. See issue

Expected behaviour Use of correct(/better) array syntax in decompiled C output.

Screenshots N/A

Attachments dragon_FUN_1160_0049.zip

Environment (please complete the following information):

Additional context This is an extract from a Windows NE DLL (circa 1990s)

Wall-AF commented 1 year ago

Here's yet another (simpler) example where the local variable pData is incremented to get to the next element (instead of using the loop counter as the array index).

Wall-AF commented 1 year ago

FYI, in the initial attachment, there's another example where the parameter ((&((DgnPhoneme_0x4_t *)DgnPhoneme_0x4_t_ARRAY_1210_9b34)->b0 + iVar4)) in the function call below

            pDVar5 = getPicPhonemeData((DgnPhoneme_0x4_t *)
                                       CONCAT22((char *)"User_Iterate" + 10,
                                                (DgnPhoneme_0x4_t *)
                                                (&((DgnPhoneme_0x4_t *)
                                                  DgnPhoneme_0x4_t_ARRAY_1210_9b34)->b0 +
                                                iVar4)));

is displaying the same/similar problem!

With the segment issues fixed, it should be decompiled as (something akin to) pDVar5 = getPicPhonemeData(DgnPhoneme_0x4_t_ARRAY_1210_9b34[nItem]);

Wall-AF commented 1 year ago

This might help with your diagnosis as it has arrays that display well (where a value is extracted) and others that don't (I think due to needing their address and it being the result of a calculation) like local_12 = pThis->field10_0x308 + local_8; it shoud be (I believe) local_12 = &pThis->field10_0x308[local_8];


I forgot to say that in this case as local_8 is incremented by 1 - and that is the array element size - the compiler didn't require a second variable to add to the base object as is the norm for arrays of elements of size >1.

Wall-AF commented 1 year ago

Here is a very short function that displays this issue in it's return argument!

Wall-AF commented 1 year ago

Just now I created an all-encompassing structure from a series of memory locations in my data segment and prior to this Ghidra was able to output a reference to an array using the syntax: ((byte *)aBytes)[i] = 0;

Now I get: *(undefined *)((int)g_Obj0x47a_DS_9e86.aBytes + i) = 0;

It seems that Ghidra's lost it's knowledge of the byte array just because it is now an element of the structure! How can this be fixed?

Wall-AF commented 1 year ago

The highlighted section in the image below shows what should be a simple assignment of a newly realloc'd pointer to an object of size nMaxWs * 2 bytes (which is actually an array of size nMaxWs of type (u)int), but is split into two (overcomplicated assignments) just because the return of functions use two 16-bit registers instead of one 32-bit register. image I've used a comment to show the better syntax that Ghidra needs to be capable of producing and an illegal variable name apUnodes[i] to try to help describe what's going on!

This kind of missing simplification by Ghidra gets even worse when in another function the address of a particular element of the reallocd array is required!

Can this kind of issue be overcome?

Wall-AF commented 1 year ago

I believe the highlighted section of the image in the previous comment is incorrect and it should be image

That mistake just illustrates the difficulties trying to correctly interpret this kind of output!

Wall-AF commented 1 year ago

Included here is a follow-up function that uses the allocted memory stored in those pointers. By using illegal variable names I've annotated most of the indexing where possible and in comments otherwise.


An interesting compiler optimisation within this example function is that a couple of local variables are created based upon the address of the first element (beginning of) the this structure. These are then used to point to differently sized (int vs long) array objects embedded within the this structure and are incremented by their appropriate size. (The two examples I've named pThis2Offset&pThis4Offset.) Can this kind of usage be understood and used to produce better C output?