NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0
51.28k stars 5.84k forks source link

Multiple armv8 SIMD integer instructions decompiled wrong #3957

Open rfalke opened 2 years ago

rfalke commented 2 years ago

Describe the bug There is a test subject which is so constructed that each function tests a specific armv8 SIMD integer instruction. If the decompiler is correct it will write 3 zeros to global variables. ghidra doesn't so this in call cases.

To Reproduce Decompile https://github.com/rfalke/decompiler-subjects/blob/master/from_holdec/stress_arm64/arm64_macho_simdInt_single_inst__1_var/subject.exe

Expected behavior All functions should be decompiled like

undefined8 _inst_7_var_0(void)

{
  _dest_gp = 0;
  _dest_high = 0;
  _dest_low = 0;
  return 0;
}

Instead sometimes ghidra doesn't have an implementation for an instruction like in:

undefined8 _inst_64_var_0(void)

{
  undefined auVar1 [16];

  auVar1 = NEON_clz(CONCAT115(0x82,CONCAT114(0x47,CONCAT113(0x4b,CONCAT112(0x48,CONCAT111(0xb8,
                                                  CONCAT110(0x58,CONCAT19(0x53,CONCAT18(0xe2,
                                                  0x8eac03f1892825da)))))))),2);
  _dest_gp = 0;
  _dest_high = (SUB168(auVar1 >> 0x40,0) ^ 0xceb48b931707638) + 0x7db8b4b647a7ac1d;
  _dest_low = (SUB168(auVar1,0) ^ 0x8eeb03b931707638) + 0x7153fc0876d7da28;
  return 0;
}

Or (and this is the scope of this bug report) decompiles it wrong. As in:

undefined8 _inst_33_var_0(void)

{
  _dest_gp = 0;
  _dest_high = 0x5e77ffff556d0000;
  _dest_low = 0;
  return 0;
}

Attachments

ghidra_armv8_issues.txt

Environment (please complete the following information):

CmP-lt commented 1 year ago

It looks like implementations of most instructions that are listed in ghidra_armv8_issues.txt don't pass internal tests due to one or more of the following reasons (as can be checked in AARCH64neon.sinc):

@GhidorahRex are there any workarounds for instructions with semantics that can't be implemented accurately due to absence of native support for floating-point/integer rounding modes or it is more of a fundamental limitation that can't be overcome without extending sleigh?

GhidorahRex commented 1 year ago

@CmP-lt For the most part, the workaround is to use a pseudoop. I recognize that it's going to lead to some ugly decompilation for some of those situations. Rounding modes and saturation are tricky, and sleigh doesn't really have the intrinsics to make those instructions play nicely without a lot of additional pcode - and that will most likely end up making the decompilation look just as ugly in different ways.

If something is implemented and wrong though, that's a bigger cause for concern.

CmP-lt commented 1 year ago

So, as far as I understood, the options for instructions with nearly correct implementations that can't be corrected without a lot of additional pcode (that will end up causing ugly decompilation in cases when things can't be simplified out) are to leave them as is or to replace their implementation with pcode op. Neither option will allow decompilation to match expected one from issue description, so the choice in context of this issue is between wrong decompilation and decompilation with pcode ops that can't be simplified out.

GhidorahRex commented 1 year ago

It's a complicated issue. In one processor, we have a flag that allows the switch of emulated vs. decompiled code output. The emulated output is "technically correct" and the decompiled code is simplified. But that sort of thing doesn't necessarily work well for math-heavy applications where you're trying to represent complex arithmetic succinctly in decompiled code and very accurate.

For saturation, the workaround may be to actually implement it. I tried to do that for one of the neon instructions, sqadd, in the sample. It worked very well (decompile was much improved) but the answers weren't quite right because the saturation arithmetic is still funky.

GhidorahRex commented 1 year ago

Also, I should add, one of the only reasons that the decompile was improved is because the tests were using constants. If the values were passed into the functions, most likely the decompilation would be much uglier.

Another solution for saturation is pushing the pcodeop to only the saturation clamping, if we want to avoid a lot of the uglier pcode around that.

Here is my implementation, in all of it's unoptimized glory. Note that this only applies to test 507. A more portable solution, or a few more macros, would be needed:

macro sSatAdd(operand1, operand2, result) {
    local up_limit:8 =  0x7fffffffffffffff;
    local low_limit:8 = 0x8000000000000000;
    local sum:8 = operand1 + operand2;
    local test_up:1 = ((operand1 ^ operand2) & low_limit) == 0;
    local test_low:1 = ((sum ^ operand1) & low_limit) != 0; 
    local res:8 = (zext(test_up & test_low & (operand1 s>= 0)) * up_limit) + (zext(test_up & test_low & (operand1 s< 0)) * low_limit) + (zext(!test_up | !test_low) * sum);
    result = res;
}

:sqadd Rd_VPR128.2D, Rn_VPR128.2D, Rm_VPR128.2D
is b_3131=0 & q=1 & u=0 & b_2428=0xe & advSIMD3.size=3 & b_2121=1 & Rm_VPR128.2D & b_1115=0x1 & b_1010=1 & Rn_VPR128.2D & Rd_VPR128.2D & Zd
{
    local operand1:8 = Rn_VPR128.2D(0);
    local operand2:8 = Rm_VPR128.2D(0);
    sSatAdd(operand1, operand2, Rd_VPR128.2D[0,64]);
    operand1 = Rn_VPR128.2D[64,64];
    operand2 = Rm_VPR128.2D[64,64];
    sSatAdd(operand1, operand2, Rd_VPR128.2D[64,64]);
}

and here's a pcodeop variation:

define pcodeop sSatAdd;

:sqadd Rd_VPR128.2D, Rn_VPR128.2D, Rm_VPR128.2D
is b_3131=0 & q=1 & u=0 & b_2428=0xe & advSIMD3.size=3 & b_2121=1 & Rm_VPR128.2D & b_1115=0x1 & b_1010=1 & Rn_VPR128.2D & Rd_VPR128.2D & Zd
{
    local operand1:8 = Rn_VPR128.2D(0);
    local operand2:8 = Rm_VPR128.2D(0);
    Rd_VPR128.2D[0,64] = sSatAdd(operand1, operand2, 64:1);
    operand1 = Rn_VPR128.2D[64,64];
    operand2 = Rm_VPR128.2D[64,64];
    Rd_VPR128.2D[64,64] = sSatAdd(operand1, operand2, 64:1);
}
CmP-lt commented 1 year ago

Worth to note that scope of this issue is broader than only implementations of some AARCH64 instructions, apparently there are test cases that are decompiled incorrectly despite implementations of tested instructions being [supposedly] correct.

For example, test case 33 (function _inst_33_var_0), despite addp instruction that is used in it having correct implementation, is decompiled as:

undefined8 _inst_33_var_0(void)

{
  _dest_high = 0x5e77ffff556d0000;
  _dest_low = 0;
  _dest_gp = 0;
  _dest_flags = 0;
  return 0;
}

Other example that appears in several test cases is wrong decompilation that references address in register space that is not used anywhere in function's raw pcode as happens in test case 229 (function _inst_229_var_0) that is decompiled as follows:

undefined8 _inst_229_var_0(void)

{
  ulong uVar1;
  short sVar2;
  short sVar3;
  short sVar4;
  short in_register_00005366;

  sVar2 = in_register_00005366 * -0x907 + -0x907;
  uVar1 = (ulong)CONCAT22(in_register_00005366 * -0x3ef9 + -0x3ef9,sVar2);
  sVar3 = in_register_00005366 * 0x29dd + 0x29dd;
  sVar4 = in_register_00005366 * 0xbdf + 0xbdf;
  _dest_high = 0;
  _dest_low = (CONCAT17((char)((ushort)sVar4 >> 8),
                        CONCAT16((char)sVar4,
                                 CONCAT15((char)((ushort)sVar3 >> 8),
                                          CONCAT14((char)sVar3,
                                                   CONCAT13((char)(uVar1 >> 0x18),
                                                            CONCAT12((char)(uVar1 >> 0x10),sVar2)) ))
                                )) ^ 0x6d5b8d83119e1a6b) + 0xa5805610b7e523e9;
  _dest_gp = 0;
  _dest_flags = 0;
  return 0;
}

So I think that discussing all causes of incorrect decompilation of test cases from provided test subject in one issue isn't the best option, instead it makes more sense to create separate issue for each identified cause or group of causes. I plan to do that at least for 2 causes illustrated above, but will first need to do some research on them (in hope to get any insights that might be helpful for diagnosing the issues).

GhidorahRex commented 1 year ago

I agree, there's multiple colliding factors here.

inst_33_var_0 is interesting. One of us probably needs to take a closer look at what's going on there, because the pcode is definitely correct for addp and the calculations are correct in the function, but somehow the right answer isn't getting generated.

The issue with inst_229_var_0 is more straightforward. The register address is used by the raw pcode, but it's incorrect. The calculation for the element in mla is off. It multiplies bits 16-20 by 32 to get the register address, rather than bits 19-20. Fixing the subconstructors starting at line 2728 in AARCH64instructions.sinc will fix the issue:

@if DATA_ENDIAN == "little"
Re_VPR128Lo.H.sel: Re_VPR128Lo, val is Re_VPR128Lo & b_2223=2 & b_2121 & b_1111 [ val = 0x5000 + 32*Re_VPR128Lo + (b_1111 * 2 + b_2121)*2; ] { export *[register]:2 val; }
Re_VPR128Lo.H.sel: Re_VPR128Lo, val is Re_VPR128Lo & b_2223=1 & b_2121 & b_1111 & b_2020 [ val = 0x5000 + 32*Re_VPR128Lo + (b_1111*4 + b_2121*2 + b_2020)*2; ] { export *[register]:2 val; }
Re_VPR128Lo.H.sel: Re_VPR128Lo, val is Re_VPR128Lo & b_2223=0 & b_2121 & b_1111 & b_2020 [ val = 0x5000 + 32*Re_VPR128Lo + (b_1111*4 + b_2121*2 + b_2020)*2; ] { export *[register]:2 val; }
@else
Re_VPR128Lo.H.sel: Re_VPR128Lo, val is Re_VPR128Lo & b_2223=2 & b_2121 & b_1111 [ val = 0x501e + 32*Re_VPR128Lo - (b_1111 * 2 + b_2121)*2; ] { export *[register]:2 val; }
Re_VPR128Lo.H.sel: Re_VPR128Lo, val is Re_VPR128Lo & b_2223=1 & b_2121 & b_1111 & b_2020 [ val = 0x501e + 32*Re_VPR128Lo - (b_1111*4 + b_2121*2 + b_2020)*2; ] { export *[register]:2 val; }
Re_VPR128Lo.H.sel: Re_VPR128Lo, val is Re_VPR128Lo & b_2223=0 & b_2121 & b_1111 & b_2020 [ val = 0x501e + 32*Re_VPR128Lo - (b_1111*4 + b_2121*2 + b_2020)*2; ] { export *[register]:2 val; }
@endif
CmP-lt commented 1 year ago

The register address is used by the raw pcode, but it's incorrect.

You are right, this one is an oversight from my side, thanks for correction. So this cause is also something to be fixed in sleigh. For completeness here is the list of test cases that have wrong decompilation due to this cause:

CmP-lt commented 1 year ago

It appears that several cases, including _inst_33_var_0, have been fixed by changes in decompiler since 10.3 (most likely in 8977840661f66050055bf17985864fb5973d07ff). Fixed cases include: 33, 248, 788, 830, 1054, 1058, 1081, 1085, 1087.

But it's also worth to add that case _inst_27_var_0 that also tests addp instruction (but with different operand types) decompiles not as expected both in 10.3 and with latest changes to decompiler. It's decompilation looks correct, but something prevents complete simplification of upper 8 bytes:

undefined8 _inst_27_var_0(void)

{
  undefined auVar1 [16];

  auVar1[8] = 0xac;
  auVar1._0_8_ = 0x47ba0b5ab88a1aa1;
  auVar1[9] = 0xc1;
  auVar1[10] = 0xb;
  auVar1[11] = 0x81;
  auVar1[12] = 0x3e;
  auVar1[13] = 0x62;
  auVar1[14] = 0x68;
  auVar1[15] = 0xee;
  _dest_high = auVar1._8_8_ + 0x11979dc17ef43e54;
  _dest_low = 0;
  _dest_gp = 0;
  _dest_flags = 0;
  return 0;
}
GhidorahRex commented 1 year ago

I fixed the mlaregister issue in de6ff8440dcbb315f55343856a15cbdaf79a00e5

rfalke commented 1 year ago

For ghidra_10.1.2_PUBLIC the summary is:

correct (with the expected zeros):         422
incorrect (with some constant):            43
incorrect (with some complex computation): 623

For ghidra_10.3.2_PUBLIC the summary is:

correct (with the expected zeros):         445
incorrect (with some constant):            44
incorrect (with some complex computation): 599

The new file ghidra_armv8_issues_ghidra_10.3.2_PUBLIC.txt has the information about the 44 cases where the computation is wrong but not so complex.

CmP-lt commented 1 year ago

Most of remaining cases from group "incorrect (with some constant)" can at best be partially fixed to fall into category of functions that are decompiled with computations by adding various pcodeops to represent rounding modes. But even such partial measure should be better than nothing, because currently those instructions simply have missing operations (i.e. rounding) in their implementations.