Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[X86] some builtins generate incorrect code for shifts with large (constant) shift counts #42892

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR43922
Status REOPENED
Importance P normal
Reported by Wolfgang Pieb (Wolfgang_Pieb@playstation.sony.com)
Reported on 2019-11-06 11:16:26 -0800
Last modified on 2019-11-10 03:28:09 -0800
Version trunk
Hardware PC All
CC craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s) rG641d2e5232b423a7dd81afac94dd3db4412a4971
Attachments newtest.ll (652 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 22780
IR in question

The attached IR features a call to a builtin that generates a PSRAD
instruction. The shift count is large, and in such cases the instruction's
definition says that the result should be equal to the operand's sign bit (or
the sign bits of its respective elements) extended to the entire width of the
result (i.e. 0 or -1).

The instruction, however, is emitted with a shift count of 0, yielding a result
identical to the input operand.

        movq    .LCPI0_0(%rip), %mm0    # mm0 = 0x7AAAAAAA7AAAAAAA
        psrad   $0, %mm0

I do not think undefined behaviour is in play here, since I would expect that
the semantics of a builtin are directly derived from the that of the
instruction they represent, though I could be wrong about that.

The IR has been generated from the following C source:

// -----------------------------------------------------
typedef int __v2si __attribute__((__vector_size__(8)));
typedef long long __m64 __attribute__((__vector_size__(8)));

extern "C" int printf(const char *, ...);

int main()
{
            __m64 id17152 = {(long long)0x7aaaaaaa7aaaaaaa};
            int id17156 = 0x10000000;
            __m64 id17151 = (__m64)__builtin_ia32_psradi((__v2si)id17152, id17156);
            printf("id17151 = %llx\n", id17151[0]);
}
// ------------------------------------------------------
gcc 7.4.0 prints 0 for this code.

In a late stage IR dump the PSRAD instruction shows up with the correct shift
count:

renamable $mm0 = MMX_PSRADri killed renamable $mm0(tied-def 0), 268435456
Quuxplusone commented 4 years ago

Attached newtest.ll (652 bytes, text/plain): IR in question

Quuxplusone commented 4 years ago

The PSRAD with immediate instruction only supports 8-bits worth of shift so we masked 268435456 to 8-bits and got 0. The same issue exists for shift left and shift right for the MMX builtin. We do better with SSE shifts.

I'm inclined to fix this by just clamping the shift amount to 31 for all three shift amounts. We could do better for left/right shifts and the result to 0 explicitly, but I'm not sure we care to optimize MMX that well.

Quuxplusone commented 4 years ago
(In reply to Craig Topper from comment #1)
> The PSRAD with immediate instruction only supports 8-bits worth of shift so
> we masked 268435456 to 8-bits and got 0. The same issue exists for shift
> left and shift right for the MMX builtin. We do better with SSE shifts.
>
> I'm inclined to fix this by just clamping the shift amount to 31 for all
> three shift amounts. We could do better for left/right shifts and the result
> to 0 explicitly, but I'm not sure we care to optimize MMX that well.

How tricky would it be to match what we do for SSE for out of bounds values?
Quuxplusone commented 4 years ago

Not that tricky. We need to clamp for psrai. And for the other two emit an i32 zero and an x86isd node to move i32 to mmx. That will pattern match to the pxor zero idiom I think.

Quuxplusone commented 4 years ago

I've committed the simple clamp to 255 for all 8 affected intrinsics in 641d2e5232b423a7dd81afac94dd3db4412a4971. Using 255 avoids needing to decode the element size from the intrinsic.

Simon, do you think I should optimize it to produce 0s when possible? Or should we start working on MMX->SSE

Quuxplusone commented 4 years ago
(In reply to Craig Topper from comment #4)
> I've committed the simple clamp to 255 for all 8 affected intrinsics in
> 641d2e5232b423a7dd81afac94dd3db4412a4971. Using 255 avoids needing to decode
> the element size from the intrinsic.
>
> Simon, do you think I should optimize it to produce 0s when possible? Or
> should we start working on MMX->SSE

Returning 0s for out of range logical shifts should be a relatively small
change, making a dent in [Bug #42320] is likely to be a much bigger issue.