Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[x86] Redundant XORs generated with BSR from "__builtin_clz". #46572

Closed Quuxplusone closed 3 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR47603
Status RESOLVED FIXED
Importance P enhancement
Reported by Tom Hender (ToHe_EMA@gmx.de)
Reported on 2020-09-21 09:49:50 -0700
Last modified on 2021-04-05 03:57:49 -0700
Version trunk
Hardware PC All
CC craig.topper@gmail.com, htmldeveloper@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com
Fixed by commit(s) rG89afec348dbd3e5078f176e978971ee2d3b5dec8, rG36d4f6d7f8ad08bb99da544f2b6ca96e34977839
Attachments
Blocks
Blocked by
See also
Consider the following simple C function to be compiled with Clang:

char Func(unsigned DataOffset)
{
    return 31 ^ __builtin_clz(DataOffset);
}

Unfortunately it compiles to this assembly:

bsr     eax, edi
xor     eax, 31
xor     al, 31
ret

The relevant LLVM IR is as follows:

declare i32 @llvm.ctlz.i32(i32, i1 immarg) #1
define signext i8 @Func(i32 %0) {
  %2 = tail call i32 @llvm.ctlz.i32(i32 %0, i1 true)
  %3 = trunc i32 %2 to i8
  %4 = xor i8 %3, 31
  ret i8 %4
}

Obviously the two XORs should be entirely eliminated in the generated machine
code. This problem only occurs if the return type is a 8 bit integer (char).
For 16 bit, 32 bit and 64 bit integers the xors are elided correctly. Another
variant that leads to the same issue (without an 8 bit integer) is this:

extern char Lookup[32];
int LookupFunc(unsigned DataOffset)
{
    return Lookup[31 ^ __builtin_clz(DataOffset)];
}

With results in the following relevant LLVM IR code:

@Lookup = external dso_local local_unnamed_addr global [32 x i8], align 16
declare i32 @llvm.ctlz.i32(i32, i1 immarg) #1
define i32 @LookupFuncj(i32 %0) {
  %2 = tail call i32 @llvm.ctlz.i32(i32 %0, i1 true)
  %3 = xor i32 %2, 31
  %4 = zext i32 %3 to i64
  %5 = getelementptr inbounds [32 x i8], [32 x i8]* @Lookup, i64 0, i64 %4
  %6 = load i8, i8* %5, align 1
  %7 = sext i8 %6 to i32
  ret i32 %7
}
Quuxplusone commented 3 years ago
SelectionDAG has 13 nodes:
  t0: ch = EntryToken
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t11: i32,i32 = X86ISD::BSR t2
        t13: i32 = xor t11, Constant:i32<31>
      t4: i8 = truncate t13
    t6: i8 = xor t4, Constant:i8<31>
  t9: ch,glue = CopyToReg t0, Register:i8 $al, t6
  t10: ch = X86ISD::RET_FLAG t9, TargetConstant:i32<0>, Register:i8 $al, t9:1

We limit trunc(xor(x,y)) -> xor(trunc(x),trunc(y)) folds to pre-legalization, I
suppose we could handle this in the X86 backend as the CTLZ/BSR+XOR case is
relatively common when we don't have LZCNT?
Quuxplusone commented 3 years ago

Reopening, the second case is still as bad as ever:

https://gcc.godbolt.org/z/Kc4GWEqnh

Quuxplusone commented 3 years ago
(In reply to Simon Pilgrim from comment #2)
> Reopening, the second case is still as bad as ever:
>
> https://gcc.godbolt.org/z/Kc4GWEqnh

Should be fixed by rG36d4f6d7f8ad08bb99da544f2b6ca96e34977839