
[X86] Instruction reordering prevents EFLAGS folding #32334

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link: PR33362
Status: NEW
Importance: P enhancement
Reported by: Simon Pilgrim (llvm-dev@redking.me.uk)
Reported on: 2017-06-08 09:19:16 -0700
Last modified on: 2017-09-12 05:57:55 -0700
Version: trunk
Hardware: PC Windows NT
CC: andrea.dibiagio@gmail.com, ditaliano@apple.com, filcab@gmail.com, jatin.bhateja@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@ndave.org, marina.yatsina@intel.com, spatel+llvm@rotateright.com
Fixed by commit(s):
Attachments:
Blocks:
Blocked by:
See also:
https://godbolt.org/g/l1T1da

Swapping two instructions prevents the folding of an EFLAGS result, requiring
an additional comparison to regenerate it. Ideally we'd recognise which
instructions are actually using the EFLAGS result:

define i32 @clz_i128(i64, i64)  {
  %3 = tail call i64 @llvm.ctlz.i64(i64 %1, i1 false)
  %4 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 false)
  %5 = icmp ne i64 %0, 0
  %6 = select i1 %5, i64 0, i64 %3
  %7 = add nuw nsw i64 %6, %4
  %8 = trunc i64 %7 to i32
  ret i32 %8
}
define i32 @clz_i128_swap(i64, i64)  {
  %3 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 false)    <-- SWAP
  %4 = tail call i64 @llvm.ctlz.i64(i64 %1, i1 false)    <-- SWAP
  %5 = icmp ne i64 %0, 0
  %6 = select i1 %5, i64 0, i64 %4
  %7 = add nuw nsw i64 %6, %3
  %8 = trunc i64 %7 to i32
  ret i32 %8
}
declare i64 @llvm.ctlz.i64(i64, i1)

clz_i128:
        lzcntq  %rsi, %rcx
        xorl    %edx, %edx
        lzcntq  %rdi, %rax
        cmovael %edx, %ecx
        addl    %ecx, %eax
        retq
clz_i128_swap:
        lzcntq  %rdi, %rax
        lzcntq  %rsi, %rcx
        xorl    %edx, %edx
        testq   %rdi, %rdi    <-- EXTRA TEST
        cmovnel %edx, %ecx
        addl    %ecx, %eax
        retq
Quuxplusone commented 7 years ago
Analysis:

1/ The pre-RA instruction scheduling algorithm, which scans instructions
bottom-up, by default picks instructions in source order from among those
available to be scheduled in the next cycle. Of course, instructions become
available in the priority queue only after they have satisfied their
precedence and resource constraints.

Thus LZCNT64rr <second_argument>, which corresponds to the following IR
    %4 = tail call i64 @llvm.ctlz.i64(i64 %1, i1 false)    <-- SWAP

is scheduled later (remember, the algorithm works bottom-up, so the final
schedule is the reverse of the order in which it picks instructions).
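
For illustration only, here is a minimal standalone C++ sketch of the picking
behaviour described in 1/ (this is not LLVM's scheduler code; the types and
names are made up): a bottom-up list scheduler pops available nodes preferring
the one latest in source order, and the final schedule is the reverse of that
pop order.

#include <cstdio>
#include <queue>
#include <vector>

// Hypothetical stand-in for a scheduling unit; only the source position matters here.
struct SchedNode {
  int SourceOrder;   // position of the instruction in source/IR order
  const char *Name;
};

// Bottom-up priority: among available nodes, pop the one latest in source
// order, so that reversing the picked sequence yields the source order.
struct PreferLaterInSource {
  bool operator()(const SchedNode &A, const SchedNode &B) const {
    return A.SourceOrder < B.SourceOrder;
  }
};

int main() {
  std::priority_queue<SchedNode, std::vector<SchedNode>, PreferLaterInSource> Available;
  // Both LZCNTs of clz_i128_swap become available together (no dependence
  // between them), so the tie is broken purely by source order.
  Available.push({0, "lzcntq %rdi, %rax"});  // %3 = ctlz(%0)
  Available.push({1, "lzcntq %rsi, %rcx"});  // %4 = ctlz(%1)

  std::vector<SchedNode> BottomUp;
  while (!Available.empty()) {               // bottom-up picking order
    BottomUp.push_back(Available.top());
    Available.pop();
  }
  // The final schedule is the reverse of the bottom-up picking order.
  for (auto It = BottomUp.rbegin(); It != BottomUp.rend(); ++It)
    std::printf("%s\n", It->Name);
  return 0;
}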

2/ The peephole optimizer, which kicks in later in the pipeline, could fold the
TEST in the first case (clz_i128), rewriting the CMOVNE into a CMOVAE, because
the LZCNT operating on the first argument was adjacent to the TEST instruction
and both are EFLAGS-defining MIs.

3/ In the second function (clz_i128_swap), because the llvm.ctlz calls are
swapped, the LZCNT operating on the first argument is no longer adjacent to the
TEST, for the reason mentioned in point 1/, which is why the flag folding could
not happen.
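
To make the folding in 2/ concrete: LZCNT sets CF exactly when its source
operand is zero, so "ae" (CF == 0) after lzcntq %rdi, %rax is the same
predicate as "ne" (ZF == 0) after testq %rdi, %rdi. A small standalone C++
model of that equivalence (purely illustrative, not compiler code):

#include <cassert>
#include <cstdint>
#include <initializer_list>

// CMOVNE after "testq %rdi, %rdi": TEST sets ZF when (rdi & rdi) == 0, and
// the cmov fires when ZF == 0, i.e. when rdi is nonzero.
static bool cmovCondViaTest(uint64_t Rdi) {
  bool ZF = (Rdi & Rdi) == 0;
  return !ZF;
}

// CMOVAE after "lzcntq %rdi, %rax": LZCNT sets CF when its source is zero,
// and the cmov fires when CF == 0, i.e. when rdi is nonzero.
static bool cmovCondViaLzcnt(uint64_t Rdi) {
  bool CF = (Rdi == 0);
  return !CF;
}

int main() {
  for (uint64_t V : {UINT64_C(0), UINT64_C(1), UINT64_C(42), UINT64_MAX})
    assert(cmovCondViaTest(V) == cmovCondViaLzcnt(V));
  return 0;
}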
Quuxplusone commented 7 years ago

I just modified the heuristic, under the option -pre-RA-sched=guided-src, to
take into consideration already-scheduled instructions which share their
operands with the next instruction to be chosen from the available queue.
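
Roughly, the tie-break looks like the following standalone C++ sketch (the real
change is in the pre-RA scheduler patch; all types and names below are
hypothetical, for illustration only): prefer an available node that shares an
operand register with something already scheduled, so the flag-producing LZCNT
ends up adjacent to its flag consumer, and otherwise fall back to source order.

#include <unordered_set>
#include <vector>

// Hypothetical candidate node: the registers it reads and its source position.
struct Candidate {
  std::vector<unsigned> OperandRegs;
  int SourceOrder;
};

static bool sharesOperandWithScheduled(
    const Candidate &C, const std::unordered_set<unsigned> &ScheduledRegs) {
  for (unsigned Reg : C.OperandRegs)
    if (ScheduledRegs.count(Reg))
      return true;
  return false;
}

// Tie-break between two available candidates: if exactly one of them shares an
// operand with already-scheduled instructions, prefer it; otherwise keep the
// default source-order preference of the bottom-up scheduler.
static bool preferA(const Candidate &A, const Candidate &B,
                    const std::unordered_set<unsigned> &ScheduledRegs) {
  bool AShares = sharesOperandWithScheduled(A, ScheduledRegs);
  bool BShares = sharesOperandWithScheduled(B, ScheduledRegs);
  if (AShares != BShares)
    return AShares;
  return A.SourceOrder > B.SourceOrder;  // bottom-up: later in source goes first
}

int main() {
  // Already scheduled (bottom-up): the compare-and-cmov group, which reads
  // %rdi (register id 1 in this toy example).
  std::unordered_set<unsigned> ScheduledRegs = {1};
  Candidate LzcntRdi{{1}, 0};  // lzcnt of the first argument, shares %rdi
  Candidate LzcntRsi{{2}, 1};  // lzcnt of the second argument
  // With the guided heuristic the %rdi LZCNT is picked next, so it ends up
  // immediately before the flag consumer in the final (reversed) schedule.
  return preferA(LzcntRdi, LzcntRsi, ScheduledRegs) ? 0 : 1;
}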

This case works; please share your thoughts on this and on what a better heuristic might be.

Thanks.

Quuxplusone commented 7 years ago

https://reviews.llvm.org/D34596

Here is the code diff.

Thanks.