Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Missed conditional tail call opportunity #49094

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR50125
Status NEW
Importance P enhancement
Reported by Hans Wennborg (hans@chromium.org)
Reported on 2021-04-26 06:58:43 -0700
Last modified on 2021-10-16 10:43:28 -0700
Version trunk
Hardware PC Linux
CC craig.topper@gmail.com, david.bolvansky@gmail.com, jhaberman@gmail.com, llvm-bugs@lists.llvm.org, ndesaulniers@google.com, nok.raven@gmail.com, yuanfang.chen@sony.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR50130
Based on an example from https://blog.reverberate.org/2021/04/21/musttail-
efficient-interpreters.html (https://godbolt.org/z/4afYjjPro)

$ cat /tmp/a.c
void *fallback(unsigned char *data);
void *stuff(unsigned char *data);

void *foo(unsigned char *data)
{
  if (*data & 0x1) {
    return fallback(data);
  }
  data++;
  return stuff(data);
}

$ clang -target x86_64-unknown-linux-gnu -Os -S -o - /tmp/a.c
        .text
        .file   "a.c"
        .globl  foo                             # -- Begin function foo
        .type   foo,@function
foo:                                    # @foo
        .cfi_startproc
# %bb.0:                                # %entry
        testb   $1, (%rdi)
        jne     .LBB0_1
# %bb.2:                                # %if.end
        incq    %rdi
        jmp     stuff                           # TAILCALL
.LBB0_1:                                # %if.then
        jmp     fallback                        # TAILCALL
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc

Instead of:

        jne     .LBB0_1
.LBB0_1:
        jmp     fallback

It could just do:

        jne     .LBB0_1

The code in BranchFolding.cpp is supposed to do this:
https://github.com/llvm/llvm-project/blob/e439a463a30833f1c7d366ed722f0f12d1682638/llvm/lib/CodeGen/BranchFolding.cpp#L1520

But it's not firing here.
Quuxplusone commented 3 years ago
https://godbolt.org/z/znGnaaWMn

It fails to fire when you have some instructions between jcc and label.

upb_pf32_1bt:                           # @upb_pf32_1bt
        test    r9b, r9b
        jne     .LBB0_1
        inc     rsi
        jmp     dispatch                        # TAILCALL
.LBB0_1:
        jmp     fallback                        # TAILCALL

Remove ptr += 1 and you have:

upb_pf32_1bt:                           # @upb_pf32_1bt
        test    r9b, r9b
        je      dispatch                        # TAILCALL
        jmp     fallback                        # TAILCALL

https://github.com/llvm/llvm-project-staging/blob/8e505ae71e3a493593884c692544f788e9c884ce/llvm/lib/Target/X86/X86InstrInfo.cpp#L2912

So maybe this is place to handle this case?

cc Craig.
Quuxplusone commented 3 years ago

A comment on HN (https://news.ycombinator.com/item?id=26934616) suggests that it is related to the following code:

https://github.com/llvm/llvm-project/blob/d480f968ad8b56d3ee4a6b6df5532d485b0ad01e/llvm/lib/CodeGen/BranchFolding.cpp#L1521-L1523

On the other hand, the example above indicates that this is not optimizing even in -Os, so maybe this is a red herring.

Quuxplusone commented 3 years ago
PredTBB == MBB is false for case in OP.

if I dump PredTBB after analyseBranch:

bb.2.if.end:
; predecessors: %bb.0
  liveins: $rdi
  renamable $rdi = nuw INC64r killed renamable $rdi(tied-def 0), implicit-def dead $eflags
  TCRETURNdi64 @stuff, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi

        .globl  foo                             # -- Begin function foo
        .type   foo,@function
foo:                                    # @foo
        .cfi_startproc
# %bb.0:                                # %entry
        testb   $1, (%rdi)
        jne     .LBB0_1
# %bb.2:                                # %if.end
        incq    %rdi
        jmp     stuff                           # TAILCALL
.LBB0_1:                                # %if.then
        jmp     fallback                        # TAILCALL
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc
                                        # -- End function
        .ident  "clang version 13.0.0 (https://github.com/llvm/llvm-project 942d2e19e11d6d65bf147ecab8513b7e8bd5d7e4)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

If I disable this check (too conversative?) I get:
foo:                                    # @foo
        .cfi_startproc
# %bb.0:                                # %entry
        testb   $1, (%rdi)
        je      fallback                        # TAILCALL
# %bb.1:                                # %if.end
        incq    %rdi
        jmp     stuff                           # TAILCALL

Looks good.
Quuxplusone commented 3 years ago

... but it should be jne fallback. So something bad happens in analyzeBranch

Quuxplusone commented 3 years ago

_Bug 50130 has been marked as a duplicate of this bug._

Quuxplusone commented 3 years ago
I dug into this a bit.  The full function prior to the branch folding pass is:

---

Function:
# Machine code for function foo: NoPHIs, TracksLiveness, NoVRegs,
TiedOpsRewritten
Function Live Ins: $rdi

bb.0.entry:
  successors: %bb.2(0x7fef9fcb), %bb.1(0x00106035); %bb.2(99.95%), %bb.1(0.05%)
  liveins: $rdi
  TEST8mi renamable $rdi, 1, $noreg, 0, $noreg, 1, implicit-def $eflags :: (load (s8) from %ir.data, !tbaa !3)
  JCC_1 %bb.2, 4, implicit killed $eflags

bb.1.if.then:
; predecessors: %bb.0
  liveins: $rdi
  TCRETURNdi64 @fallback, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi

bb.2.if.end:
; predecessors: %bb.0
  liveins: $rdi
  renamable $rdi = nuw ADD64ri8 killed renamable $rdi(tied-def 0), 1, implicit-def dead $eflags
  TCRETURNdi64 @stuff, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi

---

At this point, the code is arranged such that "fallback" is the fallthrough,
corresponding to something like:

# bb.0:
    test  BYTE PTR [rdi],0x1
    je    bb.2
# bb.1
    jmp   fallback
bb.2:
    add   rdi,0x1
    jmp   stuff

The branch folding optimization at https://github.com/llvm/llvm-
project/blob/e439a463a30833f1c7d366ed722f0f12d1682638/llvm/lib/CodeGen/BranchFolding.cpp#L1520-
L1554 is looking for a MBB that fits these criteria:

1. MBB has only a single predecessor
2. MBB's first (non-debug) instruction is an unconditional tail call
3. MBB is the target of a conditional branch.

Both bb.1 and bb.2 fail this check:

- bb.1 is not the target of a conditional branch, failing (3)
- bb.2 is not a basic block consisting of a single tail call, failing (2)

Removing the "ptr += 1" makes bb.2 satisfy (2), so the optimization is
performed.

Removing the "PredTBB == MBB" check removes condition (3), which will make bb.1
satisfy the criteria as the fallthrough successor. But this makes the
optimization incorrect because the code gets rewritten to:

# bb.0:
    test  BYTE PTR [rdi],0x1
    je    fallback
# bb.2:
    add   rdi,0x1
    jmp   stuff

I'm not sure the best solution.  One option would be to rewrite as:

# bb.0:
    test  BYTE PTR [rdi],0x1
    jne   fallback
    # In the general case we need this, since we aren't
    # guaranteed that the target is the fallthrough.  But
    # if we know it's the fallthrough, we can omit it.
    jmp   bb.2
bb.2:
    add   rdi,0x1
    jmp   stuff
Quuxplusone commented 3 years ago
Or maybe branch folding needs to run later? ie. after whatever pass is
rewriting from:

    test  BYTE PTR [rdi],0x1
    je    bb.2
    jmp   fallback
bb.2:
    add   rdi,0x1
    jmp   stuff

to the code we see in the output:

    test  BYTE PTR [rdi],0x1
    jne   fb
    add   rdi,0x1
    jmp   stuff
fb:
    jmp   fallback
Quuxplusone commented 3 years ago

When I use -print-after-all, it appears that "block-placement" is performing the rewrite I mentioned in my last comment: https://godbolt.org/z/nbhcj5hGd

But "block-placement" is running after "branch-folder", so the latter never sees the conditional branch in the form that it wants.

Should branch-folder run later? Or run twice? I'm new to LLVM so I don't have a sense for what the right fix is here.

Quuxplusone commented 3 years ago

Here is a better goldbolt link that filters out all the extraneous info: https://godbolt.org/z/ennsKWdTK