A function with computed goto is compiled slowly and produces a large object file

Quuxplusone commented 14 years ago


Bugzilla Link	PR7362
Status	NEW
Importance	P normal
Reported by	Juhani Viheräkoski (moonshine@kapsi.fi)
Reported on	2010-06-12 09:34:27 -0700
Last modified on	2010-06-26 04:36:11 -0700
Version	trunk
Hardware	PC Linux
CC	bob.wilson@apple.com, efriedma@quicinc.com, llvm-bugs@lists.llvm.org, stoklund@2pi.dk
Fixed by commit(s)
Attachments	`vm.i.gz` (92492 bytes, application/x-gzip) `bugpoint-reduced-simplified.bc` (28960 bytes, application/octet-stream) `vm.bc` (444092 bytes, application/octet-stream)
Blocks
Blocked by
See also

Created attachment 5016
a self-contained testcase

$ time gcc -O2 -c -o vm_gcc.o vm.i

real    0m13.378s
user    0m12.248s
sys 0m0.264s
$ time clang -O2 -c -o vm_clang.o vm.i

real    1m34.942s
user    1m3.066s
sys 0m25.986s
$ size vm_gcc.o vm_clang.o
   text    data     bss     dec     hex filename
  74704     640     164   75508   126f4 vm_gcc.o
 525331    2688     145  528164   80f24 vm_clang.o

This file is from a development version of Guile, which can be built with
clang, but due to this bug it runs at least five times slower than gcc-compiled
version.

Quuxplusone commented 14 years ago

Attached vm.i.gz (92492 bytes, application/x-gzip): a self-contained testcase

Quuxplusone commented 14 years ago

The -O0 code here is okay (only slightly larger than gcc).

It appears the codesize explosion comes from the code generator handling the very large phi nodes from mem2reg badly. Looking over the -O2 generated code, something like half the generated code is spills and reloads. the vast majority of lines are labeled with either "4-byte Reload" or "4-byte Spill".

It's possible there's something subtly wrong with the LiveVariables pass messing up the output here; while messing with the testcase, I found a .bc file which SEGFAULTs llc; I'm currently trying to reduce it with bugpoint, but it's going very slowly. I can attach it if anyone is interested.

Quuxplusone commented 14 years ago

(In reply to comment #1)
> It's possible there's something subtly wrong with the LiveVariables pass
> messing up the output here; while messing with the testcase, I found a .bc
file
> which SEGFAULTs llc; I'm currently trying to reduce it with bugpoint, but it's
> going very slowly.  I can attach it if anyone is interested.

I have filed also #7348 and #7356 which happen on big functions with computed
gotos but result in sig11 (stack overflow)/sig6 (in jump threading). They are
not directly related to this bug, but many of the testcases took a long time to
compile as well before crashing..

Quuxplusone commented 14 years ago

CCing someone who might be able to figure out what's going on here.

Quuxplusone commented 14 years ago

Eli, your crasher doesn't crash for me. Is it fixed, or could you provide an
llc command line as well?
Thanks

Quuxplusone commented 14 years ago

(In reply to comment #5)
> Eli, your crasher doesn't crash for me. Is it fixed, or could you provide an
> llc command line as well?
> Thanks

I can't reproduce the crash anymore.

Quuxplusone commented 14 years ago

Juhani, I can't compile your precompiled file on Darwin. Could you produce a
bitcode file with "clang -O2 -emit-llvm" please?
The slowdown is likely to be in the code generator, as "llc -O2 -time-passes
vm.bc" should tell you.

Quuxplusone commented 14 years ago

Attached bugpoint-reduced-simplified.bc (28960 bytes, application/octet-stream): Bugpoint-reduced crash

Quuxplusone commented 14 years ago

Attached vm.bc (444092 bytes, application/octet-stream): Testcase compiled to .bc

Quuxplusone commented 14 years ago

Thanks, Eli.

It is interesting that PHI elimination takes to long. I thought we had that under control for these indirectbr interpreters.

PHI elimination is uniqueing phi nodes to limit the number of join registers produced. Early tail duplication is generating lots of identical phi nodes.

I'd bet -disable-early-taildup would improve both compile time and code size a lot. Unfortunately, early tail dup is critical to getting good branch prediction.

Quuxplusone commented 14 years ago

(In reply to comment #9)
> Thanks, Eli.
>
> It is interesting that PHI elimination takes to long. I thought we had that
> under control for these indirectbr interpreters.
>
> PHI elimination is uniqueing phi nodes to limit the number of join registers
> produced. Early tail duplication is generating lots of identical phi nodes.
>
> I'd bet -disable-early-taildup would improve both compile time and code size a
> lot. Unfortunately, early tail dup is critical to getting good branch
> prediction.

It does help a lot: it makes llc about 15x faster and reduces the codesize to
something comparable with gcc.  Maybe some heuristic could be added to prevent
tail dup on the blocks in question before regalloc?

Quuxplusone commented 14 years ago

(In reply to comment #10)
> (In reply to comment #9)
> > Thanks, Eli.
> >
> > It is interesting that PHI elimination takes to long. I thought we had that
> > under control for these indirectbr interpreters.
> >
> > PHI elimination is uniqueing phi nodes to limit the number of join registers
> > produced. Early tail duplication is generating lots of identical phi nodes.
> >
> > I'd bet -disable-early-taildup would improve both compile time and code
size a
> > lot. Unfortunately, early tail dup is critical to getting good branch
> > prediction.
>
> It does help a lot: it makes llc about 15x faster and reduces the codesize to
> something comparable with gcc.  Maybe some heuristic could be added to prevent
> tail dup on the blocks in question before regalloc?

Wait, nevermind... it's supposed to help cases like this, isn't it. :)

Quuxplusone commented 14 years ago

Yes, exactly :-)

The frontend funnels all indirect branches through a single basic block, %indirectgoto in this case. This block has 206 ingoing and 183 outgoing CFG edges. This is a managable CFG, but that indirect branch is completely impossible to handle for the CPU branch predictor.

Early tail duplication creates 206 copies of the indirect branch block, each with 183 outgoing CFG edges. This is the CFG explosion that the frontend was trying to avoid, but unfortunately it is necessary to produce good code. With an individual indirect branch, the CPU branch predictor gets to do its Markov chain thing based on the branch instruction address.

Part of the philosophy here is that any function using computed gotos is written with performance in mind, and it is OK to spend some extra compile time on getting good code. It looks like we are failing to do that here, though.

Quuxplusone commented 14 years ago

Sorry about the late response, I was on holiday. I have about 20 *huge* single-
function testcases with computed gotos that all compile to good code so there
is something peculiar in this VM code that causes bad regalloc. I also agree
that VM runtime speed is more important than compile-time speed.

Quuxplusone / LLVMBugzillaTest

A function with computed goto is compiled slowly and produces a large object file #4935