Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

LLVM 15 times slower than gcc compiling the erlang VM-core #21184

Open Quuxplusone opened 10 years ago

Quuxplusone commented 10 years ago
Bugzilla Link PR21185
Status NEW
Importance P normal
Reported by Frej Drejhammar (frej@sics.se)
Reported on 2014-10-07 08:47:39 -0700
Last modified on 2014-10-10 07:38:45 -0700
Version trunk
Hardware PC Linux
CC hfinkel@anl.gov, Lars.Rasmusson@sics.se, llvm-bugs@lists.llvm.org, rafael@espindo.la, rnk@google.com
Fixed by commit(s)
Attachments beam_emu-clang.i (858075 bytes, application/octet-stream)
beam_emu-clang.bc (191404 bytes, application/octet-stream)
dump.dot (22240 bytes, application/dot)
0001-LiveVariables-isLiveOut-Avoid-expensive-sort-if-poss.patch (1701 bytes, text/plain)
0001-LiveVariables-isLiveOut-Avoid-expensive-sort-if-poss.patch (1692 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 13159
Preprocessed input

[This bug report is on LLVM and Clang 3.5 from the git mirror, (LLVM:
 df820bfd87d7d70d8f1a739333496e191631a3e2; Clang:
 870571e52b265a9a471196b1cda93445c9262541;)
 release-compiled without assertions]

Compiling the core interpreter module of the Erlang VM [1] triggers
non-linear behaviour of LLVM's code generator.

The C-module in question is erts/emulator/beam/beam_emu.c in [1], for
convenience i have attached the preprocessed input to this bug-report.

Compiling beam_emu.c using clang -O3 takes roughly 45s, compiling it
using gcc -O3 takes 3s. Running:

clang -O3 -emit-llvm beam_emu-clang.i -o beam_emu.bc takes 2s

llc -O3 llc -O3 beam_emu-clang.bc takes 43s [bitcode attached]

Which indicates that the slow-down is in the common code generation
code.

Using a profile-enabled llc I can see that (60%) of the runtime is
spent inside LiveVariables::isLiveOut() as called from
PHIElimination::runOnMachineFunction().

For analyzing this bug it may help to know that beam_emu.c implements
a directly threaded interpreter. For each VM instruction there is a
first-class label followed by code implementing the instruction (much
of it automatically generated). Each instruction ends with a goto*
which branches to the next VM instruction.

As part of the BEAMJIT-project[2] we extend the VM by adding
additional entry points. The resulting three-fold increase in the
number of basic blocks increases the run-time to 30min, which may
serve as a hint to what is wrong.

[1] http://www.erlang.org/download/otp_src_17.3.tar.gz
erts/emulator/beam/beam_emu.c

[2] http://llvm.org/devmtg/2014-04/PDFs/Talks/drejhammar.pdf
Quuxplusone commented 10 years ago

Attached beam_emu-clang.i (858075 bytes, application/octet-stream): Preprocessed input

Quuxplusone commented 10 years ago

Attached beam_emu-clang.bc (191404 bytes, application/octet-stream): Input bitcode

Quuxplusone commented 10 years ago

[Fix typos in commands to replicate the bug]

clang -O3 -emit-llvm beam_emu-clang.i -o beam_emu-clang.bc

llc -O3 beam_emu-clang.bc

Quuxplusone commented 10 years ago
It is a well-known problem that the register allocator has quadratic complexity
w.r.t. the number of basic blocks:
http://llvm.org/bugs/show_bug.cgi?id=12652
http://llvm.org/bugs/show_bug.cgi?id=17409

Unfortunately, Jakob, the primary author of the register allocator, isn't
working on LLVM these days, so it's unlikely that this will be addressed any
time soon. :(
Quuxplusone commented 10 years ago

Although the register allocator is a problem, profiling while running this example the dominant part of the execution time (66%) goes to PHI-elimination, 13% to the coalescer and only 4% to register allocation (unless you consider all those as register-allocation). Also the profile does not look at all like the profiles in bugs http://llvm.org/bugs/show_bug.cgi?id=12652 and http://llvm.org/bugs/show_bug.cgi?id=17409

I will attach a .dot-file with the annotated call graph (produced by gprof2dot[1]) for running this input.

[1] https://code.google.com/p/jrfonseca/wiki/Gprof2Dot

Quuxplusone commented 10 years ago

Attached dump.dot (22240 bytes, application/dot): Annotated call graph with profiling result

Quuxplusone commented 10 years ago
(In reply to comment #4)
> Although the register allocator is a problem, profiling while running this
> example the dominant part of the execution time (66%) goes to
> PHI-elimination, 13% to the coalescer and only 4% to register allocation
> (unless you consider all those as register-allocation). Also the profile
> does not look at all like the profiles in bugs
> http://llvm.org/bugs/show_bug.cgi?id=12652 and
> http://llvm.org/bugs/show_bug.cgi?id=17409
>
> I will attach a .dot-file with the annotated call graph (produced by
> gprof2dot[1]) for running this input.
>
> [1] https://code.google.com/p/jrfonseca/wiki/Gprof2Dot

Interesting, then this may fixable.
Quuxplusone commented 10 years ago

s/NewPC/PC

Quuxplusone commented 10 years ago

Attached 0001-LiveVariables-isLiveOut-Avoid-expensive-sort-if-poss.patch (1701 bytes, text/plain): LiveVariables::isLiveOut: Avoid expensive sort if possible

Quuxplusone commented 10 years ago

Attached 0001-LiveVariables-isLiveOut-Avoid-expensive-sort-if-poss.patch (1692 bytes, text/plain): LiveVariables::isLiveOut: Avoid expensive sort if possible

Quuxplusone commented 10 years ago

Please make sure that you send patches to the llvm-commits list, or use reviews.llvm.org (see http://llvm.org/docs/Phabricator.html), for review. Patches posted to bug reports will likely be ignored. When you post for review, please write a description of the problem and the proposed solution (don't just say, for example, "Fix for PR21185"). Please see http://llvm.org/docs/DeveloperPolicy.html for more information. Thanks!