[CompileTime] GVN takes 60+% of -O3 compile time (was: JumpThreading takes 29% of the wall O3 compile time) #16755

Open Quuxplusone opened 11 years ago

Quuxplusone commented 11 years ago
Bugzilla Link:       PR16756
Status:              REOPENED
Importance:          P normal
Reported by:         Jonathan Ragan-Kelley (jrk@csail.mit.edu)
Reported on:         2013-07-30 17:01:31 -0700
Last modified on:    2019-12-24 23:44:42 -0800
Version:             3.2
Hardware:            Macintosh MacOS X
CC:                  andrew.b.adams@gmail.com, chandlerc@gmail.com, dexonsmith@apple.com, ditaliano@apple.com, efriedma@quicinc.com, florian_hahn@apple.com, joker.eph@gmail.com, jrk@csail.mit.edu, llvm-bugs@lists.llvm.org, michael.hliao@gmail.com, nlewycky@google.com, rafael@espindo.la, rnk@google.com, yuanfang.chen@sony.com
Fixed by commit(s):
Attachments:         CodeGen_ARM.cpp (45497 bytes, application/octet-stream)
                     codegen_arm.bc (660096 bytes, application/octet-stream)
Blocks:
Blocked by:
See also:            PR41240
Created attachment 10959
Source file which causes slow compilation performance

Halide is a not-especially-complicated C++ project (which happens to use LLVM
internally, but that's not the subject of this bug):

  https://github.com/halide/Halide/

Its build process is simple. On most platforms, the GCC and MSVC toolchains
compile all of the C++ source quickly. On OS X (10.8), the stock g++
(llvm-gcc-4.2-based) compiles all but one source file quickly. But on
CodeGen_ARM.cpp
(https://github.com/halide/Halide/blob/master/src/CodeGen_ARM.cpp), it is
pathologically slow (>7 minutes on a 2.8 GHz Core2 Xeon Mac Pro, ~5 minutes on
a Sandy Bridge MacBook Air). The same file compiles in the expected second or
three on any Homebrew GCC version, using the full GNU toolchain.

Clearly, this tickles something serious in the Apple/LLVM toolchain. The
process which chugs for minutes during this is named "clang". The only obvious
potential standout here is the relatively complex stack-allocated array
"patterns" in CodeGen_ARM::visit(const Cast *op).
Quuxplusone commented 11 years ago

Attached CodeGen_ARM.cpp (45497 bytes, application/octet-stream): Source file which causes slow compilation performance

Quuxplusone commented 11 years ago
CodeGen_ARM.cpp:1:10: fatal error: 'CodeGen_ARM.h' file not found
#include "CodeGen_ARM.h"
         ^

Could you preprocess it and upload the .ii?
Quuxplusone commented 11 years ago

Full preprocessed source here: https://gist.github.com/jrk/6117757

Compiled on OS X with: c++ -O3 CodeGen_ARM.ii. (Without -O3, performance is nominal.)

Quuxplusone commented 11 years ago

-ftime-report points at SROA. Probably unhappy because the CFG is extremely complex.
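
For readers unfamiliar with the construct the report singles out, here is a minimal sketch of a stack-allocated pattern table. It is an assumption about the general shape of such code, not an excerpt from CodeGen_ARM.cpp: the struct, its fields, the function name, and the string values are all hypothetical. The point is only that every element with a non-trivial constructor needs its own cleanup path in case a later element's constructor throws, so a long table can plausibly contribute many extra blocks and edges to the function's CFG.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for one entry of a pattern table. The field names
// are illustrative and are not taken from Halide.
struct Pattern {
    std::string name;  // non-trivial constructor and destructor
    int bits;
    bool is_signed;
};

// Counts the table entries that match a bit width. The table lives on the
// stack, so every call constructs each element in turn. Each std::string
// constructor may throw, so the compiler must also emit a cleanup path per
// element that destroys the elements already built.
std::size_t count_patterns_with_bits(int bits) {
    assert(bits > 0 && "bit width must be positive");
    Pattern patterns[] = {
        {"hypothetical.op.a", 8, true},
        {"hypothetical.op.b", 8, false},
        {"hypothetical.op.c", 16, true},
        {"hypothetical.op.d", 32, false},
    };
    std::size_t matches = 0;
    for (const Pattern &p : patterns) {
        if (p.bits == bits) ++matches;
    }
    return matches;
}
```

With only four entries the effect is negligible; whether the real table is large enough to explain the slowdown is not established by this thread.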

Quuxplusone commented 10 years ago
Just profiled this at r206481.

The SROA slowdown is the same as http://llvm.org/bugs/show_bug.cgi?id=17855.
Bottleneck is SSAUpdater.

Running Time         Self (ms)   Symbol Name
64684.0ms   43.5%        0.0     (anonymous namespace)::SROA::runOnFunction(llvm::Function&)
64677.0ms   43.5%        1.0     llvm::LoadAndStorePromoter::run(llvm::SmallVectorImpl<llvm::Instruction*> const&) const
64676.0ms   43.5%        0.0     llvm::SSAUpdater::GetValueInMiddleOfBlock(llvm::BasicBlock*)
64676.0ms   43.5%        0.0     llvm::SSAUpdater::GetValueAtEndOfBlockInternal(llvm::BasicBlock*)
64672.0ms   43.5%        9.0     llvm::SSAUpdaterImpl<llvm::SSAUpdater>::GetValue(llvm::BasicBlock*)
64379.0ms   43.3%    63469.0     llvm::SSAUpdaterImpl<llvm::SSAUpdater>::FindAvailableVals(llvm::SmallVectorImpl<llvm::SSAUpdaterImpl<llvm::SSAUpdater>::BBInfo*>*)

However, even more time is spent in CorrelatedValuePropagation.  The bottleneck
there is LVIValueHandle::deleted().

Running Time         Self (ms)   Symbol Name
68279.0ms   46.0%       30.0     (anonymous namespace)::CorrelatedValuePropagation::runOnFunction(llvm::Function&)
65876.0ms   44.3%       10.0     llvm::Value::replaceAllUsesWith(llvm::Value*)
65856.0ms   44.3%       12.0     llvm::ValueHandleBase::ValueIsRAUWd(llvm::Value*, llvm::Value*)
65818.0ms   44.3%    65474.0     (anonymous namespace)::LVIValueHandle::deleted()

I was going to close PR17855 as a dup, but now I'm thinking this PR should
track the CorrelatedValuePropagation bottleneck, while PR17855 tracks the SROA
slowdown.
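
The profile above shows nearly all of the time as self time in a callback that runs once per replaced value. One common way such a callback becomes expensive is by scanning an entire cache on every invocation. The toy model below illustrates that cost pattern under that assumption; it does not reproduce LLVM's data structures, and every name in it is hypothetical. Integers stand in for basic blocks and values.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy model only: (block id, value id) pairs mimic a per-block value cache.
using Entry = std::pair<int, int>;

// Strategy A: on deletion, scan the whole cache for entries naming the value.
// Returns the number of entries visited, as a rough cost measure.
std::size_t erase_by_scan(std::set<Entry> &cache, int value) {
    std::size_t visited = 0;
    for (auto it = cache.begin(); it != cache.end();) {
        ++visited;
        if (it->second == value) it = cache.erase(it);
        else ++it;
    }
    return visited;
}

// Strategy B: keep a side index from value to the blocks that cache it, so a
// deletion touches only that value's own entries.
struct IndexedCache {
    std::set<Entry> cache;
    std::unordered_map<int, std::vector<int>> blocks_of_value;

    void insert(int block, int value) {
        if (cache.insert(Entry(block, value)).second)
            blocks_of_value[value].push_back(block);
    }

    std::size_t erase_value(int value) {
        auto found = blocks_of_value.find(value);
        if (found == blocks_of_value.end()) return 0;
        std::size_t visited = 0;
        for (int block : found->second) {
            ++visited;
            cache.erase(Entry(block, value));
        }
        blocks_of_value.erase(found);
        return visited;
    }
};

// Deletes every value once and returns the total entries visited by each
// strategy: first = full scan, second = indexed.
std::pair<std::size_t, std::size_t> compare_costs(int num_values) {
    std::set<Entry> plain;
    IndexedCache indexed;
    for (int v = 0; v < num_values; ++v) {
        plain.insert(Entry(v, v));  // one cached entry per value
        indexed.insert(v, v);
    }
    std::size_t scan_cost = 0, index_cost = 0;
    for (int v = 0; v < num_values; ++v) {
        scan_cost += erase_by_scan(plain, v);
        index_cost += indexed.erase_value(v);
    }
    assert(plain.empty() && indexed.cache.empty());
    return {scan_cost, index_cost};
}
```

In this model, deleting n values costs n(n+1)/2 visits with the full scan and n visits with the index, so the gap widens quickly on functions with many values. Whether the real callback behaves this way would need to be checked against the LLVM source at the profiled revision.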
Quuxplusone commented 9 years ago
r245820 fixes the SROA issue and improves -O3 compile-time from 113s to 12s on
my machine.

Top5 is now:

Running Time         Self (ms)   Symbol Name
 3445.0ms   29.4%        5.0     (anonymous namespace)::JumpThreading::runOnFunction(llvm::Function&)
 1483.0ms   12.6%        2.0     (anonymous namespace)::GVN::runOnFunction(llvm::Function&)
 1063.0ms    9.0%       17.0     (anonymous namespace)::CorrelatedValuePropagation::runOnFunction(llvm::Function&)
  697.0ms    5.9%        0.0     (anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&)
  589.0ms    5.0%        4.0     (anonymous namespace)::RegisterCoalescer::runOnMachineFunction(llvm::MachineFunction&)
Quuxplusone commented 9 years ago

Mehdi, sounds like this is fixed?

Quuxplusone commented 9 years ago
As of r253350, nothing has changed since my last update: ~30% of the total
clang invocation time is still spent in JumpThreading.
Did you get different measurements? I measured on OS X.
Quuxplusone commented 9 years ago
(In reply to comment #7)
> As of r253350, nothing changed since my last update, still ~30% of the total
> clang invocation is spent in JumpThreading.
> Did you get different measurements? I measured on OS X.

No, but the original report was that the file took 7m to compile. You said it
takes 12s now, despite the fact that 30% of the time is in JumpThreading. If
you want to leave it open to track future improvements to JumpThreading or GVN,
go for it.
Quuxplusone commented 9 years ago
Oh I see, yes that was my idea when I did the previous measurements.
Do you think I should have closed this bug as fixed and opened a new one for
JumpThreading?
Quuxplusone commented 5 years ago
The majority of the compile time is now spent in GVN. Updating the title to
reflect that. Also attached codegen_arm.bc.

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 10.1732 seconds (10.1731 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.4170 ( 64.2%)   0.0243 ( 14.0%)   6.4413 ( 63.3%)   6.4422 ( 63.3%)  Global Value Numbering
   0.6697 (  6.7%)   0.0046 (  2.7%)   0.6743 (  6.6%)   0.6743 (  6.6%)  Value Propagation
   0.6127 (  6.1%)   0.0064 (  3.7%)   0.6191 (  6.1%)   0.6191 (  6.1%)  Jump Threading
   0.2233 (  2.2%)   0.0035 (  2.0%)   0.2268 (  2.2%)   0.2268 (  2.2%)  Function Integration/Inlining
   0.1531 (  1.5%)   0.0018 (  1.0%)   0.1550 (  1.5%)   0.1549 (  1.5%)  Jump Threading #2
Quuxplusone commented 5 years ago

Attached codegen_arm.bc (660096 bytes, application/octet-stream): codegen_arm.bc, reproducer for compile time issue