Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[Compile Time] 50% of time spent in Elim phi nodes #43219

Closed Quuxplusone closed 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44249
Status RESOLVED FIXED
Importance P enhancement
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2019-12-08 06:20:44 -0800
Last modified on 2020-02-05 15:17:50 -0800
Version trunk
Hardware PC Linux
CC david.bolvansky@gmail.com, llvm-bugs@lists.llvm.org, paulsson@linux.vnet.ibm.com, quentin.colombet@gmail.com, uweigand@de.ibm.com, yuanfang.chen@sony.com
Fixed by commit(s)
Attachments tc.tar.gz (224308 bytes, application/gzip)
Z_phielim.tar.gz (563159 bytes, application/gzip)
Blocks
Blocked by
See also
clang -O3 -march=z13 tc_ctime_elimphi_gvn.i -o a.out -w -ftime-report

---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name
---
230.3186 ( 51.8%)   0.1120 ( 16.9%)  230.4306 ( 51.7%)  232.3138 ( 51.7%)
Eliminate PHI nodes for register allocation
101.7915 ( 22.9%)   0.1046 ( 15.8%)  101.8961 ( 22.9%)  102.6262 ( 22.8%)
Global Value Numbering
 17.3571 (  3.9%)   0.0178 (  2.7%)  17.3749 (  3.9%)  17.5102 (  3.9%)  Memory SSA #2
  8.7322 (  2.0%)   0.0129 (  2.0%)   8.7451 (  2.0%)   8.8148 (  2.0%)  Greedy Register Allocator
  8.4621 (  1.9%)   0.0887 ( 13.4%)   8.5508 (  1.9%)   8.6025 (  1.9%)  SystemZ DAG->DAG Pattern Instruction Selection

gcc -O3: 74s
clang -O3: 446s

(csmith)
Quuxplusone commented 4 years ago

Please attach tc_ctime_elimphi_gvn.i.

Quuxplusone commented 4 years ago

Attached tc.tar.gz (224308 bytes, application/gzip): tarball with testcase

Quuxplusone commented 4 years ago

Attached Z_phielim.tar.gz (563159 bytes, application/gzip): reduced testcases in compressed tar file

Quuxplusone commented 4 years ago

If BitVector is used instead of DenseSet for the Defs and Kills sets in LiveVariables::addNewBlock(), I see unchanged compile time on X86 and PPC, while on SystemZ improves from 80 seconds to 56 seconds.

Not sure why BitVector isn't used already, or if it would be a good idea - does anyone know?

Still not sure why there still is such a slowdown for SystemZ (x10). Comparing to X86, the functions seem to be very similar - the same number of basic blocks and PHI instructions. The structure of the PHIs are also the same - there are no PHI users of PHI definitions for either target...

Quuxplusone commented 4 years ago
The reason that the SystemZ backend is slower seems to be because it uses
nearly twice as many virtual registers / instructions than when X-compiling to
x86 (both on SystemZ). This is due to a lot of IMPLICIT_DEFs and
INSERT_SUBREGs, that the SystemZ backend emits (in a seemingly sensible way).
Even in the partially reduced test case (tc_creduce_partial.i), this function
has ~67000 registers in ~1900 blocks and 875 blocks created during critical
edge splitting (SystemZ backend).

PHIElimination suffers from a severe slow down with this huge function. The
reason is apparently that when SplitCriticalEdge() updates LiveVariables by
calling
addNewBlock(), all the registers are iterated over every single time.

The only way to fully remedy this seems to be to pre-compute the live-in
registers for each MBB in an efficient way in PHIElimination.cpp and then pass
that information along to LiveVariabless::addNewBlock(). I have made a patch
for this: https://reviews.llvm.org/D73152.

(The original idea of using a BitVector in addNewBlock() only removed part of
the regression, but was still a good speedup. Since this is reached by other
passes I would assume that this improvement would still be desired, or? That
would be a separate patch...)
Quuxplusone commented 4 years ago

Handled with https://reviews.llvm.org/D73152 (committed as 96ea377).