Quuxplusone / LLVMBugzillaTest

Compiling with -O1 is slower than -O2 #42054

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43084
Status NEW
Importance P normal
Reported by Haochen He (hehaochen13@nudt.edu.cn)
Reported on 2019-08-22 01:21:13 -0700
Last modified on 2019-12-21 20:34:33 -0800
Version 3.4
Hardware PC Linux
CC dblaikie@gmail.com, echristo@gmail.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com, yuanfang.chen@sony.com
Fixed by commit(s)
Attachments pan.tgz (350255 bytes, application/gzip)
Blocks
Blocked by
See also
Created attachment 22412
from clangBug-14651

See the following results:

##### This is normal #####
-- clang version 3.4.2 --(CentOS Linux release 7.6.1810 Core)
time clang++ -O0 -w tramp3d-v4.cpp  5.951s
time clang++ -O1 -w tramp3d-v4.cpp  9.890s
time clang++ -O2 -w tramp3d-v4.cpp  12.931s
time clang++ -O3 -w tramp3d-v4.cpp  14.078s
time clang++ -Os -w tramp3d-v4.cpp  11.100s
time clang++ -Ofast -w tramp3d-v4.cpp  14.107s

##### This is normal #####
-- Apple clang version 11.0.0 (clang-1100.0.20.17) --(MacOS Mojave 10.14.6)
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O0 -o files pan.c  1.176s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O1 -o files pan.c  12.805s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O2 -o files pan.c  20.955s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O3 -o files pan.c  20.907s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -Os -o files pan.c  17.487s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -Ofast -o files pan.c  20.362s

##### This is NOT normal #####
-- clang version 3.4.2 --(CentOS Linux release 7.6.1810 Core)
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O0 -o files pan.c  1.440s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O1 -o files pan.c  37.666s // This is NOT normal
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O2 -o files pan.c  26.200s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O3 -o files pan.c  26.780s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -Os -o files pan.c  18.114s
time clang -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -Ofast -o files pan.c  26.185s

pan.c is from clang-14651(https://bugs.llvm.org/show_bug.cgi?id=14651), while
tramp3d-v4 is an open source benchmark.

As described in the GCC documentation: "-O2 turns on all optimization flags
specified by -O1 and it also turns on the following optimization flags: xxx,
xxx..." (I think clang behaves similarly; see
https://stackoverflow.com/questions/15548023/clang-optimization-levels).
So compiling with -O1 is not expected to be slower than compiling with -O2,
and I think this may be a performance bug.
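
The inversion in the pan.c numbers can be made explicit with a little arithmetic. What follows is only a minimal sketch in Python, using the clang 3.4.2 (CentOS) timings reported above; the helper name `slowdown` is hypothetical and not part of any tool.

```python
# Wall-clock compile times (seconds) for pan.c with clang 3.4.2 on CentOS,
# copied from the measurements above.
timings = {
    "-O0": 1.440,
    "-O1": 37.666,
    "-O2": 26.200,
    "-O3": 26.780,
    "-Os": 18.114,
    "-Ofast": 26.185,
}


def slowdown(times, lower, higher):
    """Return how many times longer `lower` takes than `higher` (hypothetical helper)."""
    return times[lower] / times[higher]


ratio = slowdown(timings, "-O1", "-O2")
print(f"-O1 takes {ratio:.2f}x the compile time of -O2")
```

A ratio above 1 means the lower optimization level took longer to compile, which is the unexpected case reported here.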
Quuxplusone commented 5 years ago

Attached pan.tgz (350255 bytes, application/gzip): from clangBug-14651

Quuxplusone commented 5 years ago
The problem lies in "Simple Register Coalescing":

clang -ftime-report -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O1 -o files pan.c
===-------------------------------------------------------------------------===
                              Register Allocation
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0372 seconds (0.0373 wall clock)
  ......(fast)

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 1.9975 seconds (1.9931 wall clock)
  ......(fast)

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0012 seconds (0.0012 wall clock)
  ......(fast)

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 36.5249 seconds (36.5250 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  28.0486 ( 77.7%)   0.0018 (  0.4%)  28.0504 ( 76.8%)  28.0517 ( 76.8%)  Simple Register Coalescing (!!PROBLEM!!)
   2.6494 (  7.3%)   0.3306 ( 75.9%)   2.9800 (  8.2%)   2.9801 (  8.2%)  X86 DAG->DAG Instruction Selection
   0.6952 (  1.9%)   0.0118 (  2.7%)   0.7070 (  1.9%)   0.7070 (  1.9%)  Greedy Register Allocator
   0.6730 (  1.9%)   0.0011 (  0.3%)   0.6741 (  1.8%)   0.6742 (  1.8%)  Simplify the CFG
   0.4630 (  1.3%)   0.0026 (  0.6%)   0.4657 (  1.3%)   0.4656 (  1.3%)  Combine redundant instructions
   .....
  36.0891 (100.0%)   0.4358 (100.0%)  36.5249 (100.0%)  36.5250 (100.0%)  Total

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.7598 ( 50.2%)   0.5495 ( 52.1%)  37.3093 ( 50.3%)  37.3111 ( 50.3%)  Clang front-end timer
  36.2154 ( 49.5%)   0.4579 ( 43.4%)  36.6733 ( 49.4%)  36.6750 ( 49.4%)  Code Generation Time
   0.1997 (  0.3%)   0.0475 (  4.5%)   0.2472 (  0.3%)   0.2474 (  0.3%)  LLVM IR Generation Time
  73.1750 (100.0%)   1.0549 (100.0%)  74.2298 (100.0%)  74.2335 (100.0%)  Total

clang -ftime-report -DHC4 -DSAFETY -DNOREDUCE -DNFAIR=3 -O2 -o files pan.c
===-------------------------------------------------------------------------===
                              Register Allocation
===-------------------------------------------------------------------------===
  Total Execution Time: 0.1130 seconds (0.1127 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0514 ( 51.9%)   0.0070 ( 50.2%)   0.0584 ( 51.7%)   0.0580 ( 51.5%)  Global Splitting
   0.0195 ( 19.6%)   0.0025 ( 17.8%)   0.0219 ( 19.4%)   0.0217 ( 19.2%)  Spiller
   0.0147 ( 14.8%)   0.0041 ( 29.4%)   0.0187 ( 16.6%)   0.0191 ( 17.0%)  Evict
   0.0134 ( 13.6%)   0.0001 (  0.6%)   0.0135 ( 12.0%)   0.0135 ( 12.0%)  Seed Live Regs
   0.0001 (  0.1%)   0.0003 (  2.0%)   0.0004 (  0.4%)   0.0004 (  0.4%)  Local Splitting
   0.0991 (100.0%)   0.0139 (100.0%)   0.1130 (100.0%)   0.1127 (100.0%)  Total

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 3.0434 seconds (3.0434 wall clock)

   ......(fast)

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0014 seconds (0.0015 wall clock)
  ......(fast)
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 26.9566 seconds (26.9558 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   5.6477 ( 21.6%)   0.0010 (  0.1%)   5.6487 ( 21.0%)   5.6489 ( 21.0%)  Simple Register Coalescing  (with -O1, this pass takes 28 sec!)
   4.7472 ( 18.2%)   0.6485 ( 76.5%)   5.3957 ( 20.0%)   5.3959 ( 20.0%)  X86 DAG->DAG Instruction Selection
   3.4128 ( 13.1%)   0.0288 (  3.4%)   3.4416 ( 12.8%)   3.4417 ( 12.8%)  Global Value Numbering
   1.7108 (  6.6%)   0.0003 (  0.0%)   1.7112 (  6.3%)   1.7112 (  6.3%)  Eliminate PHI nodes for register allocation
   0.7738 (  3.0%)   0.0004 (  0.0%)   0.7742 (  2.9%)   0.7742 (  2.9%)  Control Flow Optimizer
   0.7101 (  2.7%)   0.0005 (  0.1%)   0.7106 (  2.6%)   0.7106 (  2.6%)  Merge disjoint stack slots
   0.6681 (  2.6%)   0.0002 (  0.0%)   0.6683 (  2.5%)   0.6683 (  2.5%)  Simplify the CFG
   ...

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  26.8342 ( 50.3%)   0.9470 ( 50.7%)  27.7812 ( 50.4%)  27.7824 ( 50.4%)  Clang front-end timer
  26.2796 ( 49.3%)   0.8709 ( 46.6%)  27.1505 ( 49.2%)  27.1517 ( 49.2%)  Code Generation Time
   0.1917 (  0.4%)   0.0501 (  2.7%)   0.2418 (  0.4%)   0.2421 (  0.4%)  LLVM IR Generation Time
  53.3055 (100.0%)   1.8680 (100.0%)  55.1735 (100.0%)  55.1762 (100.0%)  Total
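
To compare the two reports pass by pass, the timing rows can be parsed and the wall times divided. The following is a rough sketch in Python, not a tool that ships with clang: the regular expression assumes the four-column row layout shown above, the names `FIRST_REPORT`, `SECOND_REPORT`, and `wall_times` are made up for illustration, and only the three passes that appear in both excerpts are included.

```python
import re

# Rows copied from the two "Pass execution timing report" tables above
# (reporter annotations after the pass name removed).
# Columns: user, system, user+system, wall, pass name.
FIRST_REPORT = """\
  28.0486 ( 77.7%)   0.0018 (  0.4%)  28.0504 ( 76.8%)  28.0517 ( 76.8%)  Simple Register Coalescing
   2.6494 (  7.3%)   0.3306 ( 75.9%)   2.9800 (  8.2%)   2.9801 (  8.2%)  X86 DAG->DAG Instruction Selection
   0.6730 (  1.9%)   0.0011 (  0.3%)   0.6741 (  1.8%)   0.6742 (  1.8%)  Simplify the CFG
"""
SECOND_REPORT = """\
   5.6477 ( 21.6%)   0.0010 (  0.1%)   5.6487 ( 21.0%)   5.6489 ( 21.0%)  Simple Register Coalescing
   4.7472 ( 18.2%)   0.6485 ( 76.5%)   5.3957 ( 20.0%)   5.3959 ( 20.0%)  X86 DAG->DAG Instruction Selection
   0.6681 (  2.6%)   0.0002 (  0.0%)   0.6683 (  2.5%)   0.6683 (  2.5%)  Simplify the CFG
"""

# One "seconds ( percent%)" column, repeated four times, then the pass name.
COLUMN = r"\s*([\d.]+)\s*\(\s*[\d.]+%\)"
ROW = re.compile(COLUMN * 4 + r"\s+(.+?)\s*$")


def wall_times(report):
    """Map pass name -> wall-clock seconds for every row that matches ROW."""
    times = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            times[match.group(5)] = float(match.group(4))
    return times


first = wall_times(FIRST_REPORT)
second = wall_times(SECOND_REPORT)

# Ratio of first-report wall time to second-report wall time, per shared pass.
ratios = {name: first[name] / second[name] for name in first.keys() & second.keys()}

for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: {first[name]:.4f}s vs {second[name]:.4f}s ({ratio:.2f}x)")
```

On these rows, Simple Register Coalescing is the only pass whose wall time is several times larger in the first report than in the second, which matches the observation above.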
Quuxplusone commented 5 years ago

Working on it.