Closed Quuxplusone closed 10 years ago
Attached biggraph.C (789356 bytes, application/octet-stream): source file showing the problem.
The -ftime-report gives:
$ time clang++ -ftime-report -O1 -c biggraph.C
===-------------------------------------------------------------------------===
Register Allocation
===-------------------------------------------------------------------------===
Total Execution Time: 0.0052 seconds (0.0052 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0026 ( 56.2%) 0.0005 ( 74.2%) 0.0030 ( 58.4%) 0.0030 ( 58.4%) Local Splitting
0.0019 ( 41.7%) 0.0001 ( 22.2%) 0.0021 ( 39.4%) 0.0021 ( 39.4%) Seed Live Regs
0.0001 ( 1.9%) 0.0000 ( 1.5%) 0.0001 ( 1.8%) 0.0001 ( 1.8%) Spiller
0.0000 ( 0.2%) 0.0000 ( 2.1%) 0.0000 ( 0.4%) 0.0000 ( 0.3%) Evict
0.0046 (100.0%) 0.0006 (100.0%) 0.0052 (100.0%) 0.0052 (100.0%) Total
===-------------------------------------------------------------------------===
Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
Total Execution Time: 83.6459 seconds (83.6467 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
80.1686 ( 97.8%) 1.6332 ( 96.8%) 81.8019 ( 97.8%) 81.8027 ( 97.8%) Instruction Scheduling
1.3140 ( 1.6%) 0.0162 ( 1.0%) 1.3302 ( 1.6%) 1.3301 ( 1.6%) Instruction Creation
0.1647 ( 0.2%) 0.0053 ( 0.3%) 0.1700 ( 0.2%) 0.1700 ( 0.2%) DAG Legalization
0.1666 ( 0.2%) 0.0024 ( 0.1%) 0.1689 ( 0.2%) 0.1689 ( 0.2%) Instruction Selection
0.0629 ( 0.1%) 0.0208 ( 1.2%) 0.0837 ( 0.1%) 0.0837 ( 0.1%) Vector Legalization
0.0351 ( 0.0%) 0.0019 ( 0.1%) 0.0370 ( 0.0%) 0.0370 ( 0.0%) DAG Combining 2
0.0298 ( 0.0%) 0.0002 ( 0.0%) 0.0300 ( 0.0%) 0.0300 ( 0.0%) Type Legalization
0.0150 ( 0.0%) 0.0018 ( 0.1%) 0.0168 ( 0.0%) 0.0168 ( 0.0%) DAG Combining 1
0.0023 ( 0.0%) 0.0051 ( 0.3%) 0.0074 ( 0.0%) 0.0074 ( 0.0%) Instruction Scheduling Cleanup
81.9589 (100.0%) 1.6870 (100.0%) 83.6459 (100.0%) 83.6467 (100.0%) Total
===-------------------------------------------------------------------------===
DWARF Emission
===-------------------------------------------------------------------------===
Total Execution Time: 0.0020 seconds (0.0020 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0016 ( 83.6%) 0.0000 ( 72.7%) 0.0017 ( 83.5%) 0.0017 ( 83.4%) DWARF Exception Writer
0.0003 ( 16.4%) 0.0000 ( 27.3%) 0.0003 ( 16.5%) 0.0003 ( 16.6%) DWARF Debug Writer
0.0020 (100.0%) 0.0000 (100.0%) 0.0020 (100.0%) 0.0020 (100.0%) Total
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 84.5942 seconds (84.5949 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
82.0843 ( 99.1%) 1.7079 ( 98.6%) 83.7922 ( 99.1%) 83.7930 ( 99.1%) X86 DAG->DAG Instruction Selection
0.3889 ( 0.5%) 0.0018 ( 0.1%) 0.3906 ( 0.5%) 0.3906 ( 0.5%) Greedy Register Allocator
0.0540 ( 0.1%) 0.0023 ( 0.1%) 0.0563 ( 0.1%) 0.0562 ( 0.1%) Live Variable Analysis
0.0510 ( 0.1%) 0.0004 ( 0.0%) 0.0514 ( 0.1%) 0.0513 ( 0.1%) Machine Common Subexpression Elimination
0.0442 ( 0.1%) 0.0017 ( 0.1%) 0.0459 ( 0.1%) 0.0459 ( 0.1%) X86 AT&T-Style Assembly Printer
0.0176 ( 0.0%) 0.0078 ( 0.4%) 0.0254 ( 0.0%) 0.0254 ( 0.0%) Machine Function Analysis
0.0194 ( 0.0%) 0.0005 ( 0.0%) 0.0199 ( 0.0%) 0.0199 ( 0.0%) Simple Register Coalescing
0.0129 ( 0.0%) 0.0039 ( 0.2%) 0.0167 ( 0.0%) 0.0167 ( 0.0%) Live Interval Analysis
0.0147 ( 0.0%) 0.0000 ( 0.0%) 0.0147 ( 0.0%) 0.0147 ( 0.0%) Virtual Register Rewriter
0.0117 ( 0.0%) 0.0010 ( 0.1%) 0.0128 ( 0.0%) 0.0128 ( 0.0%) Two-Address instruction pass
0.0116 ( 0.0%) 0.0007 ( 0.0%) 0.0124 ( 0.0%) 0.0124 ( 0.0%) Peephole Optimizations
0.0099 ( 0.0%) 0.0022 ( 0.1%) 0.0121 ( 0.0%) 0.0121 ( 0.0%) Prologue/Epilogue Insertion & Frame Finalization
0.0114 ( 0.0%) 0.0000 ( 0.0%) 0.0114 ( 0.0%) 0.0114 ( 0.0%) Machine Copy Propagation Pass
0.0113 ( 0.0%) 0.0000 ( 0.0%) 0.0113 ( 0.0%) 0.0113 ( 0.0%) Combine redundant instructions
0.0108 ( 0.0%) 0.0001 ( 0.0%) 0.0109 ( 0.0%) 0.0109 ( 0.0%) Combine redundant instructions
0.0106 ( 0.0%) 0.0003 ( 0.0%) 0.0109 ( 0.0%) 0.0109 ( 0.0%) Combine redundant instructions
0.0105 ( 0.0%) 0.0000 ( 0.0%) 0.0105 ( 0.0%) 0.0105 ( 0.0%) Combine redundant instructions
0.0104 ( 0.0%) 0.0001 ( 0.0%) 0.0105 ( 0.0%) 0.0105 ( 0.0%) Calculate spill weights
0.0103 ( 0.0%) 0.0001 ( 0.0%) 0.0103 ( 0.0%) 0.0103 ( 0.0%) Combine redundant instructions
0.0076 ( 0.0%) 0.0000 ( 0.0%) 0.0076 ( 0.0%) 0.0076 ( 0.0%) Remove dead machine instructions
0.0064 ( 0.0%) 0.0010 ( 0.1%) 0.0074 ( 0.0%) 0.0074 ( 0.0%) Slot index numbering
0.0058 ( 0.0%) 0.0000 ( 0.0%) 0.0059 ( 0.0%) 0.0059 ( 0.0%) Slot index numbering
0.0048 ( 0.0%) 0.0001 ( 0.0%) 0.0048 ( 0.0%) 0.0048 ( 0.0%) Scalar Replacement of Aggregates (DT)
0.0037 ( 0.0%) 0.0001 ( 0.0%) 0.0038 ( 0.0%) 0.0038 ( 0.0%) Dead Store Elimination
0.0036 ( 0.0%) 0.0000 ( 0.0%) 0.0036 ( 0.0%) 0.0036 ( 0.0%) Post-RA pseudo instruction expansion pass
0.0035 ( 0.0%) 0.0000 ( 0.0%) 0.0036 ( 0.0%) 0.0036 ( 0.0%) Early CSE
0.0033 ( 0.0%) 0.0000 ( 0.0%) 0.0033 ( 0.0%) 0.0033 ( 0.0%) Optimize for code generation
0.0032 ( 0.0%) 0.0000 ( 0.0%) 0.0032 ( 0.0%) 0.0032 ( 0.0%) Early CSE
0.0031 ( 0.0%) 0.0000 ( 0.0%) 0.0031 ( 0.0%) 0.0031 ( 0.0%) Execution dependency fix
0.0023 ( 0.0%) 0.0002 ( 0.0%) 0.0024 ( 0.0%) 0.0024 ( 0.0%) Basic CallGraph Construction
0.0021 ( 0.0%) 0.0000 ( 0.0%) 0.0021 ( 0.0%) 0.0021 ( 0.0%) Sparse Conditional Constant Propagation
0.0019 ( 0.0%) 0.0000 ( 0.0%) 0.0019 ( 0.0%) 0.0019 ( 0.0%) Aggressive Dead Code Elimination
0.0017 ( 0.0%) 0.0000 ( 0.0%) 0.0017 ( 0.0%) 0.0017 ( 0.0%) Simplify the CFG
0.0016 ( 0.0%) 0.0000 ( 0.0%) 0.0017 ( 0.0%) 0.0017 ( 0.0%) Debug Variable Analysis
0.0016 ( 0.0%) 0.0000 ( 0.0%) 0.0016 ( 0.0%) 0.0016 ( 0.0%) X86 FP Stackifier
0.0014 ( 0.0%) 0.0000 ( 0.0%) 0.0014 ( 0.0%) 0.0014 ( 0.0%) Reassociate expressions
0.0012 ( 0.0%) 0.0000 ( 0.0%) 0.0012 ( 0.0%) 0.0012 ( 0.0%) Interprocedural Sparse Conditional Constant Propagation
0.0009 ( 0.0%) 0.0000 ( 0.0%) 0.0009 ( 0.0%) 0.0009 ( 0.0%) Tail Call Elimination
0.0009 ( 0.0%) 0.0000 ( 0.0%) 0.0009 ( 0.0%) 0.0009 ( 0.0%) Process Implicit Definitions
0.0009 ( 0.0%) 0.0000 ( 0.0%) 0.0009 ( 0.0%) 0.0009 ( 0.0%) Expand ISel Pseudo-instructions
0.0008 ( 0.0%) 0.0000 ( 0.0%) 0.0008 ( 0.0%) 0.0008 ( 0.0%) MemCpy Optimization
0.0006 ( 0.0%) 0.0000 ( 0.0%) 0.0006 ( 0.0%) 0.0006 ( 0.0%) Remove unused exception handling info
0.0005 ( 0.0%) 0.0000 ( 0.0%) 0.0005 ( 0.0%) 0.0005 ( 0.0%) Simplify well-known library calls
0.0004 ( 0.0%) 0.0000 ( 0.0%) 0.0004 ( 0.0%) 0.0004 ( 0.0%) Simplify the CFG
0.0001 ( 0.0%) 0.0002 ( 0.0%) 0.0003 ( 0.0%) 0.0003 ( 0.0%) Virtual Register Map
0.0003 ( 0.0%) 0.0000 ( 0.0%) 0.0003 ( 0.0%) 0.0003 ( 0.0%) Simplify the CFG
0.0003 ( 0.0%) 0.0000 ( 0.0%) 0.0003 ( 0.0%) 0.0003 ( 0.0%) Value Propagation
0.0003 ( 0.0%) 0.0000 ( 0.0%) 0.0003 ( 0.0%) 0.0003 ( 0.0%) Lower 'expect' Intrinsics
0.0002 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Insert stack protectors
0.0002 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Inliner for always_inline functions
0.0002 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Simplify the CFG
0.0002 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Simplify the CFG
0.0002 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Global Variable Optimizer
0.0002 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Value Propagation
0.0001 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.0%) 0.0001 ( 0.0%) Live Register Matrix
0.0001 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.0%) 0.0001 ( 0.0%) X86 Maximal Stack Alignment Check
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Deduce function attributes
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Control Flow Optimizer
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine code sinking
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) MachineDominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Natural Loop Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Natural Loop Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Loop Invariant Code Motion
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dead Argument Elimination
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Natural Loop Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Remove unreachable machine basic blocks
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Exception handling preparation
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Eliminate PHI nodes for register allocation
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) MachineDominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Block Frequency Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Tail Duplication
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Post RA top-down list latency scheduler
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Branch Probability Basic Block Placement
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) MachineDominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Branch Probability Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Local Stack Slot Allocation
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Jump Threading
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Strip Unused Function Prototypes
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Evolution Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Bundle Machine CFG Edges
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Stack Slot Coloring
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dominator Tree Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Memory Dependence Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Natural Loop Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Basic Alias Analysis (stateless AA impl)
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Natural Loop Construction
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Jump Threading
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Target Library Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Spill Code Placement Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Merge disjoint stack slots
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Tail Duplication
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Loop Invariant Code Motion
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Lower Garbage Collection Instructions
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Analyze Machine Code For Garbage Collection
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Evolution Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Remove unreachable blocks from the CFG
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Optimize machine instruction PHIs
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Live Stack Slot Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Bundle Machine CFG Edges
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Replacement of Aggregates (SSAUp)
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Lazy Value Information Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Create Garbage Collector Module Metadata
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Memory Dependence Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Delete Garbage Collector Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Target Pass Configuration
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Target Library Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Module Information
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Machine Branch Probability Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Lazy Value Information Analysis
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Basic Alias Analysis (stateless AA impl)
82.8616 (100.0%) 1.7326 (100.0%) 84.5942 (100.0%) 84.5949 (100.0%) Total
===-------------------------------------------------------------------------===
Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
83.1012 ( 50.0%) 1.7594 ( 50.2%) 84.8606 ( 50.0%) 84.8660 ( 50.0%) Clang front-end timer
82.8838 ( 49.9%) 1.7383 ( 49.6%) 84.6221 ( 49.9%) 84.6275 ( 49.9%) Code Generation Time
0.0640 ( 0.0%) 0.0064 ( 0.2%) 0.0704 ( 0.0%) 0.0704 ( 0.0%) LLVM IR Generation Time
166.0490 (100.0%) 3.5041 (100.0%) 169.5531 (100.0%) 169.5639 (100.0%) Total
real 1m24.886s
user 1m23.106s
sys 0m1.766s
Was just putting that in...
I'm assuming -pre-RA-sched=source has the same problem.
I'd like to replace the SelectionDAG (SD) scheduler pass completely with an SD serialization pass. That won't happen for at least another month, but once it does I'll be able to close this.
Appears so and I'm good with that solution.
How about you take this and close then?
Dear Andrew,
We are getting more and more reports of this. Do you have an updated estimate? Your first one ("at least another month") was already correct ;-) Or did you replace the pass and that didn't help?
Cheers, Axel.
Thanks for pinging me on this! I have not been able to work on replacing the SD scheduler pass, so I should have committed a quick fix earlier. I'm waiting on benchmark results, but I should have a quick workaround checked in by Monday at the latest.
Fixed in r205738. I added a workaround to ClusterNeighboringLoads.
time /b/fix/RA/bin/clang++ -O2 -c biggraph.C
real 0m6.143s
user 0m6.016s
sys 0m0.120s
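For readers wondering what a "workaround" of this kind can look like: the thread does not show the change made in r205738, so the following is only a generic sketch, not the actual patch. It assumes the slowdown comes from a scan that, for every load in a large block, walks all users of a shared node, and it shows how capping that walk bounds the total work. The names `Node`, `scanUsersForLoads`, `totalWork`, and the cap value are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a DAG node (not an LLVM type).
struct Node {
    std::vector<std::size_t> users;  // indices of nodes that consume this node
    bool isLoad = false;
};

// Walks the users of `base`, collecting the ones that are loads.
// `maxVisits` caps how many users are examined; 0 means "no cap".
// Returns the number of users actually examined.
std::size_t scanUsersForLoads(const std::vector<Node>& graph, std::size_t base,
                              std::size_t maxVisits,
                              std::vector<std::size_t>& loadsFound) {
    assert(base < graph.size());
    std::size_t visited = 0;
    for (std::size_t u : graph[base].users) {
        if (maxVisits != 0 && visited >= maxVisits)
            break;  // give up on clustering rather than scan everything
        ++visited;
        if (graph[u].isLoad)
            loadsFound.push_back(u);
    }
    return visited;
}

// Builds one shared base node with `numLoads` loads using it, then runs the
// scan once per load and returns the total number of users examined.
std::size_t totalWork(std::size_t numLoads, std::size_t maxVisits) {
    std::vector<Node> graph(numLoads + 1);  // node 0 is the shared base
    for (std::size_t i = 1; i <= numLoads; ++i) {
        graph[i].isLoad = true;
        graph[0].users.push_back(i);
    }
    std::size_t work = 0;
    for (std::size_t i = 1; i <= numLoads; ++i) {
        std::vector<std::size_t> found;
        work += scanUsersForLoads(graph, 0, maxVisits, found);
    }
    return work;
}
```

Without a cap, `totalWork(n, 0)` grows as n squared; with a cap of c it grows as n times c. The trade-off is that a capped scan may miss some clustering opportunities in very large blocks.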