Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Large increase in RISC-V tblgen build times #48937

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR49968
Status NEW
Importance P enhancement
Reported by Simon Pilgrim (llvm-dev@redking.me.uk)
Reported on 2021-04-15 02:57:28 -0700
Last modified on 2021-05-17 11:36:47 -0700
Version trunk
Hardware PC Windows NT
CC asb@lowrisc.org, craig.topper@gmail.com, fraser@codeplay.com, llvm-bugs@lists.llvm.org, paul@windfall-software.com
Fixed by commit(s)
Attachments
Blocks PR28222
Blocked by
See also

The RISC-V tblgen builds have been getting slower and slower, and now appears to be the slowest of the big three (AMDGPU, X86 and RISC-V), especially if I end up having to use debug builds.....

I realise slower builds is almost inevitable as RISC-V develops and adds features, but it'd be great if there was some low hanging cleanup that could be done to get this under control - whether its improvements to tblgen itself or the RISC-V td files get more efficient.

Quuxplusone commented 3 years ago

Thanks for the report Simon. Paul: I'm adding you to the CC on the off-chance you've done any profiling recently of tblgen or had any planned performance improvements in the works (just to avoid duplicated effort).

Quuxplusone commented 3 years ago

No, I haven't done any profiling and I have no performance enhancements on my to-do list. Don't forget the llvm-tblgen -time-phases option, which can give you a quick idea where the time is being spent.

Quuxplusone commented 3 years ago

Do you have some idea which backends are chewing up the most time?

Quuxplusone commented 3 years ago
Adding in some data I've generated from the RISC-V backend and intrinsics. The
other tablegen targets take less than two second a piece.

[8/42] Building RISCVGenGlobalISel.inc...
===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 15.9415 seconds (17.0387 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  14.8536 ( 95.4%)   0.3255 ( 86.4%)  15.1791 ( 95.2%)  15.4707 ( 90.8%)  Backend overall
   0.7109 (  4.6%)   0.0510 ( 13.5%)   0.7619 (  4.8%)   1.5641 (  9.2%)  Parse, build records
   0.0005 (  0.0%)   0.0000 (  0.0%)   0.0005 (  0.0%)   0.0040 (  0.0%)  Write output
  15.5649 (100.0%)   0.3766 (100.0%)  15.9415 (100.0%)  17.0387 (100.0%)  Total

[9/42] Building RISCVGenInstrInfo.inc...
===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 16.0652 seconds (17.1984 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  14.4727 ( 91.9%)   0.2624 ( 81.2%)  14.7351 ( 91.7%)  14.9221 ( 86.8%)  Analyze DAG patterns
   0.7123 (  4.5%)   0.0441 ( 13.7%)   0.7564 (  4.7%)   1.7012 (  9.9%)  Parse, build records
   0.1840 (  1.2%)   0.0166 (  5.1%)   0.2006 (  1.2%)   0.2011 (  1.2%)  Emit helper methods
  15.7420 (100.0%)   0.3232 (100.0%)  16.0652 (100.0%)  17.1984 (100.0%)  Total

[2/5] Building RISCVGenDAGISel.inc...
===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 18.6654 seconds (18.8064 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  13.9062 ( 76.2%)   0.3322 ( 81.5%)  14.2384 ( 76.3%)  14.2747 ( 75.9%)  Parse patterns
   2.2140 ( 12.1%)   0.0000 (  0.0%)   2.2141 ( 11.9%)   2.2195 ( 11.8%)  Convert to matchers
   0.6775 (  3.7%)   0.0520 ( 12.8%)   0.7295 (  3.9%)   0.7349 (  3.9%)  Parse, build records
   0.5580 (  3.1%)   0.0000 (  0.0%)   0.5580 (  3.0%)   0.5593 (  3.0%)  Sort patterns
   0.5381 (  2.9%)   0.0000 (  0.0%)   0.5381 (  2.9%)   0.5394 (  2.9%)  Optimize matchers
   0.3553 (  1.9%)   0.0199 (  4.9%)   0.3753 (  2.0%)   0.3762 (  2.0%)  Emit matcher table
   0.0085 (  0.0%)   0.0034 (  0.8%)   0.0120 (  0.1%)   0.1024 (  0.5%)  Write output
  18.2577 (100.0%)   0.4076 (100.0%)  18.6654 (100.0%)  18.8064 (100.0%)  Total

No surprises there, I suppose.
Quuxplusone commented 3 years ago

Yep, those are usually the three culprits. I should add more instrumentation to GlobalISel and InstrInfo so we can see which backend phases take the time. Then we can look at the longest phases and see if there is anything we can do.

In DagISel, the pattern parsing might be able to be improved.

Quuxplusone commented 3 years ago

I think a lot of time is spent in GenerateVariants. Do we need to do that for InstrInfo?

Quuxplusone commented 3 years ago

GenerateVariants() is a phase of building a CodeGenDAGPatterns instance. Instances a built by DAGIselEmitter, FastIselEmitter, GlobalIselEmitter. InstrDocsEmitter, InstrInfoEmitter.

GenerateVariants() has one double-nested loop and one triple-nested loop.The first loop calls GenerateVariantsOf(), which is pretty hairy.

I don't know enough about variants to get a feeling for how this might be improved. I can spend some time studying it. Do you have a sense of where it could be improved?

Quuxplusone commented 3 years ago

There were two patches in 2018 that were intended to improve the performance of GenerateVariants https://reviews.llvm.org/D51035 and https://reviews.llvm.org/D51839 on debug builds on MSVC. They might have slowed down Release builds or non-MSVC builds. Need to do some more investigation.

Quuxplusone commented 3 years ago

Or they are less helpful on RISCV than they were on X86.

Quuxplusone commented 3 years ago
I'm going to see if I can switch the predicates in PatternToMatch back to a
simple to compare type like ListInit* which is what it was prior to the
introduction of HwMode in 2017. That might allow us to remove the large
std::vector of BitVector from GenerateVariants.
Quuxplusone commented 3 years ago

Patches to try to improve things https://reviews.llvm.org/D100691 https://reviews.llvm.org/D100695

Quuxplusone commented 3 years ago
Simon, are you able to tell how much things have improved? My debug build on my
Linux server is currently about 29 seconds to generate RISCVGenInstrInfo.inc,
but I think MSVC time might be higher?

Summary of changes made so far
-Expand variants before expanding HwMode. Since expand variants is quadratic in
the number of patterns, it is better to do it before we double the patterns by
expanding hw mode.
-Patterns are stored in PatternToMatch now using ListInit * making them cheap
to compare in GenerateVariants where we used to have a std::vector compare.
-Removed BitVector based predicate filtering in GenerateVariants. This removed
a bunch of heap overhead for storing a nxn matrix of bits where n is the number
of patterns.

Minor things:
-Replaced std::set or DenseSet with SmallSet or MachineValueTypeSet which
should cut down on some heap allocation or hashing.
Quuxplusone commented 3 years ago
(In reply to Craig Topper from comment #12)
> Simon, are you able to tell how much things have improved? My debug build on
> my Linux server is currently about 29 seconds to generate
> RISCVGenInstrInfo.inc, but I think MSVC time might be higher?

Yeah, MSVC tends to be slower for debug builds.

I'm away from my main dev machine for the holiday weekend so haven't had chance
to compare full builds yet.

Fraser - any chance you can repeat those timing reports against trunk latest
please?
Quuxplusone commented 3 years ago
Certainly. As before, just using -time-phases in a release build:

[12/17] Building RISCVGenGlobalISel.inc...
===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 5.4829 seconds (5.7914 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   4.5548 ( 86.4%)   0.1697 ( 81.0%)   4.7245 ( 86.2%)   4.8007 ( 82.9%)  Backend overall
   0.7185 ( 13.6%)   0.0398 ( 19.0%)   0.7583 ( 13.8%)   0.9906 ( 17.1%)  Parse, build records
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.0%)   0.0002 (  0.0%)  Write output
   5.2734 (100.0%)   0.2095 (100.0%)   5.4829 (100.0%)   5.7914 (100.0%)  Total

[11/46] Building RISCVGenInstrInfo.inc...
===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 5.3306 seconds (5.7432 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   4.0939 ( 79.8%)   0.1364 ( 68.3%)   4.2302 ( 79.4%)   4.3308 ( 75.4%)  Analyze DAG patterns
   0.6909 ( 13.5%)   0.0633 ( 31.7%)   0.7542 ( 14.1%)   1.0660 ( 18.6%)  Parse, build records
   0.1299 (  2.5%)   0.0000 (  0.0%)   0.1299 (  2.4%)   0.1300 (  2.3%)  Emit helper methods
   0.1100 (  2.1%)   0.0000 (  0.0%)   0.1100 (  2.1%)   0.1101 (  1.9%)  Emit map table
   0.0437 (  0.9%)   0.0000 (  0.0%)   0.0437 (  0.8%)   0.0437 (  0.8%)  Emit InstrDesc records
   0.0258 (  0.5%)   0.0000 (  0.0%)   0.0258 (  0.5%)   0.0258 (  0.4%)  Emit operand info
   0.0117 (  0.2%)   0.0000 (  0.0%)   0.0117 (  0.2%)   0.0117 (  0.2%)  Emit operand type mappings
   0.0104 (  0.2%)   0.0000 (  0.0%)   0.0104 (  0.2%)   0.0104 (  0.2%)  Emit instruction names
   0.0035 (  0.1%)   0.0000 (  0.0%)   0.0035 (  0.1%)   0.0035 (  0.1%)  Emit logical operand size mappings
   0.0035 (  0.1%)   0.0000 (  0.0%)   0.0035 (  0.1%)   0.0035 (  0.1%)  Emit uses/defs
   0.0034 (  0.1%)   0.0000 (  0.0%)   0.0034 (  0.1%)   0.0034 (  0.1%)  Emit logical operand type mappings
   0.0033 (  0.1%)   0.0000 (  0.0%)   0.0033 (  0.1%)   0.0033 (  0.1%)  Emit operand name mappings
   0.0009 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.0%)   0.0009 (  0.0%)  Write output
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Emit initialization routine
   5.1310 (100.0%)   0.1997 (100.0%)   5.3306 (100.0%)   5.7432 (100.0%)  Total

[13/16] Building RISCVGenDAGISel.inc...
===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 6.6774 seconds (7.0148 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   3.6783 ( 56.4%)   0.1099 ( 71.8%)   3.7882 ( 56.7%)   3.8855 ( 55.4%)  Parse patterns
   1.2059 ( 18.5%)   0.0000 (  0.0%)   1.2059 ( 18.1%)   1.2068 ( 17.2%)  Convert to matchers
   0.7158 ( 11.0%)   0.0365 ( 23.9%)   0.7524 ( 11.3%)   0.9905 ( 14.1%)  Parse, build records
   0.4000 (  6.1%)   0.0000 (  0.0%)   0.4000 (  6.0%)   0.4006 (  5.7%)  Sort patterns
   0.2748 (  4.2%)   0.0000 (  0.0%)   0.2748 (  4.1%)   0.2752 (  3.9%)  Optimize matchers
   0.2464 (  3.8%)   0.0066 (  4.3%)   0.2530 (  3.8%)   0.2532 (  3.6%)  Emit matcher table
   0.0031 (  0.0%)   0.0000 (  0.0%)   0.0031 (  0.0%)   0.0031 (  0.0%)  Write output
   6.5244 (100.0%)   0.1530 (100.0%)   6.6774 (100.0%)   7.0148 (100.0%)  Total

One thing that may be worth mentioning now is that the various MC emitters
(RISCVGenAsmWriter.inc, RISCVGenMCCodeEmitter.inc,
RISCVGenDisassemblerTables.inc, RISCVGenAsmMatcher.inc) take 1.2 - 1.4 seconds
each. Now that the other emitters have been reduced well, these are taking up a
larger percentage of the time. It's 70-80% "Parse, build records" time for
those.
Quuxplusone commented 3 years ago

I'm going to spend some time in the next few weeks looking at TableGen parsing to see if there is anything that can be done relatively easily to speed it up. I'm not optimistic, but it's worth a good look.

Quuxplusone commented 3 years ago

Hang on, are those two sets of timings comparing apples with apples? If so, nice!

Quuxplusone commented 3 years ago

Thanks - I can confirm my build times have definitely improved!

Quuxplusone commented 3 years ago

Has anyone any ideas on what to do next on this?

This is a very general ticket, so if we have more specific improvements in mind, I'm happy for this to be resolved and those improvements to be handled in their own tickets.

Quuxplusone commented 3 years ago

If it helps, I just looked into the fairly nebulous "Analyze DAG patterns" and found that the call to CodeGenDAGPatterns::GenerateVariants takes about 50% of the -gen-instr-info time. I don't know how easy that would be to optimize.

Quuxplusone commented 3 years ago

I spent a couple of days fooling around with the TableGen lexer trying to speed it up. No joy there. I could knock off 1–2% of the time, which was probably just noise.

I think we can close this ticket.

Quuxplusone commented 3 years ago
(In reply to Paul C. Anagnostopoulos from comment #20)
> I spent a couple of days fooling around with the TableGen lexer trying to
> speed it up. No joy there. I could knock off 1–2% of the time, which was
> probably just noise.
>
> I think we can close this ticket.

I don't agree - until we've identified the next things to work on to reduce
tblgen  runtimes, we shouldn't close this.
Quuxplusone commented 3 years ago

A few ideas

-Call InferInstructionFlags and VerifyInstructionFlags before GenerateVariants and ExpandHwModeBasedTypes and don't call GenerateVariants or ExpandHwModeBasedTypes for GenInstrInfo.inc. This assumes that GenInstrInfo.inc doesn't care about the expanded patterns and only cares about the mayLoad/mayStore/etc flags and that those flags don't depend on the expansion.

-Replace the std::map in InfoByHwMode by a "small" data structure of some sort. It might be possible to just maintain a sorted SmallVector. The number of modes is small so we should be able to quickly do ordered inserts. This may not give us much, but we currently are guaranteed a heap allocation every time this or its subclasses are used. I think I have a prototype of this.

Quuxplusone commented 3 years ago

Okay, we can certainly keep it open.