chipsalliance / chisel

Chisel: A Modern Hardware Design Language
https://www.chisel-lang.org/
Apache License 2.0

Chisel3 compile time much slower than Chisel2 #143

Closed · colinschmidt closed this 8 years ago

colinschmidt commented 8 years ago

Chisel 3 takes about 9-10 minutes to generate FIRRTL for this configuration of Hwacha. Chisel 2 generates Verilog for the same config in 3 minutes.

Chisel3 Output:

[info] Running rocketchip.TestGenerator rocketchip Top ISCA2016Config --W0W --minimumCompatibility 3.0.0 --backend v --configName ISCA2016Config --compileInitializationUnoptimized --targetDir /scratch/colins/rocket-chip/vsim/generated-src --configDump --noInlineMem
[info] [0.001] Elaborating design...
Generated Address Map
        conf:devicetree 40000000 - 40007fff
        conf:csr0 40008000 - 4000ffff
        conf:scr 40010000 - 400101ff
Generated Configuration String
platform {
  vendor ucb;
  arch rocket;
};
ram {
  0 {
    addr 0;
    size 0x40000000;
  };
};
core {
  0 {
    0 {
      isa rv64;
      addr 0x40008000;
    };
  };
};

UNCORE_SCR: 40010000 -> N_CORES
UNCORE_SCR: 40010001 -> MMIO_BASE
UNCORE_SCR: 40010002 -> MEMORY_CHANNEL_MUX_SELECT
UNCORE_SCR: 4001003f -> HTIF_IO_CLOCK_DIVISOR
[info] [519.435] Done elaborating.

Chisel2 Output:

[info] Running rocketchip.TestGenerator rocketchip Top ISCA2016Config --W0W --minimumCompatibility 3.0.0 --backend v --configName ISCA2016Config --compileInitializationUnoptimized --targetDir /scratch/colins/rocket-chip/vsim/generated-src --configDump --noInlineMem
Generated Address Map
        conf:devicetree 40000000 - 40007fff
        conf:csr0 40008000 - 4000ffff
        conf:scr 40010000 - 400101ff
Generated Configuration String
platform {
  vendor ucb;
  arch rocket;
};
ram {
  0 {
    addr 0;
    size 0x40000000;
  };
};
core {
  0 {
    0 {
      isa rv64;
      addr 0x40008000;
    };
  };
};

UNCORE_SCR: 40010000 -> N_CORES
UNCORE_SCR: 40010001 -> MMIO_BASE
UNCORE_SCR: 40010002 -> MEMORY_CHANNEL_MUX_SELECT
UNCORE_SCR: 4001003f -> HTIF_IO_CLOCK_DIVISOR
[info] [126.447] // COMPILING < (class rocketchip.Top)>(14)
...
[info] [171.427] COMPILING <Top (class rocketchip.Top)> 14 CHILDREN (9,0)
colinschmidt commented 8 years ago

You can run this code by checking out the hwacha branch and typing make in vsim

sdtwigg commented 8 years ago

Do you have a breakdown of the time spent? At the highest level: time starting up chisel3 (sbt overhead), time of the pre-firrtl chisel3 invocation, time in firrtl, time in verilator, etc. (Then we can also consider finer time points, like time spent actually elaborating the graph, time serializing/deserializing the graph to go into firrtl, and the time of individual firrtl passes. In parallel, we can look into plugging in a Java profiler.)

A cursory examination of chisel3 and firrtl yields some potential for improvement but it would be good to first get a baseline of where the slowdowns are.

(As a concrete example, I've been tempted to introduce to firrtl a caching trick I used in gama to bring nested hashCode calculations down to constant time. This can be further extended to bring nested tree equality checks closer to constant time, if that is helpful.)
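The caching trick described above can be sketched in plain Scala. This is an illustrative toy IR (Lit/Add are invented names, not firrtl's actual node types): each node memoizes its structural hash with a lazy val, so after the first computation a deep tree hashes without revisiting its children.

```scala
// Toy expression tree; the hash of a node is computed from its children's
// cached hashes and memoized, so repeated hashing of a deep tree is O(1).
sealed trait Expr {
  lazy val structuralHash: Int = this match {
    case Lit(v)    => ("Lit", v).hashCode
    case Add(a, b) => ("Add", a.structuralHash, b.structuralHash).hashCode
  }
}
case class Lit(v: Int) extends Expr
case class Add(a: Expr, b: Expr) extends Expr
```

Because the hash depends only on structure, two structurally equal trees yield the same cached hash, which is the property an IR cache needs.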

aswaterman commented 8 years ago

I cache hashCodes in the FIRRTL optimization passes I've added. But it sounds like in this case Chisel3 is the culprit.

colinschmidt commented 8 years ago

All of that time is in the pre-firrtl invocation. I guess some Scala profiling of chisel3 itself is necessary.

aswaterman commented 8 years ago

I'll look into it.

colinschmidt commented 8 years ago

I had about 30 Vec.fill calls in bundles remaining from the chisel2 conversion. Fixing these to use the Vec constructor cut the time down to 6 minutes:

[info] [360.144] Done elaborating.

aswaterman commented 8 years ago

Yeah, there are two problems... the chisel3 frontend is about 3x slower than chisel2 on this kind of code, and the coding style is bad for chisel/firrtl/vcs performance overall. There are probably places where Vecs of Bool can be replaced by UInts, so that the operations proceed in a data-parallel fashion.

colinschmidt commented 8 years ago

I agree I could replace Vec of Bools with UInts, but Vec of Bools essentially provides me with the "new Bits" class that only supports logical operators.

aswaterman commented 8 years ago

Is that little bit of safety really worth the incredible compile-time and simulation-performance cost? How often do people accidentally do arithmetic on these things?

aswaterman commented 8 years ago

So, a lot of the time is being spent in cloneType method calls that result from creating Vecs. I'm looking into the possibility of calling cloneType lazily.

sdtwigg commented 8 years ago

Vec currently subclasses IndexedSeq, which generally assumes accessing all elements is okay.

e.g. strictness will be forced when hashCode is called or equality is checked against another sequence collection (technically a GenSeqLike descendent). Actually, this means Vecs have the same equals semantics as other Seq descendants, which is somewhat weird (particularly for Vecs that happen to be empty).
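The Seq-equality point above is easy to demonstrate in plain Scala: equality is structural across concrete Seq types, so a Vec that subclasses IndexedSeq inherits those semantics, including the oddity that any two empty sequences compare equal.

```scala
// Scala Seq equality is structural across concrete collection types.
val sameElems  = List(1, 2, 3) == Vector(1, 2, 3)     // element-wise comparison
val emptyEqual = List.empty[Int] == Vector.empty[Int] // all empty Seqs are equal
```

For a hardware Vec this means two empty Vecs of entirely different element types would compare equal, which is the "somewhat weird" behavior noted above.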

aswaterman commented 8 years ago

@colinschmidt profiling shows that a huge (40%) fraction of runtime is spent throwing MatchErrors stemming from PrivateConfigs.scala:56, Configs.scala:306, and hwacha//configs.scala:9.

This doesn't happen for rocket-chip, which suggests an inefficiency or deficiency in the Hwacha configurations. I do see that CDE regularly catches MatchErrors -- perhaps @hcook can shed some light on what's going on.

FYI, exceptions don't need to be slow: if you declare your own exceptions, and override fillInStackTrace to do nothing, then throwing and catching them is fast indeed. But if you use built-in exceptions that just extend Throwable, they are extremely slow.
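A minimal sketch of the cheap-exception idea (the class and method names here are illustrative, not cde's API): constructing a Throwable is expensive mainly because fillInStackTrace walks the stack, so overriding it to do nothing makes throw/catch suitable for control flow. This is the same trick scala.util.control.NoStackTrace uses.

```scala
// A control-flow exception that skips stack-trace capture; throwing and
// catching it avoids the expensive part of Throwable construction.
class FastMissException extends RuntimeException {
  override def fillInStackTrace(): Throwable = this
}

def lookupOrMiss(map: Map[String, Int], key: String): Int =
  map.getOrElse(key, throw new FastMissException)

def lookupWithDefault(map: Map[String, Int], key: String, default: Int): Int =
  try lookupOrMiss(map, key)
  catch { case _: FastMissException => default }
```

Since fillInStackTrace never runs, getStackTrace on such an exception returns an empty array, confirming no capture took place.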

aswaterman commented 8 years ago

@colinschmidt I got that figure wrong; hprof tells me 96% of the time is spent in the parameter system... so if you fix that, you're back down into the 20-second range.

aswaterman commented 8 years ago

@sdtwigg yeah, I realized my suggestion was silly shortly after making it.

sdtwigg commented 8 years ago

How exposed is cde? I looked into it and am having trouble formulating a fix for the "exceptions as flow control" behavior you noted without API-level fixes. (Also, I don't think scala.MatchError is overridable since it comes from the compiler/runtime.)

Edit: cde = https://github.com/ucb-bar/context-dependent-environments

aswaterman commented 8 years ago

@sdtwigg it's 6 months old and rocket-chip is the principal user... @hcook could pipe up but I think it's still possible to make API changes. In particular, if what @colinschmidt is seeing is really the intended behavior, the performance hit justifies an API change.

A backwards-compatible fix would be to suggest that users terminate their match statements with

case _ => throw cde.MatchError

which extends scala.MatchError with fillInStackTrace overridden. Or something along those lines.

sdtwigg commented 8 years ago

scala.MatchError is final but we can just extend the catches in cde to catch either scala.MatchError or a new cde.MatchError.
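A sketch of that widened-catch approach (names are illustrative, not cde's actual API): since scala.MatchError is final, the cheap variant cannot extend it, so the resolution site simply accepts either exception type as a miss.

```scala
// Cheap stand-in for scala.MatchError; skips costly stack-trace capture.
class CheapMatchError extends RuntimeException {
  override def fillInStackTrace(): Throwable = this
}

// Try a partial function; treat either a compiler-thrown scala.MatchError
// or the user-thrown cheap variant as "not found" and fall back.
def resolve[A, B](pf: PartialFunction[A, B], fallback: A => B)(key: A): B =
  try pf(key)
  catch {
    case _: MatchError      => fallback(key)
    case _: CheapMatchError => fallback(key)
  }
```

Existing match statements keep working unchanged; performance-sensitive ones can opt in by throwing CheapMatchError from their default case.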

colinschmidt commented 8 years ago

OK, thanks for investigating. I'll talk to @hcook about what's going on with cde in hwacha compared to stock rocket-chip.

sdtwigg commented 8 years ago

(It is hard to find in the code, but I can somewhat safely assume that) you create the Config object for RocketChip + Hwacha with code like class HwachaRocketConfigs extends Config(new RocketConfig ++ new HwachaConfig).

However, in cde, what actually happens is that, for every parameter resolution, it first tries to find it in RocketConfig. If it fails, a match error is thrown and HwachaConfig is searched. So, nearly any parameter for Hwacha that is resolved requires throwing at least one match error first. (Actually, RocketConfig is itself built with ++ operations so multiple MatchErrors are likely being thrown until the Hwacha parameter is resolved.) Then, UsesHwachaParameters uses vals to cache parameter lookups (let's say 15). Thus, every Hwacha Module will require at least 15 exceptions to be handled.
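The resolution chain described above can be modeled in a few lines of plain Scala (a toy model, not cde's real implementation): each config is a PartialFunction, a miss surfaces as a thrown MatchError, and the chain only moves on to the next config after catching it. Counting the throws makes the cost visible.

```scala
// Toy model of cde-style config chaining: PartialFunction.apply throws
// MatchError on a miss, and the chain catches it to try the next config.
type Config = PartialFunction[String, Any]

class Chain(configs: Config*) {
  var matchErrorsThrown = 0
  def apply(key: String): Any = {
    var i = 0
    while (i < configs.length) {
      try { return configs(i)(key) }
      catch { case _: MatchError => matchErrorsThrown += 1 }
      i += 1
    }
    sys.error(s"parameter $key not found")
  }
}
```

Resolving a parameter defined only in the second config costs one MatchError per lookup; with deeper ++ chains the throws multiply accordingly.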

PS: I actually looked at the associated sources after writing this. The structure is more like HwachaConfig = TestConfig ++ HwachaConfig ++ L2Config. TestConfig has barely any parameters in it, so every Module that extends WithHwachaParams will invoke ~15 exceptions. Then, Rocket uses the same WithCoreParameter trick (which extends another HasAddrMapParameters), so you are looking at ~25 parameters. L2Config itself is composed with a ++. Ultimately, in Hwacha, every Rocket CoreModule invokes ~75 exceptions. This assumes the actual construction of a Module does not 'alter' the parameter object. (Each alter potentially adds up to ~25 exceptions per submodule.)

If you are blocked and need a couple of fixups, you can reorganize RocketConfig so the most likely config resolver comes first. (Swapping the L2Config ++ would cut ~25 exceptions from CoreModule.) Then, instead of strictly caching the parameters with val, use lazy val so that only Modules using a parameter incur the penalty (the parameter stack is no longer tracked dynamically, so this is much safer to do). Finally, the true solution is, as we discussed, to re-architect cde to use better flow control, or at least a cheaper exception. (Actually, to ensure you catch all the small matches that need to be amended, you would want to, at least temporarily, disable catching MatchErrors in cde.)
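The val-versus-lazy-val fix can be sketched in plain Scala (EagerModule/LazyModule and the parameter names are illustrative): with plain vals, every module construction pays for every parameter lookup up front; with lazy vals, only the parameters actually read incur the (exception-heavy) resolution cost.

```scala
// Counts lookups so the eager/lazy difference is observable.
class Params(resolve: String => Int) {
  var lookups = 0
  def apply(key: String): Int = { lookups += 1; resolve(key) }
}

class EagerModule(p: Params) {
  val nLanes = p("NLanes") // resolved at construction, whether used or not
  val nBanks = p("NBanks")
}

class LazyModule(p: Params) {
  lazy val nLanes = p("NLanes") // resolved on first use only
  lazy val nBanks = p("NBanks")
}
```

Constructing an EagerModule triggers both lookups immediately; a LazyModule triggers none until a field is actually read.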

colinschmidt commented 8 years ago

I thought that in order to correctly override the parameters in the L2Config it would need to come after the overriding config, so putting it at the front of the ++ chain would break things. What did you mean by "the parameter stack is no longer tracked dynamically so this is much safer to do"? Does this allow the reordering now?

sdtwigg commented 8 years ago

Ah, right: regarding the first point, you can't do it when the left side is supplying overriding parameters. (Does withTests do any overrides? It doesn't look like it resolves many parameters, and it would be better as the last config if it can be.)

That statement was applied to the second case. In old chisel2, the parameter stack was tracked dynamically at runtime/hardware elaboration. This was dangerous because if a module or bundle was revisited later, the parameters may have shifted (likely causing clone methods to fail their contract).

In cde, parameters are passed via implicits and occasionally explicitly through the Module constructors, so the parameters associated with a module/bundle are essentially resolved statically at Scala compile time and 'cached' by the module. Thus, you don't have to worry about the parameter stack being different between resolutions (unless you are doing something very, very strange in your module). So, you can use a lazy val (which will delay resolution until actually needed).
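The static-capture point above can be sketched in plain Scala (Parameters and CoreModule here are illustrative stand-ins, not cde's real types): an implicit parameter is bound at construction, so a lazy val resolved much later still reads the same configuration the module was built with.

```scala
// Implicitly passed parameters are captured when the module is constructed.
case class Parameters(table: Map[String, Int]) {
  def apply(key: String): Int = table(key)
}

class CoreModule(implicit p: Parameters) {
  // Safe to defer with lazy val: `p` was fixed at construction and
  // cannot shift between now and the first read.
  lazy val xLen = p("XLen")
}
```

This is the contrast with chisel2's dynamically tracked parameter stack, where a deferred read could observe a different stack than the one in effect at construction.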

colinschmidt commented 8 years ago

Seeing as this is a library issue, I'm closing.