clash-lang / clash-compiler

Haskell to VHDL/Verilog/SystemVerilog compiler
https://clash-lang.org/

Partial evaluation #950

Open christiaanb opened 4 years ago

christiaanb commented 4 years ago

Clash's existing compilation method of exhaustive rewriting, while significantly improved over the years, turns out to be a large performance bottleneck, especially when forced to unroll loops. Specifically, what hurts us the most is the successive traversal of the expression ADT to achieve what is basically compile-time evaluation in a very roundabout way.

The idea is to repurpose Clash's existing WHNF-evaluator into an NF-evaluator that runs as a first pass of the compiler. Given that it was specifically designed for compile-time evaluation, it is much faster at this job than the exhaustive rewrite system. An early prototype of this approach did, however, show a lot of circuit duplication. So the final implementation should:

  1. Make sure that the new NF-evaluator has techniques to stop duplication
  2. Improve the existing Common-Subexpression-Elimination pass to recover more sharing.
  3. Prevent not only duplication in the circuit, but also the NF-evaluator re-evaluating already evaluated expressions (e.g. using some sort of tag)
  4. Attempt a parallel implementation, e.g. evaluate case-branches in parallel. This needs investigation with regard to sharing the evaluation of common subexpressions, and sharing/merging evaluator data-structures.
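
To make the WHNF-to-NF step concrete, here is a minimal sketch on a toy untyped lambda calculus with de Bruijn indices. It is not Clash's Core or its evaluator; `Term`, `whnf` and `nf` are hypothetical names chosen for illustration. It only shows the shape of the idea: `nf` calls `whnf` at the head and then keeps normalising under binders and inside arguments. It does nothing about sharing, so it would show exactly the duplication problem from points 1 to 3.

```haskell
-- Toy lambda calculus with de Bruijn indices (illustrative only, not Clash Core).
data Term
  = Var Int        -- variable, counted outwards from the nearest binder
  | Lam Term       -- lambda abstraction
  | App Term Term  -- application
  deriving (Eq, Show)

-- shift d c t: add d to every variable in t whose index is >= c.
shift :: Int -> Int -> Term -> Term
shift d c t = case t of
  Var k
    | k >= c    -> Var (k + d)
    | otherwise -> Var k
  Lam b   -> Lam (shift d (c + 1) b)
  App f a -> App (shift d c f) (shift d c a)

-- subst j s t: replace variable j in t by s.
subst :: Int -> Term -> Term -> Term
subst j s t = case t of
  Var k
    | k == j    -> s
    | otherwise -> Var k
  Lam b   -> Lam (subst (j + 1) (shift 1 0 s) b)
  App f a -> App (subst j s f) (subst j s a)

-- beta b a: contract the redex (Lam b) applied to a.
beta :: Term -> Term -> Term
beta b a = shift (-1) 0 (subst 0 (shift 1 0 a) b)

-- Weak head normal form: reduce only the head, never under a binder
-- and never inside an argument.
whnf :: Term -> Term
whnf t = case t of
  App f a -> case whnf f of
    Lam b -> whnf (beta b a)
    f'    -> App f' a
  _ -> t

-- Normal form: reuse whnf for the head, then recurse into what it left alone.
nf :: Term -> Term
nf t = case whnf t of
  Lam b   -> Lam (nf b)
  App f a -> App (nf f) (nf a)
  v       -> v
```

For example, `whnf (Lam (App (Lam (Var 0)) (Var 0)))` returns its argument unchanged because the redex sits under a binder, while `nf` reduces it to `Lam (Var 0)`. Note that `nf` may loop on terms without a normal form, which is why some termination measure is still needed.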

Various thoughts

Relevant literature

Remarks with regard to the above literature: their QoR metrics talk about allocation and code size, which don't really apply to our use case; we care about eventual circuit size. This means we should make different trade-offs than the above literature, but what's nice is that they do point out where such trade-offs/decisions can be made. Additionally, we should expect that we get to unroll all recursion; any recursion that cannot be fully unrolled basically isn't a structural description of a circuit (which is the view that Clash has on Haskell programs). We have to check how this aspect will influence our termination methods: I think we should still have termination measures in place, as Clash shouldn't loop forever when it's given an incorrect structural hardware description.
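
As a rough illustration of what such a termination measure could look like, here is a sketch that uses a simple fuel counter. This is an assumption made for the example, not what the literature above or Clash prescribes. The `Circuit` type, `step` and `unrollWithFuel` are hypothetical: `Repeat n` stands for recursion with a statically known depth, and `Forever` for recursion that never reaches a base case.

```haskell
-- Toy "structural description" (hypothetical, for illustration only).
data Circuit
  = Wire           -- fully elaborated leaf
  | Stage Circuit  -- one piece of hardware followed by the rest
  | Repeat Int     -- recursion with a statically known depth
  | Forever        -- recursion without a base case
  deriving (Eq, Show)

-- One unrolling step, or Nothing when there is nothing left to unroll.
step :: Circuit -> Maybe Circuit
step c = case c of
  Wire    -> Nothing
  Stage r -> Stage <$> step r
  Repeat n
    | n <= 0    -> Just Wire
    | otherwise -> Just (Stage (Repeat (n - 1)))
  Forever -> Just (Stage Forever)

-- Either the fully unrolled circuit, or the partial result at the point
-- where the fuel ran out.
data Outcome
  = Done Circuit
  | OutOfFuel Circuit
  deriving (Eq, Show)

-- Keep unrolling until no step applies or the fuel is spent.
unrollWithFuel :: Int -> Circuit -> Outcome
unrollWithFuel fuel c = case step c of
  Nothing -> Done c
  Just c'
    | fuel <= 0 -> OutOfFuel c
    | otherwise -> unrollWithFuel (fuel - 1) c'
```

With enough fuel, `unrollWithFuel 10 (Repeat 2)` gives `Done (Stage (Stage Wire))`, whereas `unrollWithFuel 2 Forever` gives `OutOfFuel (Stage (Stage Forever))`, so the compiler can report an error instead of looping. A plain step counter is the bluntest possible measure; the point is only that the evaluator returns control when a description cannot be fully unrolled.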

Hopefully fixes issues:

alex-mckenna commented 4 years ago

In a mildly annoying turn of events, the most recent generic-trie on Hackage doesn't support base-4.13.0.0 (used by GHC 8.8.1). As the most recent commit in that project's GitHub repository does support it, a reference to that specific commit has been added to the cabal.project file as a source repository.

This should be removed as soon as generic-trie is next released on Hackage.

alex-mckenna commented 4 years ago

Update: Another paper, Taming Code Explosion in Supercompilation, seems to offer a reasonable approach to reducing the size of code after supercompilation. It might be adaptable to reducing circuit duplication in designs.

alex-mckenna commented 4 years ago

Update: the latest PR for this issue (#1288) provides the common semantics for the evaluator, going all the way to beta-normal eta-long form (NF). However, it should be noted that this is not the full partial evaluator as intended, but simply a good place to stop and get feedback. As I see it, there are three problems to be solved before the level of whole program optimization we're after is achieved: