Open christiaanb opened 4 years ago
In a mildly annoying turn of events, the most recent `generic-trie` on Hackage doesn't support `base-4.13.0.0` (used by GHC 8.8.1). As the most recent commit on that project's GitHub does support it, a reference to that specific commit has been added to the `cabal.project` file as a source repository. This should be removed as soon as `generic-trie` is next released on Hackage.
Update: Another paper, *Taming Code Explosion in Supercompilation*, seems to offer a reasonable approach to reducing the size of code after supercompilation. It might be adaptable to reducing circuit duplication in designs.
Update: the latest PR for this issue (#1288) provides the common semantics for the evaluator, going all the way to beta-normal eta-long form (NF). However, it should be noted that this is not the full partial evaluator as intended, but simply a good place to stop and get feedback. As I see it, there are three problems to be solved before the level of whole program optimization we're after is achieved:
Evaluating primitives
The new evaluator does not yet have any rules for evaluating primitives. A lot of this work was started in the `partial-evaluator-2` branch and should be tidied up and moved over. In particular, we want to keep the sensible coercions between `Value` and plain Haskell types.
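As a rough sketch of what such primitive rules and coercions could look like (all names here, `Value`, `evalPrim`, `fromValue`, `toValue`, are hypothetical stand-ins, not Clash's actual evaluator API):

```haskell
-- Hypothetical sketch only: Value, evalPrim, fromValue, toValue are
-- illustrative stand-ins, not Clash's actual evaluator API.

-- A tiny stand-in for the evaluator's value type.
data Value
  = VLit Integer         -- a fully evaluated literal
  | VPrim String [Value] -- a primitive stuck on unknown arguments
  deriving (Eq, Show)

-- Sensible coercions between Value and plain Haskell types.
fromValue :: Value -> Maybe Integer
fromValue (VLit i) = Just i
fromValue _        = Nothing

toValue :: Integer -> Value
toValue = VLit

-- A primitive rule: evaluate in Haskell when all arguments are
-- statically known, otherwise leave the application as a stuck Value.
evalPrim :: String -> [Value] -> Value
evalPrim "GHC.Num.+#" [x, y]
  | Just a <- fromValue x
  , Just b <- fromValue y
  = toValue (a + b)
evalPrim nm args = VPrim nm args
```

The point of the coercions is that each primitive rule can be written against plain Haskell types, with a single projection/injection pair at the boundary.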
Unrolling recursion
We want to be able to unroll and evaluate recursive definitions in the partial evaluator. However, this is not possible in every case (Haskell is not a total language, and some terminating Haskell programs may require non-strict evaluation to converge on a value). We could implement something more naive here, but I think a better way would be to write a termination analyser to be run before normalisation (as we want to be able to partially evaluate as far as possible). The foetus termination checker of Andreas Abel looks to be the best fit for this.
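A minimal illustration of the structural-descent idea behind such a termination check (the `Expr` type and `terminates` function are invented for this sketch; Abel's actual foetus checker is more general, tracking size-change information across mutually recursive call graphs):

```haskell
-- Invented sketch of a structural-descent termination check; the real
-- foetus checker does size-change analysis over whole call graphs.
data Expr
  = Var String
  | Pred Expr          -- structurally smaller (e.g. the tail of a vector)
  | Call String [Expr] -- a (possibly recursive) call
  | Lit Integer
  deriving (Eq, Show)

-- Does every recursive call to `f` pass a structurally smaller value
-- for the decreasing argument at position `i`?
terminates :: String -> Int -> Expr -> Bool
terminates f i = go
 where
  go (Call g args)
    | g == f    = decreasing (args !! i) && all go args
    | otherwise = all go args
  go (Pred e)   = go e
  go _          = True

  -- "Smaller" here means: wrapped in at least one Pred.
  decreasing (Pred _) = True
  decreasing _        = False
```

A definition that passes such a check can be unrolled safely; everything else would be left alone (or rejected) rather than risking a non-terminating compiler.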
Removing transformations
As mentioned above, several transformation passes become redundant when the partial evaluator is implemented as intended. Removing these and folding their behaviour into the semantics of the evaluator is a fairly large PR in itself, and one which would benefit from being reviewed separately.
Clash's existing compilation method of exhaustive rewriting, while significantly improved over the years, turns out to be a large performance bottleneck, especially when forced to unroll loops. Specifically, it's the successive traversal of the expression ADT to achieve compile-time evaluation in a very roundabout way that is hurting us the most.
The idea is to repurpose Clash's existing WHNF evaluator into an NF evaluator, run as a first pass of the compiler. Given that it was specifically designed for compile-time evaluation, it is much faster at this job than the exhaustive rewrite system. An early prototype of this approach did, however, show a lot of circuit duplication. So the final implementation should avoid duplicating circuits.
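To illustrate the WHNF/NF distinction on a toy term language (this is not Clash's `Term` type; the point is that an NF evaluator must also normalise underneath constructors, and in the real evaluator underneath binders, where a WHNF evaluator stops at the outermost one):

```haskell
-- Illustration only: a toy term language showing why a WHNF evaluator
-- stops too early for whole-program evaluation.
data T
  = Lit Int
  | Add T T    -- a strict primitive
  | Pair T T   -- a lazy constructor
  deriving (Eq, Show)

-- WHNF: reduce until the outermost constructor, then stop.
whnf :: T -> T
whnf (Add a b) =
  case (whnf a, whnf b) of
    (Lit x, Lit y) -> Lit (x + y)
    (a', b')       -> Add a' b'
whnf t = t

-- NF: additionally normalise underneath constructors.
nf :: T -> T
nf t = case whnf t of
  Pair a b -> Pair (nf a) (nf b)
  t'       -> t'
```

On `Pair (Add (Lit 1) (Lit 2)) (Lit 3)`, `whnf` returns the term unchanged (it stops at `Pair`), while `nf` also folds the addition inside.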
Various thoughts
- `Value` should contain `Tick`ed and `Cast`ed values; this way the evaluator can preserve both. Ticks must be maintained because they contain register and instance names. Adding these values means we have to update the primitive evaluation rules (https://github.com/clash-lang/clash-compiler/blob/master/clash-ghc/src-ghc/Clash/GHC/Evaluator.hs), as they need to be able to look through ticks (a la `coreView`). Additionally, some casts may have to be translated to `Prim`s, such as casts from `Integer` to `Unsigned`.
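A hypothetical sketch of what keeping ticks (and casts) in `Value` could look like; the constructor and function names are illustrative, not Clash's:

```haskell
-- Hypothetical sketch: these constructors and helpers are illustrative,
-- not Clash's actual Value type.
data Ty = IntegerTy | UnsignedTy Int
  deriving (Eq, Show)

data Value
  = VLit Integer
  | VTick String Value -- e.g. a setName/instance-name tick
  | VCast Ty Ty Value  -- a cast from one type to another
  deriving (Eq, Show)

-- Primitive rules need to look through ticks (a la coreView) without
-- discarding them from the final result: peel them off...
collectTicks :: Value -> ([String], Value)
collectTicks (VTick nm v) = let (ts, v') = collectTicks v in (nm : ts, v')
collectTicks v            = ([], v)

-- ...and wrap them back around the evaluated result.
reTick :: [String] -> Value -> Value
reTick ts v = foldr VTick v ts
```

A primitive rule would then call `collectTicks`, evaluate the bare value, and `reTick` the result, so register and instance names survive evaluation.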
- `appProp`: drop
- `caseLet`: drop
- `caseCon`: drop
- `caseCase`: drop; as mentioned earlier, it is an essential part of PE
- `caseElemNonReachable`: drop, but fold functionality into PE
- `elemExistentials`: drop, but fold functionality into PE
- `inlineNonRep`: drop
- `inlineOrLiftNonRep`: drop
- `typeSpec`: keep
- `nonRepSpec`: keep
- `etaExpansionTL`: keep
- `nonRepANF`: maybe keep; see `makeANF`
- `bindConstantVar`: drop
- `constantSpec`: keep
- `makeANF`: maybe keep. The "official" normal form as described in Baaij (2015) prescribes ANF, but we've already started to deviate from that quite a bit. Some parts of the `Netlist` stage still expect partial ANF (e.g. the RHS of an alternative must be a variable if the case-expression is a projection). So perhaps we can reduce `makeANF` to something simpler so that just these `Netlist` expectations are met. `makeANF` is also what enabled our `simpleCSE` pass, but when PE does CSE, we'll drop `simpleCSE`.
- `deadCode`: maybe keep; let's see if we make it part of PE (e.g. do GC on the evaluator's heap)
- `topLet`: keep
- `recToLetRec`: keep; this might be another reason to keep the full `makeANF`, as `recToLetRec` expects expressions in the "official" normal form of Baaij (2015)
- `inlineWorkFree`: drop
- `inlineSmall`: drop
- `simpleCSE`: drop; CSE should become part of PE
- `reduceConst`: drop
- `reduceNonRepPrim`: drop; PE should be unrolling primitives in general
- `caseFlat`: keep
- `disjointExpressionConsolidation`: keep
- `removeUnusedExpr`: keep
- `inlineCleanup`: keep
- `flattenLet`: keep
- `splitCastWork`: drop; PE should handle casts
- `inlineCast`: drop
- `caseCast`: drop; PE should propagate casts
- `letCast`: drop; PE should propagate casts
- `eliminateCastCast`: maybe keep; see if functionality can be folded into PE
- `argCastSpec`: maybe keep; see `eliminateCastCast`
- `etaExpandSyn`: keep
- `appPropFast`: drop
- `seperateArguments`: keep
- `xOptimize`: drop; make part of PE

- A `Trie` from `Term` to `HeapPointer`, so we can look up whether equal/equivalent partially evaluated expressions already live on the heap before we start evaluating them, and if so, just replace them with a reference to the heap variable. Don't forget to put an update frame on the stack, so that after evaluation the expression is replaced by the heap variable reference again. We can use e.g. http://hackage.haskell.org/package/generic-trie to get us started on such a `Trie`, and only roll our own if it turns out to be the bottleneck.
- For `register`-like things we must not forget to add `setName` ticks based on the current closest let-binder when we put arguments on the heap to defer evaluation. Either that, or give the heap binder the same `String` part as the closest let-binding.

Relevant literature
Remarks with regard to the above literature: their QoR metrics talk about allocation and code size, which don't really apply to our use case; we care about eventual circuit size. This means we should make different trade-offs than the above literature, but what's nice is that they do point out where such trade-offs/decisions can be made. Additionally, we should expect to be able to unroll all recursion; any recursion that cannot be fully unrolled basically isn't a structural description of a circuit (which is the view that Clash has on Haskell programs). We have to check how this aspect will influence our termination methods: I think we should still have termination measures in place; Clash shouldn't loop forever when given an incorrect structural hardware description.
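The `Trie`-from-`Term` idea under "Various thoughts" could be sketched as follows, with `Data.Map` standing in for a real trie (e.g. `generic-trie`) and toy `Term`/`HeapPointer` types; none of these names are Clash's actual ones:

```haskell
import qualified Data.Map.Strict as Map

-- Sketch of the memoisation idea: Data.Map stands in for a real trie,
-- and Term/HeapPointer/eval are toy stand-ins, not Clash's types.
data Term = Lit Int | Add Term Term
  deriving (Eq, Ord, Show)

type HeapPointer = Int
type Heap = Map.Map HeapPointer Term

-- Before evaluating a term, look it up in the memo table: if an equal
-- term was already evaluated, reuse its heap pointer instead of
-- duplicating the work (and, eventually, the circuit).
eval
  :: Map.Map Term HeapPointer -- memo table: Term -> HeapPointer
  -> Heap
  -> Term
  -> (HeapPointer, Map.Map Term HeapPointer, Heap)
eval memo heap tm =
  case Map.lookup tm memo of
    Just p  -> (p, memo, heap)  -- already on the heap: share it
    Nothing ->
      let v = whnf tm
          p = Map.size heap     -- fresh pointer
      in (p, Map.insert tm p memo, Map.insert p v heap)
 where
  whnf (Add a b) = Lit (lit (whnf a) + lit (whnf b))
  whnf t         = t
  lit (Lit i) = i
  lit t       = error ("not a literal: " ++ show t)
```

Evaluating the same expression twice yields the same heap pointer and a single heap binding, which is exactly the sharing we need to avoid circuit duplication.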
Hopefully fixes issues:
#121
#236
#300
#613