Open christiaanb opened 4 years ago
In a mildly annoying turn of events, the most recent `generic-trie` on Hackage doesn't support `base-4.13.0.0` (used by GHC 8.8.1). As the most recent commit on that project's GitHub does support it, a reference to that specific commit has been added to the `cabal.project` file as a source repository. This should be removed as soon as `generic-trie` is next released on Hackage.
Update: Another paper, *Taming Code Explosion in Supercompilation*, seems to offer a reasonable approach to reducing the size of code after supercompilation. It might be adaptable to reducing circuit duplication in designs.
Update: the latest PR for this issue (#1288) provides the common semantics for the evaluator, going all the way to beta-normal eta-long form (NF). However, it should be noted that this is not the full partial evaluator as intended, but simply a good place to stop and get feedback. As I see it, there are three problems to be solved before the level of whole program optimization we're after is achieved:
Evaluating primitives
The new evaluator does not yet have any rules for evaluating primitives. A lot of this work was started in the `partial-evaluator-2` branch and should be tidied up and moved over. In particular, we want to keep the sensible coercions between `Value` and plain Haskell types.
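As a rough sketch of what such primitive rules and coercions could look like (all names here, `Value`, `evalPrim`, `fromValue`, `toValue`, are hypothetical stand-ins, not Clash's actual evaluator API):

```haskell
-- Hypothetical sketch only: Value, evalPrim, fromValue, toValue are
-- illustrative stand-ins, not Clash's actual evaluator API.

-- A tiny stand-in for the evaluator's value type.
data Value
  = VLit Integer         -- a fully evaluated literal
  | VPrim String [Value] -- a primitive stuck on unknown arguments
  deriving (Eq, Show)

-- Sensible coercions between Value and plain Haskell types.
fromValue :: Value -> Maybe Integer
fromValue (VLit i) = Just i
fromValue _        = Nothing

toValue :: Integer -> Value
toValue = VLit

-- A primitive rule: evaluate in Haskell when all arguments are
-- statically known, otherwise leave the application as a stuck Value.
evalPrim :: String -> [Value] -> Value
evalPrim "GHC.Num.+#" [x, y]
  | Just a <- fromValue x
  , Just b <- fromValue y
  = toValue (a + b)
evalPrim nm args = VPrim nm args
```

The point of the coercions is that each primitive rule can be written against plain Haskell types, with a single projection/injection pair at the boundary.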
Unrolling recursion
We want to be able to unroll and evaluate recursive definitions in the partial evaluator. However, this is not possible in every case (Haskell is not a total language, and some terminating Haskell programs may require non-strict evaluation to converge on a value). We could implement something more naive here, but I think a better way would be to write a termination analyser to be run before normalisation (as we want to be able to partially evaluate as far as possible). The foetus termination checker of Andreas Abel looks to be the best fit for this.
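A minimal illustration of the structural-descent idea behind such a termination check (the `Expr` type and `terminates` function are invented for this sketch; Abel's actual foetus checker is more general, tracking size-change information across mutually recursive call graphs):

```haskell
-- Invented sketch of a structural-descent termination check; the real
-- foetus checker does size-change analysis over whole call graphs.
data Expr
  = Var String
  | Pred Expr          -- structurally smaller (e.g. the tail of a vector)
  | Call String [Expr] -- a (possibly recursive) call
  | Lit Integer
  deriving (Eq, Show)

-- Does every recursive call to `f` pass a structurally smaller value
-- for the decreasing argument at position `i`?
terminates :: String -> Int -> Expr -> Bool
terminates f i = go
 where
  go (Call g args)
    | g == f    = decreasing (args !! i) && all go args
    | otherwise = all go args
  go (Pred e)   = go e
  go _          = True

  -- "Smaller" here means: wrapped in at least one Pred.
  decreasing (Pred _) = True
  decreasing _        = False
```

A definition that passes such a check can be unrolled safely; everything else would be left alone (or rejected) rather than risking a non-terminating compiler.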
Removing transformations
As mentioned above, several transformation passes become redundant when the partial evaluator is implemented as intended. Removing these and folding their behaviour into the semantics of the evaluator is a fairly large PR in itself, and one which would benefit from being reviewed separately.
Clash's existing compilation method of exhaustive rewriting, while significantly improved over the years, turns out to be a large performance bottleneck, especially when forced to unroll loops. Specifically, it's the successive traversal of the expression ADT to achieve compile-time evaluation in a very roundabout way that is hurting us the most.
The idea is to repurpose Clash's existing WHNF evaluator into an NF evaluator, run as a first pass of the compiler. Given that it was specifically designed for compile-time evaluation, it is much faster at this job than the exhaustive rewrite system. An early prototype of this approach did, however, show a lot of circuit duplication. So the final implementation should avoid duplicating circuits.
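To illustrate the WHNF/NF distinction on a toy term language (this is not Clash's `Term` type; the point is that an NF evaluator must also normalise underneath constructors, and in the real evaluator underneath binders, where a WHNF evaluator stops at the outermost one):

```haskell
-- Illustration only: a toy term language showing why a WHNF evaluator
-- stops too early for whole-program evaluation.
data T
  = Lit Int
  | Add T T    -- a strict primitive
  | Pair T T   -- a lazy constructor
  deriving (Eq, Show)

-- WHNF: reduce until the outermost constructor, then stop.
whnf :: T -> T
whnf (Add a b) =
  case (whnf a, whnf b) of
    (Lit x, Lit y) -> Lit (x + y)
    (a', b')       -> Add a' b'
whnf t = t

-- NF: additionally normalise underneath constructors.
nf :: T -> T
nf t = case whnf t of
  Pair a b -> Pair (nf a) (nf b)
  t'       -> t'
```

On `Pair (Add (Lit 1) (Lit 2)) (Lit 3)`, `whnf` returns the term unchanged (it stops at `Pair`), while `nf` also folds the addition inside.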
Various thoughts
- `Value` should contain `Tick`ed and `Cast`ed values; this way the evaluator can preserve both. Ticks must be maintained because they contain register and instance names. Adding these values means we have to update the primitive evaluation rules (https://github.com/clash-lang/clash-compiler/blob/master/clash-ghc/src-ghc/Clash/GHC/Evaluator.hs), as they need to be able to look through ticks (a la `coreView`). Additionally, some casts may have to be translated to `Prim`s, such as casts from `Integer` to `Unsigned`.
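A hypothetical sketch of what keeping ticks (and casts) in `Value` could look like; the constructor and function names are illustrative, not Clash's:

```haskell
-- Hypothetical sketch: these constructors and helpers are illustrative,
-- not Clash's actual Value type.
data Ty = IntegerTy | UnsignedTy Int
  deriving (Eq, Show)

data Value
  = VLit Integer
  | VTick String Value -- e.g. a setName/instance-name tick
  | VCast Ty Ty Value  -- a cast from one type to another
  deriving (Eq, Show)

-- Primitive rules need to look through ticks (a la coreView) without
-- discarding them from the final result: peel them off...
collectTicks :: Value -> ([String], Value)
collectTicks (VTick nm v) = let (ts, v') = collectTicks v in (nm : ts, v')
collectTicks v            = ([], v)

-- ...and wrap them back around the evaluated result.
reTick :: [String] -> Value -> Value
reTick ts v = foldr VTick v ts
```

A primitive rule would then call `collectTicks`, evaluate the bare value, and `reTick` the result, so register and instance names survive evaluation.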
- `appProp`: drop
- `caseLet`: drop
- `caseCon`: drop
- `caseCase`: drop; as mentioned earlier, it is an essential part of PE
- `caseElemNonReachable`: drop, but fold functionality into PE
- `elemExistentials`: drop, but fold functionality into PE
- `inlineNonRep`: drop
- `inlineOrLiftNonRep`: drop
- `typeSpec`: keep
- `nonRepSpec`: keep
- `etaExpansionTL`: keep
- `nonRepANF`: maybe keep; see `makeANF`
- `bindConstantVar`: drop
- `constantSpec`: keep
- `makeANF`: maybe keep. The "official" normal form as described in Baaij (2015) prescribes ANF, but we've already started to deviate from that quite a bit. Some parts of the `Netlist` stage still expect partial ANF (e.g. the RHS of an alternative must be a variable if the case-expression is a projection). So perhaps we can reduce `makeANF` to something simpler so that just these `Netlist` expectations are met. `makeANF` is also what enabled our `simpleCSE` pass, but when PE does CSE, we'll drop `simpleCSE`.
- `deadCode`: maybe keep; let's see if we make it part of PE (e.g. do GC on the evaluator's heap)
- `topLet`: keep
- `recToLetRec`: keep; this might be another reason to keep the full `makeANF`, as `recToLetRec` expects expressions in the "official" normal form of Baaij (2015)
- `inlineWorkFree`: drop
- `inlineSmall`: drop
- `simpleCSE`: drop; CSE should become part of PE
- `reduceConst`: drop
- `reduceNonRepPrim`: drop; PE should be unrolling primitives in general
- `caseFlat`: keep
- `disjointExpressionConsolidation`: keep
- `removeUnusedExpr`: keep
- `inlineCleanup`: keep
- `flattenLet`: keep
- `splitCastWork`: drop; PE should handle casts
- `inlineCast`: drop
- `caseCast`: drop; PE should propagate casts
- `letCast`: drop; PE should propagate casts
- `eliminateCastCast`: maybe keep; see if functionality can be folded into PE
- `argCastSpec`: maybe keep; see `eliminateCastCast`
- `etaExpandSyn`: keep
- `appPropFast`: drop
- `seperateArguments`: keep
- `xOptimize`: drop; make part of PE

- A `Trie` from `Term` to `HeapPointer`, so we can look up whether equal/equivalent partially evaluated expressions already live on the heap before we start evaluating them, and if so, just replace them with a reference to the heap variable. Don't forget to put an update frame on the stack, so that after evaluation the expression is replaced by the heap variable reference again. We can use e.g. http://hackage.haskell.org/package/generic-trie to get us started on such a `Trie`, and only roll our own if it turns out to be the bottleneck.
- For `register`-like things we must not forget to add `setName` ticks based on the current closest let-binder when we put arguments on the heap to defer evaluation. Either that, or give the heap binder the same `String` part as the closest let-binding.

Relevant literature
Remarks with regard to the above literature: their QoR metrics talk about allocation and code size, which don't really apply to our use case; we care about eventual circuit size. This means we should make different trade-offs than the above literature, but what's nice is that they do point out where such trade-offs/decisions can be made. Additionally, we should expect to be able to unroll all recursion; any recursion that cannot be fully unrolled basically isn't a structural description of a circuit (which is the view that Clash has on Haskell programs). We have to check how this aspect will influence our termination methods: I think we should still have termination measures in place; Clash shouldn't loop forever when given an incorrect structural hardware description.
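The `Trie`-from-`Term` idea under "Various thoughts" could be sketched as follows, with `Data.Map` standing in for a real trie (e.g. `generic-trie`) and toy `Term`/`HeapPointer` types; none of these names are Clash's actual ones:

```haskell
import qualified Data.Map.Strict as Map

-- Sketch of the memoisation idea: Data.Map stands in for a real trie,
-- and Term/HeapPointer/eval are toy stand-ins, not Clash's types.
data Term = Lit Int | Add Term Term
  deriving (Eq, Ord, Show)

type HeapPointer = Int
type Heap = Map.Map HeapPointer Term

-- Before evaluating a term, look it up in the memo table: if an equal
-- term was already evaluated, reuse its heap pointer instead of
-- duplicating the work (and, eventually, the circuit).
eval
  :: Map.Map Term HeapPointer -- memo table: Term -> HeapPointer
  -> Heap
  -> Term
  -> (HeapPointer, Map.Map Term HeapPointer, Heap)
eval memo heap tm =
  case Map.lookup tm memo of
    Just p  -> (p, memo, heap)  -- already on the heap: share it
    Nothing ->
      let v = whnf tm
          p = Map.size heap     -- fresh pointer
      in (p, Map.insert tm p memo, Map.insert p v heap)
 where
  whnf (Add a b) = Lit (lit (whnf a) + lit (whnf b))
  whnf t         = t
  lit (Lit i) = i
  lit t       = error ("not a literal: " ++ show t)
```

Evaluating the same expression twice yields the same heap pointer and a single heap binding, which is exactly the sharing we need to avoid circuit duplication.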
Hopefully fixes issues:
#121
#236
#300
#613