Synthesis has extremely bad runtime complexity in a few key places

While working on a project, I've repeatedly encountered some odd cases with the Clash compiler that I've tried to boil down into some examples. The basic gist of it is that the compiler is probably quadratic in a few places or just has some compounded bad complexity in others. These examples were triggered by automatically generated code which we've fed into the Clash compiler.

(Christiaan and I have already worked some of this out in private, but this is the public report to keep track of things, as they're otherwise just endless private chat conversations.)

There are two examples I have that I'll try to show off, and then do some analysis on the problems. I'll just post the first one here.

Scenario: A million little top level CAFs

Here's a single scenario I have: I have a bunch of top level functions, which are black boxes, and they are used by a bunch of other top level functions as well. The blackbox functions all have a known type and an accompanying Test.json, and the module overall looks like this:

module Test where

bb0 :: () -> Signal (Maybe ...)
bb0 = pure Nothing
{-# NOINLINE bb0 #-}

bb1 :: Signal (Maybe ...) -> Signal (Maybe ...)
bb1 = pure Nothing
{-# NOINLINE bb0 #-}

bb2 :: Signal (Maybe ...) -> Signal (Maybe ...)
bb2 = pure Nothing
{-# NOINLINE bb0 #-}

bb3 :: Signal (Maybe ...) -> Signal (Maybe ...)
bb3 = pure Nothing
{-# NOINLINE bb0 #-}

Next, I have a bunch of top level CAFs that call these blackboxes and "pass the results around". There is a final top-level entity with a TopEntity annotation that ties it all together:

foo0 = bb0 ()
foo1 = bb1 foo0 
foo2 = bb2 foo1 foo0
... many more definitions ...
fooN = bbN ...

{-# ANN top (defTop { ... }) #-}
top :: () -> Signal (Maybe ...)
top = \() -> fooN

The real example has hundreds (or thousands) of expressions using this same exact pattern.

Analysis

So there's a lot going on here, but the basic gist of it that I've seen after a bunch of experiments is that, I think, callGraph, used in several places, has bad complexity. callGraph is used in many different places in order to determine a global set of called functions, given a root binding. e.g. it's used to determine if a function is recursive, or a set of functions is mutually recursive, or if a term has any other callers and can be inlined, etc.

Here is the definition of callGraph from CLaSH.Normalize.Utils:

-- | Create a call graph for a set of global binders, given a root
callGraph :: [TmName] -- ^ List of functions that should not be inspected
          -> HashMap TmName (Type,SrcSpan,Term) -- ^ Global binders
          -> TmName -- ^ Root of the call graph
          -> [(TmName,[TmName])]
callGraph visited bindingMap root
  | Just rootTm <- HashMap.lookup root bindingMap
  = let  used   = Set.toList $ Lens.setOf termFreeIds (rootTm ^. _3)
         node   = (root,used)
         other  = concatMap (callGraph (root:visited) bindingMap) (filter (`notElem` visited) used)
    in   node : other
callGraph _ _ _ = []

This approach is naive but the complexity appears to be rather bad. For one it does an O(n) conversion in the size of the used variables for every single node. I was also suspicious of the complexity of concatMap somewhat, but I haven't pinned this down precisely. A similar problem exists elsewhere, which is in mkCallTree: https://github.com/clash-lang/clash-compiler/blob/c9d78125c474cde31fa5d0234f00e362368c85ee/clash-lib/src/CLaSH/Normalize.hs#L224

This function is basically a copy of callGraph but it changes the underlying type.

Even worse is here: https://github.com/clash-lang/clash-compiler/blob/c9d78125c474cde31fa5d0234f00e362368c85ee/clash-lib/src/CLaSH/Driver.hs#L362

In this section, checkNonRecursive calls callGraph, but then immediately calls cleanupGraph, which invokes mkCallTree, which is almost identical to callGraph itself. I believe fixing both of these -- or better, unifying the creation of the call tree and the call graph -- would be good.

Working around it.

I've been able to work around this to a limited extent using a set of patches. These are available on this branch here, but they're all horrid hacks. Here are some descriptions:

Disable the lambdaDrop pass. Commit. This disables the lambdaDrop pass which is only needed for certain mutually recursive binders. callGraph seems to explode here. Ideally, GHC would be fixed so we could remove this code, but this probably isn't happening soon unless we worked it into 8.2.2 somehow (by asking nicely).
Disable the cleanupGraph pass. Commit. This disables the cleanupGraph pass. Similar reason as before. However, this pass is really really important, because it ends up removing useless RTL modules before synthesis by inlining "trivial" wrappers. Without this, an example like the above with just a few hundred terms can generate thousands of VHDL files.
Check size first in smallInline. Commit. It's possible to disable inlining of small functions by setting the inline threshold to 1 (i.e. "all functions are too big and should not be inlined"). The problem here is that isRecursiveBndr is used, which again touches callGraph, but we check recursion first before checking the term size, even though there's really no need. So this is more of a workaround for the complexity problem. (However, I think it's also a legitimate patch, since there's no need to run the analysis if the term size is too big). Source)

I also made some changes to simplify callGraph itself and an attempt to improve its complexity:

Simplify type of callGraph. Commit. This is an easy and obvious patch that can be committed upstream, and is just a cleanup.
Improve callGraph complexity. Commit. This is an attempt to improve callGraph by a few constant factors, by removing lots of expensive conversions and only doing a final one at the end. This did seem to improve things in my profile, but...

Bad complexity for normalization

Here's an obvious O(n^2) case: https://github.com/clash-lang/clash-compiler/blob/529d5df2e97bd87366014ed3f09dd3aff048c0a5/clash-lib/src/CLaSH/Normalize.hs#L142

First, we get all the keys of the previously normalized things. Then, we filter out all out all of the previously used keys from the binders of the current thing -- but this is clearly O(n^2) where n = # number of top level binders -- upon visiting a single binding, we effectively have to visit every prior one by filtering linearly.

This can also be fixed by using something with better characteristics like a HashSet, but the naive approach is still O(n) for just using HashSet.difference, so a better approach will be needed. I've got a WIP branch but I have not got it working.

Yet another problem...

There is a third problem, however, it's trickier and requires a separate example (close to the above, but not exactly the same), so I'll put it in another ticket. This one requires having a billion terms not as top-level CAFs, but a single CAF (not a blackbox) which has many terms.

A bunch of profiling details and samples should go here

My stupid USB drive might have died so I have all these on my laptop. I'll copy them over or something. I also have some examples you can use to reproduce these various issues.

Sooooooo... I ended up going down many rabbit holes in an attempt to further understand this (I didn't really understand it so well, obviously), but I found my current culprit. Here's my IRC dump from earlier, with a major memory leak I discovered for the particular code I had, after changing the code generator. With this, Clash can actually compile the circuits I have in a reasonable amount of memory, finally.

The input code in question

I had another case like the above, except instead of having a lot of top-level bindings, I have code like this (this was a different approach used to work around aforementioned complexity bugs)

module Test where

bb0 :: () -> Signal (Maybe ...)
bb0 = pure Nothing
{-# NOINLINE bb0 #-}

bb1 :: Signal (Maybe ...) -> Signal (Maybe ...)
bb1 = pure Nothing
{-# NOINLINE bb0 #-}

bb2 :: Signal (Maybe ...) -> Signal (Maybe ...)
bb2 = pure Nothing
{-# NOINLINE bb0 #-}

bb3 :: Signal (Maybe ...) -> Signal (Maybe ...)
bb3 = pure Nothing
{-# NOINLINE bb0 #-}

{-# ANN top (defTop { ... }) #-}
top :: () -> Signal (Maybe ...)
top = \() ->
  let foo0 = bb0 ()
      foo1 = bb1 foo0 
      foo2 = bb2 foo1 foo0
      ... many more definitions ...
      fooN = bbN ...
  in fooN

So instead, I floated all the bindings inward to a let, inside the body of the top entity. I have this same pattern with thousands of black boxes and local bindings.

The problem

It's explained as thus:

14:20:27 <thoughtpolice> christiaanb: I think I found a major source of the leak in this case, after so much dead-ending, but I don't know how to solve it more generally. bindConstantVars takes an absolutely huge amount of RAM and time, and if I simply eliminate it for my case (which is OK), I can get away with Clash "only" using 16GB of RAM.
14:22:55 <thoughtpolice> When you have a bunch of let binders like 'let x = f (); g = h x x; z = f g; ...' over and over again, where every binder essentially is used linearly by other binders -- bindConstantVar trips on all of these, because they're obviously constant, so it inlines and propogates every single binder into every other binder. So it inlines 'x' into 'g'. Then in also inlines 'g' into 'z', and so on and so forth.
14:23:15 <thoughtpolice> But that means that every single time you do this, as bindConstantVars progresses, it has to analyze and retain progressively more-massive terms
14:23:30 <thoughtpolice> By the time you've done this process with like 900 let bindings, bindConstantVar is being called on RHS definitions that are like
14:23:46 <thoughtpolice> 10,000 terms themselves. It just takes ungodly amount of time to look at all the terms to ensure it's constant.
14:25:06 <thoughtpolice> Like, at the point where the compiler uses so much RAM that it just effectively stops doing anything, you have a term like 'let foo_bar123 = f (g (h (f (g (h (f (g (h ...)))))' but that RHS is really like 10,000 nested applications.
14:25:38 <thoughtpolice> And because that binder IS constant, it cannot terminate early. It has to look at the entire term and say, definitively, "Yes it is constant, so now you inline foo_bar123 into its call sites. It's a constant let binding, of course"
14:25:40 <thoughtpolice> And the cycle repeats
14:26:35 <thoughtpolice> Except it's even worse than that
14:26:42 <thoughtpolice> foo_bar123 might be used immediately after in a term like
14:26:52 <thoughtpolice> let foo_bar124 = some_func foo_bar123 foo_bar123

So essentially, by inlining 'trivial' bindings regardless of their size, bindConstantVar ends up generating absolutely massive programs, given the original input program see above. And in order to determine if a binding can be propagated safely, you must look at the whole term. So as each successive let binder gets larger, the memory usage grows and grows.

Naturally this leads to an immense explosion of code that cannot be rectified. GHC itself of course takes the size of a binder into account when inlining; there is a point where it will stop. Here's GHC's logic for doing this:

15:21:24 <thoughtpolice> christiaanb: FWIW, it might be OK to inline some constant bindings I think, the real problem ofc is that there's no threshold.
15:22:01 <thoughtpolice> Well, that's the problem if you want to inline some things. If you don't then it doesn't matter. In theory the synthesis tool may make up for it.
15:22:15 <thoughtpolice> But, if you want to read, here's how GHC calculates "unfolding size"
15:22:30 <thoughtpolice> Which is, broadly speaking, the basic heuristic of whether something should get inlined (more or less) based on its "size"
15:22:31 <thoughtpolice> https://github.com/ghc/ghc/blob/a273c735ca268988ee1550c248ba88b57a227cb9/compiler/coreSyn/CoreUnfold.hs#L391

Clash currently has no such logic to check the 'size' of a binder.

The fix

It turns out, bindConstantVar is really only used to clean up the code for prettier output, according to @christiaanb. It's not strictly necessary to propagate bindings everywhere. And ideally the synthesis tool will take care of it (I hope).

So for now, the easiest fix is this commit:

clash-ghc: add a flag to disable bindConstantVar. Commit.

There's hopefully a better strategy to allow us to generate "nice" code more easily than this. At the very least, this particular one can be somewhat fixed by taking binding size into account and not inlining extremely large ones. This would prevent every single let binding from being propagated endlessly.

clash-lang / clash-compiler