Mikolaj / horde-ad

Higher Order Reverse Derivatives Efficiently - Automatic Differentiation library based on the paper "Provably correct, asymptotically efficient, higher-order reverse-mode automatic differentiation"
BSD 3-Clause "New" or "Revised" License

Add a sharing mechanism to Ast terms #95

Open Mikolaj opened 1 year ago

Mikolaj commented 1 year ago

I'm now leaning towards replicating the trick we have in Delta terms (Tom's Tree0 method --- a stamp per term node). Then we could add the same stamp to the Delta expressions created from Ast terms.

There are benefits. The user doesn't have to worry about anything: an ordinary Haskell let implies this permanent sharing (not only temporary sharing in memory). We also get stamps in fewer Delta terms, which incurs no risk of a sharing violation as long as the interpretation of Ast terms doesn't duplicate newly created Delta terms (though it may duplicate Delta terms received from interpreted Ast subterms).

There are some drawbacks. We get many stamps in Ast terms: one stamp per Ast node. The interpretation of Ast in Tensor can't just have one extra case for AstLet, but instead needs to keep a table of interpreted subterms. Moreover, interpretation in an ADVal instance of Tensor has to pass on the stamps, so maybe this interpretation can't share code with all the others. I haven't thought through vectorization nor simplification, but given that they can incur duplication, we can't introduce sharing after they are complete, so they probably need to keep tables as well. I'm not sure we can get away with stamps only in tensor terms, given that integer terms suffer from blowup when gathers are fused and consequently large terms are substituted for integer variables.
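For illustration, a minimal sketch of the stamp-per-node idea (all names here are hypothetical and the real Ast is typed and much richer):

```hs
import qualified Data.Map.Strict as M

newtype AstId = AstId Int
  deriving (Eq, Ord, Show)

-- One stamp per node, mirroring the Tree0 stamps in Delta terms.
data Ast r = Ast AstId (AstNode r)

data AstNode r = AstConst r | AstPlus (Ast r) (Ast r)  -- etc.

-- The interpreter can't recurse blindly; it memoises on the stamp,
-- so a subterm shared under one AstId is interpreted only once.
interpret :: Num r => M.Map AstId r -> Ast r -> (M.Map AstId r, r)
interpret memo (Ast i node) =
  case M.lookup i memo of
    Just v  -> (memo, v)
    Nothing ->
      let (memo', v) = case node of
            AstConst c  -> (memo, c)
            AstPlus a b -> let (m1, va) = interpret memo a
                               (m2, vb) = interpret m1 b
                           in (m2, va + vb)
      in (M.insert i v memo', v)
```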

Mikolaj commented 1 year ago

BTW, in the YOLO paper, they seem to have a sharing preservation mechanism as well (section 5.1.2). I wonder how complete it is. Is it simpler/better than ours in Delta expressions? Given that we now do source transform, can we use it, too? Or is our source transform too late, after the unrolling of Haskell control flow and datatypes duplicates terms too much already? If so, would we need to either transform the raw Haskell or make the user write in a single-assignment form or something similar (perhaps using the AstLet construct introduced for that)?

tomsmeding commented 1 year ago

In my reading, YOLO section 5.1.2 doesn't so much describe a sharing preservation mechanism as describe making sharing that was already there, visible (!) in the AST, also visible through a computational operation: dup. They start with

The non-linear fragment of Linear A allows variables to be referenced multiple times

which I read as saying: it has let and in the body of such lets, variables can be referenced multiple times. Hence the sharing was already visible without looking into memory representations.

So it seems YOLO just uses AstLet to indicate sharing always. (The dup doesn't indicate sharing, it indicates duplication as a fictional operational step in order to introduce a natural place for its dual, the very much non-fictional addition. Similarly drop is a fictional operational step introduced in order to be a natural place for the very non-fictional allocation of a zero value that is its dual.)

Why can't we? I feel a let-based representation is much more flexible to work with than IDs in the tree; the latter is handy when wanting to consume a tree with sharing and not doing too much work in the process. But any kind of program optimisation that would do something with let-bindings might be weird in the trees-with-IDs representation. There must be a reason that nobody does that. :D

Or maybe I'm wrong and that will work fine.

Mikolaj commented 1 year ago

Oh, I see, so they assume their program already has the sharing marked, and they just preserve it through the transformations. That's no help, unless we want horde-ad users to mark all sharing with explicit let, but that sounds cumbersome and risky (users are lazy). And automatically wrapping each subexpression in an honest let x = e1 in e2 is probably prohibitive wrt efficiency.

Regarding how to denote sharing, we have a compromise in our Delta expressions: the stamp-per-node is introduced by a fake let constructor:

LetR :: NodeId -> DeltaR n r -> DeltaR n r

which is a bit costly, but the ids don't clutter the rest of the syntax. The important decision is, though, that surface syntax has no impact on where LetR appears. The user doesn't indicate it in any way. But you are right, the ids are still a pain for the implementer.
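(For the record, a hedged sketch of the mechanism: a smart constructor that stamps every freshly built Delta term, so the surface syntax never mentions it. newNodeId is a stand-in, not horde-ad's actual API.)

```hs
-- hypothetical impure fresh-id supply (in reality an atomic counter
-- behind unsafePerformIO or a monadic supply)
newNodeId :: () -> NodeId
newNodeId = undefined

-- every Delta-building smart constructor routes through this wrapper,
-- so each node gets a stamp and the user never writes LetR themselves
shareDelta :: DeltaR n r -> DeltaR n r
shareDelta d = LetR (newNodeId ()) d
```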

Mikolaj commented 1 year ago

The problem with AstLet AstVarName (Ast n r) (Ast m r) is that I can't easily vectorize it. Let's say I have AstBuild1 k (\i -> AstVar var). To proceed I must look up var in an environment I have to pass along (and so far I didn't need any). Moreover, I need to record the result of vectorizing in the environment. But that's not far from adding an ID to each node and keeping a table of vectorization results for each ID. What am I missing?

tomsmeding commented 1 year ago

One thing to the advantage of the let representation is that an environment automatically shrinks when variables fall out of scope, whereas a table with all IDs will never shrink (unless you do some magic with weak pointers that I'm not sure will even work here).

About vectorising let, I believe I already wrote the following in an email sometime?

build (\i. let x = s in t)
~>
let x = build (\i. s) in build (\i. t[x!i / x])

Yes, that's an actual substitution so there will be some recomputation, but only of indexing which is constant-time anyway and fairly cheap. And furthermore if you try to turn that substitution into a let-binding to preserve sharing, you end up with infinite recursion in applying this rule. :)
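A hedged sketch of that rule over a simplified, untyped Ast (all constructor and helper names here are assumptions for illustration, not horde-ad's API):

```hs
type VarName = String

data Ast
  = AstVar VarName
  | AstLet VarName Ast Ast
  | AstBuild1 Int VarName Ast   -- build1 k (\i -> t)
  | AstIndex Ast Ast            -- t ! i
  -- ... other constructors elided

-- capture-avoiding substitution, assumed given:
-- substitute x replacement term
substitute :: VarName -> Ast -> Ast -> Ast
substitute = undefined

-- vectorize k i t  plays the role of  build1 k (\i -> t)  pushed inwards
vectorize :: Int -> VarName -> Ast -> Ast
vectorize k i (AstLet x s t) =
  AstLet x
    (vectorize k i s)  -- x := build (\i. s)
    (vectorize k i (substitute x (AstIndex (AstVar x) (AstVar i)) t))
      -- t[x!i / x]; the duplicated indexing is constant-time
vectorize k i t = AstBuild1 k i t  -- fallback: other rules or residual build
```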

Not an answer to all your questions, though perhaps already interesting.

Mikolaj commented 1 year ago

That answers all my questions and more. Now I never end up with build (\i -> x), but with build (\i -> x!i), where i does not occur in the term mapped to x by the environment, so this vectorizes to a simple gather (and then simplifies to just x via extra rules for gather simplification). I now see I would (will?) need to do the same with the Tree0 sharing.

automatically shrinks when variables fall out of scope

That's a fair point. It's such a pity we can't trace Haskell lets.

Hah, while adding let to Tensor class, using our usual trick to avoid mentioning explicit variables, I ended up with

tlet :: TensorOf n r -> (TensorOf n r -> TensorOf m r) -> TensorOf m r

but that's an unnatural name and argument order for

tapply :: (TensorOf n r -> TensorOf m r) -> TensorOf n r -> TensorOf m r

or should it be called tappWithSharing? tshare? tapplyShare?

I guess I'm accidentally circling back to our discussion of tracing functions, though this is about tracing applications. In fact, tracing function abstractions would have the same effect (and type signature and code)

  tlambdaShare :: (TensorOf n r -> TensorOf m r)
               -> (TensorOf n r -> TensorOf m r)
  tlambdaShare = id                

where the Ast instances have tlambdaShare f a = AstLet freshVar a (f freshVar) and the user is going to write

sumTrainableInputsL
  :: forall r. Tensor r
  => TensorOf 1 r -> [TensorOf 1 r]
  -> TensorOf 1 r
sumTrainableInputsL = tlambdaShare $ \x -> tlambdaShare $ \weights ->
  let f :: TensorOf 1 r -> TensorOf 0 r
      f v = v `tdot0` x
  in tfromList $ map f weights

though in this example (that's the likely culprit for the 1M times slowdown) the second tlambdaShare only incurs overhead, because weights is used linearly. Let me try to implement the last variant of "let".

Mikolaj commented 1 year ago

The good news is that this speeds up the 1M test.

The bad news is:

With let .. in build .., I will have to redo my vectorization, which so far triggers only via

  tbuild1 = vectorize $ AstBuild1 ...

but now I need to run vectorization on the outermost let that defines a variable used in the build I'm after. In practice, I will probably need to vectorize the whole program, and perhaps I won't be able to vectorize depth-first any more, that is, starting with the innermost build1 and thereby assuming that build1 constructors never occur. It may also be hard to avoid vectorizing code that is not needed for any build (because code bound in a let may later be used in a build).

Mikolaj commented 1 year ago

The good news from above scales nicely --- with more parameters, the neural network runs 10 times faster. This 10x is from [edit: reducing and probably even totally eliminating] the exponential blowup already in InterpretAst, without either vectorization or Ast inside Delta. Here's the change:

https://github.com/Mikolaj/horde-ad/commit/eaefa2a22499baaca245ea135fbbd4ba54290b96

Mikolaj commented 1 year ago

The conclusion after profiling is that tlet totally eliminated the lack-of-sharing exponential blowup that manifested when interpreting a Tensor program in Ast and then interpreting the resulting Ast. That's the 1000x slowdown at the full size of the recent Mnist example.

The remaining 1000x slowdown, just as exponential, comes from the fact that in Delta evaluation we accumulate Ast terms in cotangent contribution maps. The delta expressions are perfectly shared, but the Ast terms inside them are not (they have tlet inside, but no sharing with terms from other delta expressions), so they are added to the maps with their heap sharing intact but with no explicit sharing stamps. During the second and last interpretation of Ast terms, the ones representing the gradient, the sharing on the heap does not prevent an exponential number of interpretation calls.

I wonder how to express the sharing of Ast terms residing in different cells of the cotangent contribution maps.

tomsmeding commented 1 year ago

I wonder how to express the sharing of Ast terms residing in different cells of the cotangent contribution maps.

Do you have a small example of an Ast that results in something being evaluated at least twice where it should be computed once? I think I'm getting what you say, but looking at an example will help a lot.

Re tlet: so horde-ad did not recognise and preserve sharing in the input program when converting to Ast, but now the user can indicate that they will be using some value multiple times via tlet, and this makes the Ast faithfully (assuming no user error) represent sharing? That's cool! And that also means, I think, that if we're willing to use stable names, as in e.g. data-reify¹, we could avoid making the user annotate shared values with tlet.


¹ If you're having trouble seeing how the f to use that library, I've figured it out at some point and can dig up my working example code. The output is a graph, that is: if FAst n t r is the structure functor of Ast such that Fix (FAst n t) ~= Ast n t (where data Fix f = In (f (Fix f))), then the output of data-reify will be (mostly) a value of type Map Id (FAst n t Id). Indeed, DeRef (Ast n t) should be FAst n t, if I'm not mistaken, and Id ~ Unique. Converting the graph form to let-expressions is an exercise for the reader (Accelerate has a publication on this).
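For a taste, a minimal example of data-reify on a toy untyped expression type (MuRef, DeRef, mapDeRef, reifyGraph and Graph are the library's actual API; the toy Expr/ExprF types are made up for illustration):

```hs
{-# LANGUAGE TypeFamilies #-}
import Data.Reify

-- a toy untyped expression and its structure functor
data Expr = Lit Int | Add Expr Expr
data ExprF r = LitF Int | AddF r r
  deriving Show

instance MuRef Expr where
  type DeRef Expr = ExprF
  mapDeRef _ (Lit n)   = pure (LitF n)
  mapDeRef f (Add a b) = AddF <$> f a <*> f b

main :: IO ()
main = do
  let shared = Add (Lit 1) (Lit 2)
      e = Add shared shared  -- sharing exists only on the heap so far
  Graph nodes root <- reifyGraph e
  print root
  mapM_ print nodes  -- the root AddF references the same node id twice
```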

tomsmeding commented 1 year ago

Err -- data-reify doesn't directly apply, of course, because that library assumes an untyped AST, whereas ours is typed. However, since ours is not scoped, merely simply typed (because it is in HOAS form -- the kind of Ast is Nat -> Type -> Type, not [Type] -> Nat -> Type -> Type), the full Accelerate-style hoas-conv, which produces let-binding form, is not quite required. A significantly simpler modification of data-reify will suffice to get it to work on typed terms, though I'm not sure how it will work on the higher-order terms.

Mikolaj commented 1 year ago

Do you have a small example of an Ast that results in something being evaluated at least twice where it should be computed once?

I don't, right now, but let me show you how the duplication arises. Let's say we have a delta term Add0 (Input0 0) (Input0 0):

https://github.com/Mikolaj/horde-ad/blob/fff6dd554be2b9bdf2ceef3d4accaab12f2743aa/simplified/HordeAd/Core/Delta.hs#L462-L468

Evaluating it with eval0 would result in a term AstOp PlusOp [c, c] in a map accumulating cotangents. These two copies of c are shared on the heap, but there's no explicit sharing between them. Now, let's scale this duplication up. Let's replace Input0 0 with Let0 n large_delta_term, with the same stamp identifier n in both summands. One would think the explicit sharing should materialise. But no. The computation of the cotangent contribution c (from large_delta_term) that eventually landed at key n was indeed performed only once, but at the key corresponding to the Add0 expression, we are going to see AstOp PlusOp [c, c], not explicitly shared.

But given that key n has been populated, at some point in the topological sorting order the n key of the cotangent contribution map is going to be looked up and eval0 s2 c2 d2 is going to be called, where c2 is our c + c. If now d2 is Add0, it adds c + c to c + c, all not shared. Exponential, despite perfect sharing of delta expressions.
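The effect is easy to model in miniature (a sketch, not horde-ad code): heap sharing keeps the term small, yet an interpreter that can't see the sharing pays exponentially.

```hs
data T = Leaf | Plus T T

-- n levels of perfectly heap-shared doubling: O(n) heap, 2^n paths
doubling :: Int -> T
doubling 0 = Leaf
doubling n = let c = doubling (n - 1) in Plus c c

-- without explicit sharing stamps, interpretation visits both copies
interp :: T -> Int
interp Leaf       = 1
interp (Plus a b) = interp a + interp b  -- 2^n calls for (doubling n)
```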

One extra complication is that the gradient is a tuple of terms, one per input, so we can't even syntactically express let c = ... in (gradient_1, ..., gradient_n), because Ast doesn't have tuples.

Mikolaj commented 1 year ago

Complete and utter success (but without vectorization and simplification, so far). 500x speedup by adding lets in precisely the places we predicted during the last meeting

https://github.com/Mikolaj/horde-ad/commit/3cce1b9afa128e5b8abc2a39bebd93f7a64e607c

Possibly around 100x speedup (but it's exponential and I changed problem size, so the scale has changed) via adding two lets to the Tensor instance for dual numbers and specifically the tdot0 operation

https://github.com/Mikolaj/horde-ad/commit/9c560b3eeb12df5f4e47c6bc685ed0d107776425

and then possibly a 100x speedup at a yet larger scale by adding lets to all the dual number arithmetic classes instances (which matter only for dual numbers with Ast inside)

https://github.com/Mikolaj/horde-ad/commit/8fd66129c031fb31d5551935786099b4d6f0dcc8

In the end, our second pipeline (the one producing an Ast gradient) that was 1M times slower is now a bit faster than the first pipeline and only a bit slower than a pipeline that goes from Tensor straight to ADVal Double without creating any Ast terms whatsoever.

A problem is that we now have several places where sharing is introduced and a couple of kinds of sharing (not even counting the AstInt sharing which was not needed yet) that jostle with each other.

Another problem is that after the AD phase we can no longer apply simplification via term rewriting using our rules. The rewriting at this point breaks (e.g., via substitution) global sharing (the fake single-argument let), even though it works fine with local sharing (the ordinary let x = y in z). However, we can do local simplification when performing the final interpretation and it does matter for speed (at least when sharing is broken and so there are millions of terms; I haven't now tried disabling it).

A minor problem is that memoization tables for global sharing can't be shared between different iterations of the gather or scatter functions (nor with the program outside them). Local sharing would not have this problem. The workaround is not to have costly tensor operations inside indexes but instead materialize their results in a tensor and only look them up in the index expression.

Mikolaj commented 1 year ago

I wonder about vectorization of let (local sharing) again. If I have

tlet x = v1 in (tbuild1 2 (x + x + v2) + tbuild1 3 (x + v3) + (x + v4) + (x + v5))

should I vectorize v1 at width 2 or 3? Or should I remove a portion of the sharing and vectorize instead

tlet x = v1 in tbuild1 2 (x + x + v2) + tlet x = v1 in tbuild1 3 (x + v3) + tlet x = v1 in ((x + v4) + (x + v5))

What if a build has 100 variables, and the tlet for each is a 1000 nodes up the term tree, where I need to find it and then replace the build term with 100 lets and stick the build at the end? That's costly and not expressible with a rewriting system. Instead of a whole-program sweep to find bindings for each variable occurring in a build, we'd probably rather keep all lets in an environment. But that's getting close to working with a graph or a DAG with a table representing node sharing.

I wonder if the global sharing fares any better. I start with

tbuild1 2 (tletR n v1 + tletR n v1 + v2) + tbuild1 3 (tletR n v1 + v3) + (tletR n v1 + v4) + (tletR n v1 + v5)

and let's focus on the first summand and vectorize it

tbuild1 2 (tletR n v1 + tletR n v1 + v2)
--> tbuild1 2 (tletR n v1) + tbuild1 2 (tletR n v1) + tbuild1 2 v2
--> tbuild1 2 v1 + tbuild1 2 v1 + tbuild1 2 v2

the rules preserve equality fine, but I lost the sharing with the other summands, which clearly won't rewrite the tletR n v1 in an equivalent way, so there's no hope of recovering the sharing. Even worse, I've also lost the local sharing between the two tbuild1 2 v1 terms that do preserve their equality through vectorization.

It seems I can't get away from using a table as the state of the vectorization procedure. Let's repeat vectorization of the following, but with a memoization table

tbuild1 2 (tletR n v1) + tbuild1 2 (tletR n v1)

I vectorize the first summand to tbuild1 2 v1 and memoize that vectorization of node n at width 2 results in this term. Vectorization of the second summand then goes for free. However, I've lost sharing again. I should generate a fresh name m and mark the result in the table with it and so obtain

tletR m (tbuild1 2 v1) + tletR m (tbuild1 2 v1)

and this seems to work fine, cheaply, not losing any sharing that can easily be preserved, and without inspecting the whole program or constructing 100-element lets.
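In code, a hedged sketch of that memo-table discipline (everything simplified and hypothetical): key the memo on the source node id and the build width, and stamp the result with a fresh id so later occurrences keep their sharing.

```hs
import qualified Data.Map.Strict as M
import Control.Monad.State

type Id = Int

data Ast = TLetR Id Ast | TBuild1 Int Ast | AstLeaf  -- simplified

-- the plain rule-based vectorisation worker, assumed given
vectorizePlain :: Int -> Ast -> Ast
vectorizePlain = undefined

-- memo key: (source node id, build width); results get a fresh stamp
vectorizeShared :: Int -> Id -> Ast -> State (M.Map (Id, Int) Ast, Id) Ast
vectorizeShared width n t = do
  (memo, _) <- get
  case M.lookup (n, width) memo of
    Just shared -> pure shared  -- second occurrence: a lookup, sharing kept
    Nothing -> do
      let t' = vectorizePlain width t
      m <- state (\(mm, c) -> (c, (mm, c + 1)))  -- fresh id m
      let shared = TLetR m t'
      modify (\(mm, c) -> (M.insert (n, width) shared mm, c))
      pure shared
```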

What did I miss?

Mikolaj commented 1 year ago

Oh, I see what I missed: the term bound in a let outside a build can't contain the build variables, so we don't need to touch it. Doh.

tomsmeding commented 1 year ago

Yay for sharing, I guess! :)

A problem is that we now have several places where sharing is introduced and a couple of kinds of sharing (not even counting the AstInt sharing which was not needed yet) that jostle with each other.

Yes and this is concerning, not only from a correctness and efficiency perspective (do we have all sharing now?), but also from a paper-writing perspective: how is the poor reader (and future developers of horde-ad) going to understand this mess?

Another problem is that after the AD phase we can no longer apply simplification via term rewriting using our rules. The rewriting at this point breaks (e.g., via substitution) global sharing (the fake single-argument let), even though it works fine with local sharing (the ordinary let x = y in z). However, we can do local simplification when performing the final interpretation and it does matter for speed (at least when sharing is broken and so there are millions of terms; I haven't now tried disabling it).

I'm not sure I follow your argument here. So rewriting + global sharing doesn't combine, but rewriting + local sharing works fine? So it should work if we have proper lets in our trees instead of Tree0 notation? Is there an impediment to doing so? (Do we need to actively convert something from global sharing to let-style sharing? If so that's going to be annoying, but possible.)

A minor problem is that memoization tables for global sharing can't be shared between different iterations of the gather or scatter functions (nor with the program outside them). Local sharing would not have this problem. The workaround is not to have costly tensor operations inside indexes but instead materialize their results in a tensor and only look them up in the index expression.

This I also don't follow. Is this still relevant re your progress on let vectorisation tonight? (Which is great, by the way -- I was stumped by your first message and then shared your 'Doh' upon reading your second one)

Mikolaj commented 1 year ago

Yay for sharing, I guess! :)

Yes. I knew about hash-consing, but I had no idea ad-hoc sharing changes so much (it removes three different sources of exponential runtime in this case, not counting the Delta sharing). For discovering the joys of sharing I blame Simon and his fruitful insistence on source transform in place of my concrete ADVal, which had only one kind of sharing explosion potential (explosion of Delta term size). Source transform is very powerful, but it's so fragile with respect to sharing.

Yes and this is concerning

We have to clean it up, but I'd like to first decide the fate of the user-visible sharing (tlet in Tensor). Ideally, that would be gone from the API and inserted automatically, even though it's going to cost us a multiplicative constant slowdown.

(do we have all sharing now?)

All but AstInt sharing, but it's fortunately not needed in any of the current tests and if we merge Ast and AstInt, which would be great, we'd just use the Ast sharing.

from a paper-writing perspective

Let's only mention the local tlet sharing there and elide the details of all the other sharing.

and future developers of horde-ad

poor buggers

So rewriting + global sharing doesn't combine, but rewriting + local sharing works fine?

Probably depends on the rules. Vectorization using your methods works great. Simplification is almost fine:

let x = fromList [t0, t1] in (x ! [0], x ! [0], ...)

does not simplify to let x = fromList [t0, t1] in (t0, x ! [0], ...), but is instead a normal form (if the three dots are), which is a bummer. But this is a valid trade-off, because t0 may be huge and shared with other terms, so we'd unshare it if we simplified. No tables are needed; it's simple and obviously correct. Some rules, e.g., in vectorization, require extra substitutions, but the normal rule for indexing of a gather (from our paper) already does, so that's nothing exceptional.

The global version

((tletR n (fromList [t0, t1])) ! [0], (tletR n (fromList [t0, t1])) ! [0], ...)

on the other hand, simplifies to

(tletR n2 t0, tletR n2 t0, ...)

adding tletR n2 t0 to the projection memoization table at key (n, [0]). It's great that it simplifies, but the price is the ever-growing memo table. BTW, we could do something similar with tlet using a memoization table and substitution, but one doesn't adopt local sharing just to end up mucking around with tables.

Which one is better is very context-dependent. But in general, I'd rather simplify less, but not pay an unbounded and unpredictable memory tax. I've been burned by hash-consing. Moreover, global sharing is complex, stateful and not obviously correct.

So it should work if we have proper lets in our trees instead of Tree0 notation?

Yes.

Is there an impediment to doing so? (Do we need to actively convert something from global sharing to let-style sharing? If so that's going to be annoying, but possible.)

Two impediments. 1. The sharing in the transpose code can't even easily be expressed in the local way. 2. We indeed have to convert, if we naively add automatic sharing to user code, by wrapping each argument of the Ast instance of Tensor with tletR.

Other than these two places, sharing can be written using the civilized local let from the get-go.

And Delta sharing is probably completely orthogonal and really well suited to the global method (Tree0), because we never rewrite Delta terms in the slightest and we do lots of global cross-referencing of them that is not captured by any syntax.

This I also don't follow. Is this still relevant re your progress on let vectorisation tonight? (Which is great, by the way -- I was stumped by your first message and then shared your 'Doh' upon reading your second one)

:)

No, I'm fine, the memoization being invalidated by substitutions or changing arguments to the gather function is only relevant for the global lets, while vectorization is intended to work on the local lets, at least for now and at least in the paper.

Mikolaj commented 1 year ago

I've just implemented vectorization of AstLet and it works fine. Not enough tests yet to benchmark, though, and not all speedups are ported to both kinds of let.

Mikolaj commented 1 year ago

I've added all the needed AstLet rules (two) to the paper.

I think the problem with expressing sharing of the gradient (doable) and of cotangent contributions during transposition (not really doable IMHO) is that the transposition can't be expressed as rewriting rules, because it's stateful. The best we can do is operational semantics, but the sharing we are after is the sharing inside the environments on the left hand side of the turnstile, not on the right hand side, in the terms. So, it can't be expressed in the syntax.

So probably the best we can do is keep the tletR global lets in the transposition code and convert them into local lets afterwards. However, it's not obvious this is possible. In the absence of lambda-abstraction global and local lets are not equivalent. E.g.

tfromList [tgather sh v (\i -> tLetInt n0 i), tgather sh u (\i -> tLetInt n0 i)]

can't be expressed with local lets. Fortunately, we don't have tLetInt yet and fortunately we most probably never insert such lets in the transpose code, but still...

  1. In the implementation of tlet for Ast, you might want to check whether the right-hand side is already a variable reference, in which case just return that variable reference expression immediately without creating an intermediate let. That will prevent many nested Add0 nodes from creating many redundant nested lets, where one would suffice.

I've implemented this and its analogue for the global lets. Apparently we don't have an evil enough test where this would be a bottleneck, but every bit of simplification helps, especially for readability when debugging.
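For illustration, the check could look like this over a simplified untyped Ast (freshVarName is an assumed fresh-name supply; the real tlet is a class method and generates variables differently):

```hs
-- simplified, as before:
-- data Ast = AstVar VarName | AstLet VarName Ast Ast | ...
tletAst :: Ast -> (Ast -> Ast) -> Ast
tletAst a f = case a of
  AstVar _ -> f a                       -- a bare variable: just reuse it
  _        -> let v = freshVarName ()   -- assumed fresh-name supply
              in AstLet v a (f (AstVar v))
```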

Mikolaj commented 1 year ago

I'm stumped. I've implemented the ordinary local lets for transpose

https://github.com/Mikolaj/horde-ad/commit/21ad6caaf0c83b8746b2ca466d73ca9da7ddb216

but the best I can do is a huge string of lets before a term with no lets in it. The same we'd get via sharing recovery from global lets, I imagine. What remains to do now is https://github.com/Mikolaj/horde-ad/commit/8fd66129c031fb31d5551935786099b4d6f0dcc8#r107726007, but I have no idea how to do this reasonably.

The problem is that we need to share primal values (Ast terms) across the primal and dual components of a dual number. So the Ast-level let is not enough, because it only scopes over Ast terms and the dual numbers are not Ast terms (a is an Ast):

data ADVal a = D a (Dual a)

The only solution that comes to mind is storing the let bindings as a third argument of the D constructor. But then all the code involving D suddenly becomes full of boilerplate (e.g., concatenating the lets lists from two D arguments of a function). And the end result is again a huge string of lets before a term with no lets in it, so the locality of local lets is lost anyway.
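To make the boilerplate concrete, a hedged sketch of that third-argument variant (all the types are simplified placeholders, not horde-ad's):

```hs
type AstVarName = Int
data AstSome = AstSome          -- stands for some wrapped Ast right-hand side
type Binding = (AstVarName, AstSome)
data Dual a = DualStub          -- opaque dual component, for the sketch only

dAdd :: Dual a -> Dual a -> Dual a
dAdd _ _ = DualStub

data ADVal a = D [Binding] a (Dual a)  -- bindings, primal, dual

-- every operation has to concatenate the binding lists of its arguments
instance Num a => Num (ADVal a) where
  D ls1 u u' + D ls2 v v' = D (ls1 ++ ls2) (u + v) (dAdd u' v')
  -- ... the same concatenation boilerplate recurs in every other method
```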

I'm considering ripping out local lets from transpose and adding an optional sharing recovery pass, needed whenever simplification term rewriting pass is to be performed before interpretation pass(es) or when the Ast gradient is used to build up some more complex program, perhaps to be differentiated again.

This seems like a serious drawback of dual numbers. I bet CHAD doesn't have that problem.

Mikolaj commented 1 year ago

Showing off Tapenade feature-parity (as soon as we also support Fortran and COBOL), pretty-printing (based on CHAD and using Tom's hindent trick) and how local sharing (tlet) affects global sharing (tletR):

foo :: RealFloat a => (a, a, a) -> a
foo (x, y, z) =
  let w = x * sin y
  in atan2 z w + z * w

rev foo:
\s0 x2 x3 x4 dt ->
  dlet (x4 * dt)
       (\x6 ->
    dlet (negate (x4 * (tconst 1.0
                        / (x4 * x4 + (x2 * sin x3) * (x2 * sin x3))))
          * dt)
         (\x7 ->
      dmkDomains
        (fromList
           [ dfromR (tfromList [])
           , dfromR (sin x3 * x7 + sin x3 * x6)
           , dfromR (cos x3 * (x2 * x7) + cos x3 * (x2 * x6))
           , dfromR
               (((x2 * sin x3)
                 * (tconst 1.0
                    / (x4 * x4 + (x2 * sin x3) * (x2 * sin x3))))
                * dt
                + (x2 * sin x3)
                  * dt)
           ])))

visible sharing:
\s0 x2 x3 x4 dt ->
  dlet
    (x4 * dt)
    (\x6 ->
       dlet
         (negate
            (x4 *
             (tconst 1.0 /
              (x4 * x4 +
               tletR4 (x2 * tletR1 (sin x3)) * tletR4 (x2 * tletR1 (sin x3))))) *
          dt)
         (\x7 ->
            dmkDomains
              (fromList
                 [ dfromR (tfromList [])
                 , dfromR (tletR1 (sin x3) * x7 + tletR2 (sin x3) * x6)
                 , dfromR (cos x3 * (x2 * x7) + cos x3 * (x2 * x6))
                 , dfromR
                     ((tletR4 (x2 * tletR1 (sin x3)) *
                       (tconst 1.0 /
                        (x4 * x4 +
                         tletR4 (x2 * tletR1 (sin x3)) *
                         tletR4 (x2 * tletR1 (sin x3))))) *
                      dt +
                      tletR3 (x2 * tletR2 (sin x3)) * dt)
                 ])))

fooLet :: forall r n. (RealFloat (TensorOf n r), Tensor r, KnownNat n)
       => (TensorOf n r, TensorOf n r, TensorOf n r) -> TensorOf n r
fooLet (x, y, z) = tlet @r @n (x * sin y) $ \w -> atan2 z w + z * w

rev fooLet:
\s0 x2 x3 x4 dt ->
  dlet (negate (x4 * (tconst 1.0
                      / (x4 * x4 + (x2 * sin x3) * (x2 * sin x3))))
        * dt
        + x4 * dt)
       (\x7 ->
    dmkDomains
      (fromList
         [ dfromR (tfromList [])
         , dfromR (sin x3 * x7)
         , dfromR (cos x3 * (x2 * x7))
         , dfromR
             (((x2 * sin x3)
               * (tconst 1.0 / (x4 * x4 + (x2 * sin x3) * (x2 * sin x3))))
              * dt
               + (x2 * sin x3)
                * dt)
         ]))

visible sharing:
\s0 x2 x3 x4 dt ->
  dlet
    (negate
       (x4 *
        (tconst 1.0 /
         (x4 * x4 +
          tletR2 (x2 * tletR1 (sin x3)) * tletR2 (x2 * tletR1 (sin x3))))) *
     dt +
     x4 * dt)
    (\x7 ->
       dmkDomains
         (fromList
            [ dfromR (tfromList [])
            , dfromR (tletR1 (sin x3) * x7)
            , dfromR (cos x3 * (x2 * x7))
            , dfromR
                ((tletR2 (x2 * tletR1 (sin x3)) *
                  (tconst 1.0 /
                   (x4 * x4 +
                    tletR2 (x2 * tletR1 (sin x3)) *
                    tletR2 (x2 * tletR1 (sin x3))))) *
                 dt +
                 tletR2 (x2 * tletR1 (sin x3)) * dt)
            ]))

the corresponding primal part, sharing with the gradient Ast above:
\s0 x2 x3 x4 -> atan2 x4 (tletR2 (x2 * tletR1 (sin x3))) 
                + x4 * tletR2 (x2 * tletR1 (sin x3))
tomsmeding commented 1 year ago

Interesting how there is some sort of re-association going on between the two "visible sharing" versions!

The first has this structure:

let x6 = x4 * dt
in let x7 = bla dt
   in ... a * (b * x6) + a * (b * x7) ...

whereas the second has this structure:

let x7 = bla dt + x4 * dt
in ... a * (b * x7) ...

I didn't expect such reassociation to come from sharing preservation, but maybe I'm shortsighted here.

Also cool stuff! The global sharing representation is still counter-intuitive to read for me.

Mikolaj commented 1 year ago

Interesting how there is some sort of re-association going on between the two "visible sharing" versions!

Yes, this is quite curious. Notice how variable x6 must have been in use in the second variant, but got eliminated somehow --- though not via simplification during vectorization, because the dlet sharing only emerges in transpose and, so far, we don't simplify afterwards. Now that I think about it, in the first example both x6 and x7 were used for terms that landed in the InputR map, while in the second example, x6 must only have been entered at most into the LetR map, which is abandoned after transpose is complete.

Something else that is clearly visible here is how the local sharing in the surface language induces global sharing: not by explicitly translating local variables to global identifiers, but by running the forward pass on some terms only once, not twice, which causes the global identifiers to be generated only once and so shared across the two branches.

tomsmeding commented 1 year ago

Probably depends on the rules. Vectorization using your methods works great. Simplification is almost fine:

let x = fromList [t0, t1] in (x ! [0], x ! [0], ...)

does not simplify to let x = fromList [t0, t1] in (t0, x ! [0], ...), but is instead a normal form (if the three dots are), which is a bummer. But this is a valid trade-off, because t0 may be huge and shared with other terms, so we'd unshare it if we simplified. No tables are needed; it's simple and obviously correct. Some rules, e.g., in vectorization, require extra substitutions, but the normal rule for indexing of a gather (from our paper) already does, so that's nothing exceptional.

Would it be warranted to add a simplification rule like this?

let x = fromList [e1, ..., en] in e
~>
let x1 = e1 in ... in let xn = en in e[fromList [x1,...,xn] / x]

It's not quite complexity preserving if n is unbounded. But maybe we should do this for small n only. Assuming sufficiently intelligent array representations, repeated fromLists of already-materialised arrays shouldn't even do all that much; but maybe that's too optimistic.
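A hedged sketch of the rule over a simplified untyped Ast (the substitution and the fresh names are assumed helpers):

```hs
type VarName = String

data Ast
  = AstVar VarName
  | AstLet VarName Ast Ast
  | AstFromList [Ast]
  -- ... other constructors elided

substitute :: VarName -> Ast -> Ast -> Ast  -- capture-avoiding, assumed
substitute = undefined

smallN :: Int
smallN = 4  -- apply only for small n, to stay roughly complexity-preserving

inlineFromListLet :: Ast -> Ast
inlineFromListLet (AstLet x (AstFromList es) e)
  | length es <= smallN =
      let xs = [x ++ "_" ++ show j | j <- [1 .. length es]]  -- fresh x1..xn
      in foldr (\(xj, ej) body -> AstLet xj ej body)
               (substitute x (AstFromList (map AstVar xs)) e)
               (zip xs es)
inlineFromListLet t = t
```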

The global version

((tletR n (fromList [t0, t1])) ! [0], (tletR n (fromList [t0, t1])) ! [0], ...)

on the other hand, simplifies to

(tletR n2 t0, tletR n2 t0, ...)

Right, global sharing seems to have it much easier here. But it's much harder to maintain -- where does the n2 come from, for example? And what figures out that both tletR _ t0 terms have the same identifier?

  1. The sharing in the transpose code can't even easily be expressed in the local way.

Can't it? Which eval rule is problematic? (Or is it a combination?)

And Delta sharing is probably completely orthogonal and really well suited to the global method (Tree0), because we never rewrite Delta terms in the slightest and we do lots of global cross-referencing of them that is not captured by any syntax.

Yep!

I've added all the needed AstLet rules (two) to the paper.

That is, simplifying index-of-let and vectorising build-of-let?

the transposition can't be expressed as rewriting rules, because it's stateful.

It's only stateful because it's Cayley-transformed. The eval function:

eval :: R -> Delta -> DeltaMap -> DeltaMap

is really eval :: R -> Delta -> Endo DeltaMap: its codomain is Cayley-transformed (think DList difference lists) in order to make things more efficient. (addDelta becomes a bit more expensive this way, because it has to update a value in the map instead of being able to create a singleton map, but addition of maps is much cheaper because surely (.) is more efficient than Map.union.)

So maybe it can be expressed using rewrite rules, but just into the language of Endo DeltaMap, not of DeltaMap?
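A minimal sketch of that reading (simplified to scalar cotangents, with a made-up three-constructor Delta), showing where (.) replaces Map.union:

```hs
import qualified Data.Map.Strict as M

type NodeId = Int
type DeltaMap = M.Map NodeId Double

data Delta = Input NodeId | Add Delta Delta | Scale Double Delta

-- eval c d is an Endo DeltaMap: results combine by composition
eval :: Double -> Delta -> DeltaMap -> DeltaMap
eval c (Input i)   = M.insertWith (+) i c  -- addDelta: update in place
eval c (Add a b)   = eval c a . eval c b   -- (.) instead of Map.union
eval c (Scale k d) = eval (k * c) d
```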

The best we can do is operational semantics, but the sharing we are after is the sharing inside the environments on the left hand side of the turnstile, not on the right hand side, in the terms. So, it can't be expressed in the syntax.

This I don't really follow unfortunately.

In the absence of lambda-abstraction global and local lets are not equivalent. E.g.

tfromList [tgather sh v (\i -> tLetInt n0 i), tgather sh u (\i -> tLetInt n0 i)]

can't be expressed with local lets.

(did you mean "presence", not "absence"?) But that global sharing is invalid, right? Those two i's are distinct expressions with distinct semantics, because they refer to different variables. How can sharing them under the same id n0 ever work in an implementation? What is stored in the memoisation table?

As far as I know, any valid global sharing can be expressed as local sharing, although shared expressions might move a long way up the tree if they're used in textually distant parts of the code.

In the implementation of tlet for Ast, you might want to check whether the right-hand side is already a variable reference, in which case just return that variable reference expression immediately without creating an intermediate let. That will prevent many nested Add0 nodes from creating many redundant nested lets, where one would suffice.

I've implemented this and its analogue for the global lets. Apparently we don't have an evil enough test where this would be a bottleneck, but every bit of simplification helps, especially for readability when debugging.

Yeah maybe it doesn't matter much in practice. But it's an easy simplification to make, and intuitively it shouldn't ever make things worse -- the cost of checking whether the RHS is a variable node should always be offset by saving a let frame at execution time.

but the best I can do is a huge string of lets before a term with no lets in it.

You're right that it seems difficult to do better than this; it definitely lessens the appeal of using "local" lets. Hm, I hoped it wouldn't be this bad.

Would it be better if we made D aware of Ast, i.e. if we made a special-purpose DOfAst that has proper understanding of lets? Something like this:

data DOfAst d a = DOfAst (Ast (a, Dual d a))

i.e. produce the pair late instead of early. This will allow lets to build up naturally inside the AST, but does require you to add pairs to the Ast language.

Mikolaj commented 1 year ago

Would it be warranted to add a simplification rule like this?

let x = fromList [e1, ..., en] in e
~>
let x1 = e1 in ... in let xn = en in e[fromList [x1,...,xn] / x]

It's not quite complexity preserving if n is unbounded. But maybe we should do this for small n only. Assuming sufficiently intelligent array representations, repeated fromLists of already-materialised arrays shouldn't even do all that much; but maybe that's too optimistic.

Right. This may fix that problem. I will try when I revisit simplification rules next time.

Right, global sharing seems to have it much easier here. But it's much harder to maintain -- where does the n2 come from, for example? And what figures out that both tletR _ t0 terms have the same identifier?

Ouch, I've made a mistake. This should be memoized, but obviously it can't work with this precise sharing. We'd need

(tletR k ((tletR n (fromList [t0, t1])) ! [0]), tletR k ((tletR n (fromList [t0, t1])) ! [0]), ...)

and then we memoize that k rewrites to tletR n2 t0 (where n2 is freshly generated), so we 1. rewrite it just once and subsequently only look it up, and 2. consequently get the same n2 in both places. With only the n sharing recorded, we'd need to simplify both occurrences separately and we'd lose their sharing. Which means I was barking up the wrong tree --- the main problem in this example wasn't the kind of sharing but that not everything that was identical was explicitly shared. In such a case local sharing leans towards not simplifying, while global sharing leans towards duplication of work and code, which is probably worse in most situations.

However, when we have local sharing with enough sharing expressed

let x = fromList [t0, t1] in let y = x ! [0] in (y, y, ...)

it's stuck; that's probably what I had in mind, and indeed global sharing more naturally leads to the optimal outcome here.

Risking a premature generalization: if you have maximal sharing, the global style may be better, but with selective sharing (e.g., manual in response to benchmarking), local sharing is just fine, while being less complex.

This makes me reconsider the idea of doing the forward pass (e.g., the numerical classes for dual numbers) and the transpose using global sharing and then performing a sharing recovery pass in order to simplify the resulting Ast before compilation to GPU. If the resultant local sharing is as dense as above, and it may well be, because binary operators appear a lot, then it may turn out that no simplification rule applies at all, because it all stops at lets and variables. I wonder if extra rules such as yours may save the day.

tbc

tomsmeding commented 1 year ago

However, when we have local sharing with enough sharing expressed

let x = fromList [t0, t1] in let y = x ! [0] in (y, y, ...)

it's stuck and that's probably what I had in mind and indeed global sharing more naturally leads to the optimal outcome here.

Except if you add my let-fromList-inline rule from above with n >= 2, in which case this will simplify to let x0 = t0 in (x0, x0, ...).

Risking a premature generalization: if you have maximal sharing, the global style may be better, but with selective sharing (e.g., manual in response to benchmarking), local sharing is just fine, while being less complex.

This would be an interesting proposition. I wonder how much you can massage the two to be equivalent by adding well-chosen simplification rules like the above.

[...] then it may turn out that no simplification rule applies at all, because it all stops at let and variables.

But then simplification should include inlining of cheap computations. This is a standard part of compiler optimisation pipelines. My previous point stands -- I wonder if you can massage this to mostly not differ anymore. :)

Mikolaj commented 1 year ago
  1. The sharing in the transpose code can't even easily be expressed in the local way.

Can't it? Which eval rule is problematic? (Or is it a combination?)

I managed to overcome it, at the cost of complicating our Tensor class collection. The problem was not any rule but overall typing. The result of transpose is a bunch of TensorOf n r terms (each with different n!), which was not expressible using the Tensor classes. The tlet only worked on TensorOf stuff. Now we also have dlet, which works on bunches of tensors.

I've added all the needed AstLet rules (two) to the paper.

That is, simplifying index-of-let and vectorising build-of-let?

Yes.

the transposition can't be expressed as rewriting rules, because it's stateful.

It's only stateful because it's Cayley-transformed. [...]

Oh, cool. Let me digest this and come back.

The best we can do is operational semantics, but the sharing we are after is the sharing inside the environments on the left hand side of the turnstile, not on the right hand side, in the terms. So, it can't be expressed in the syntax.

This I don't really follow unfortunately.

Nothing deep: eval is stateful, so we have operational semantic rules like

s1 |- t -> s2
----------------
s3 |- G(t) -> s4

and the sharing that needs to be introduced is between components of s1 (and, actually, s2, s3 and s4), not between subterms of t and/or G(t).

In the absence of lambda-abstraction global and local lets are not equivalent. E.g.

tfromList [tgather sh v (\i -> tLetInt n0 i), tgather sh u (\i -> tLetInt n0 i)]

can't be expressed with local lets.

(did you mean "presence", not "absence"?) But that global sharing is invalid, right? Those two is are distinct expressions with distinct semantics, because they refer to different variables. How can sharing them under the same id n0 ever work in an implementation? What is stored in the memoisation table?

I don't know, right? :D I just wipe out memoization tables whenever I interpret terms inside a function. WIP. But let me amend the example to show the vague point that I have:

tfromList [tgather sh v (\i -> tLetInt n0 (i + e)), tgather sh u (\i -> tLetInt n0 (i + e))]

and let's assume i does not occur in e. Then if the language had lambda abstraction we could rewrite to a local sharing

let x i = i + e in tfromList [tgather sh v (\i -> tLetInt n0 (x i)), tgather sh u (\i -> tLetInt n0 (x i))]

and we could simplify or interpret or whatever the term e only once, not twice. But without lambda abstractions we can't express this with local sharing, while we could, potentially, extremely smartly detect and process e only once with global sharing. Though, of course, the right way to do this is to explicitly share e, so this again may be the case of not enough sharing exposed. Perhaps with enough sharing there is no discrepancy between local and global. And/or with your extra rules. TODO.

As far as I know, any valid global sharing can be expressed as local sharing, although shared expressions might move a long way up the tree if they're used in textually distant parts of the code.

I guess this moving up the tree involves lambda abstraction?

Yeah maybe it doesn't matter much in practice. But it's an easy simplification to make, and intuitively it shouldn't ever make things worse -- the cost of checking whether the RHS is a variable node should always be offset by saving a let frame at execution time.

Surely. I've since applied it to a couple more places and no deterioration and I bet there are evil examples where it actually improves stuff a lot.

but the best I can do is a huge string of lets before a term with no lets in it.

You're right that it seems difficult to do better than this; it definitely lessens the appeal of using "local" lets. Hm, I hoped it wouldn't be this bad.

Note that the pretty-printed terms are of this form and indeed, I can't see a way out. However, I haven't yet grokked your "transposition is Cayley-transformed" idea, so perhaps this works. Still, I can't see how the forward pass (numeric classes for ADVal) can generate better local sharing, given that both the D datatype and the classes restrict what we can do. Perhaps it's just yet another drawback of the dual numbers approach to AD.

Would it be better if we made D aware of Ast, i.e. if we made a special-purpose DOfAst that has proper understanding of lets? Something like this:

data DOfAst d a = DOfAst (Ast (a, Dual d a))

i.e. produce the pair late instead of early. This will allow lets to build up naturally inside the AST, but does require you to add pairs to the Ast language.

I hoped you wouldn't mention this ;). Our Tensor & Co are not that far from being able to express that, because we already have tD and tScale. But this is no longer dual numbers, you sneaky CHAD, you.

Instead of the current

  D u u' + D v v' = dD (u + v) (dAdd u' v')

we'd need

  u + v = tD (tprimalPart u + tprimalPart v) (tAdd (tdualPart u) (tdualPart v))

where tAdd is not yet added, but would work similarly as tScale, which is

  tScale :: KnownNat n => TensorOf n (Primal r) -> DualOf n r -> DualOf n r
...
instance (ADTensor (Ast0 r), ShowAstSimplify r)
         => Tensor (ADVal (Ast0 r)) where
  tScale = dScale
...           
instance ShowAstSimplify r
         => Tensor (Ast0 r) where
  tScale (AstPrimalPart s) (AstDualPart t) = AstDualPart $ s `tmult` t

I forgot if there's any fundamental typing or logic problem there or just

  1. awful verbosity
  2. less generality (the numeric classes no longer work for Double but only for Tensor, and DOfAst no longer works for our first pipeline nor for the even simpler pipeline that goes from Tensor straight to ADVal Double --- I use all three pipelines for cross-testing).
Mikolaj commented 1 year ago

I've done a very bad thing. In order to pretty-print well, so that I can ask how to implement sharing, I've implemented this sharing method:

storing the let bindings as a third argument of the D constructor. But then all the code involving D suddenly becomes full of boilerplate (e.g., concatenating the lets lists from two D arguments of a function). And the end result is again a huge string of lets before a term with no lets in it, so the locality of local lets is lost anyway.

All global lets are ripped out, but this method is messy, currently 20% slower (80% slower in the Mnist fully connected nn test) and is asymptotically slower (probably quadratic instead of linear or log-linear; I hope not exponential this time) than global sharing in some cases that my large tests apparently don't trigger yet.

Mikolaj commented 1 year ago

Pretty-printing and sharing, continued. [Edited multiple times based on chat feedback.]

foo :: RealFloat a => (a, (a, a)) -> a
foo (x, (y, z)) =
  let w = x * sin y
  in atan2 z w + z * w

(1: naive human version) rev of foo from above. It doesn't fit the Tensor class (tuples are illegal) and so can't be compiled to GPU via Tensor and can't be (easily) used to build larger code with Tensor and then perhaps take rev again.

```hs
revFooHuman :: r -> (r, (r, r)) -> (r, (r, r))
revFooHuman dret (x, (y, z)) =
  let x6 = sin y
      x7 = x * x6
      x8 = recip (z * z + x7 * x7)
      x9 = sin y
      x10 = x * x9
      x11 = z * dret
      x12 = negate (z * x8) * dret
  in ( x6 * x12 + x9 * x11
     , ( cos y * (x * x12) + cos y * (x * x11)
       , (x7 * x8) * dret + x10 * dret ) )
```

(2: automatically printed; the beautifying variant) this is generated from foo instantiated to Tensor class and it has the same interoperability problems as (1), except that the domain and codomain are flattened out. In principle, it's possible to recover the nested tuples (and other non-user defined datatypes) by applying the Adaptor tools to the function before pretty-printing it (but we'd no longer be pretty-printing Ast, but a new syntax made up for pretty-printing). In comparison to the non-simplified version (4), Haskell let is used in place of dlet and tlet from Tensor and a tuple is used instead of the DomainsOf vector. This improves readability but loses sharing (now represented only with Haskell lets that we can't trace), which could be subsequently recovered via stable names or otherwise.

```hs
let revFooHorde :: TensorOf 1 r  -- scalar arguments packed in tensor s0
                -> TensorOf 0 r  -- dret
                -> TensorOf 0 r -> TensorOf 0 r -> TensorOf 0 r  -- tensor arguments
                -> (TensorOf 1 r, TensorOf 0 r, TensorOf 0 r, TensorOf 0 r)
    revFooHorde =
      -- automatically pretty-printed code starts here
      \s0 dret x y z ->  -- could be written (x, y, z) for free, but not (x, (y, z))
        let x6 = sin y
            x7 = x * x6
            x8 = recip (z * z + x7 * x7)
            x9 = sin y
            x10 = x * x9
            x11 = z * dret
            x12 = negate (z * x8) * dret
        in ( tfromList []
           , x6 * x12 + x9 * x11
           , cos y * (x * x12) + cos y * (x * x11)
           , (x7 * x8) * dret + x10 * dret )
```

(3: adapted version) this shows how to ad-hoc adapt (2) to be almost as useful as (4). Sharing is lost but otherwise, this can easily be used for constructing new Tensor programs and for using Tensor tools, such as the future compilation to GPU. While we can adapt the resulting Haskell function to have the nested tuples (lists, Eithers, vectors, etc.) as in the objective function, I don't know how to adapt the gradient Ast in such a way, unless @tomjaguarpaw's biapplicative approach #35 makes it possible.

```hs
revFoo :: TensorOf 1 r  -- scalar arguments packed in tensor s0
       -> TensorOf 0 r  -- dret
       -> TensorOf 0 r -> TensorOf 0 r -> TensorOf 0 r  -- tensor arguments
       -> DomainsOf r
revFoo sI dretI xI yI zI =
  let revFooHorde :: TensorOf 1 r  -- scalar arguments packed in tensor s0
                  -> TensorOf 0 r  -- dret
                  -> TensorOf 0 r -> TensorOf 0 r -> TensorOf 0 r  -- tensor arguments
                  -> (TensorOf 1 r, TensorOf 0 r, TensorOf 0 r, TensorOf 0 r)
      revFooHorde =
        -- automatically pretty-printed code starts here, copied from above,
        -- except for (x, y, z)
        \s0 dret (x, y, z) ->  -- the tuple is a possible variant, for illustration
          let x6 = sin y
              x7 = x * x6
              x8 = recip (z * z + x7 * x7)
              x9 = sin y
              x10 = x * x9
              x11 = z * dret
              x12 = negate (z * x8) * dret
          in ( tfromList []
             , x6 * x12 + x9 * x11
             , cos y * (x * x12) + cos y * (x * x11)
             , (x7 * x8) * dret + x10 * dret )
        -- automatically pretty-printed code ends here
  in toDomains'' $ gradient sI dretI (xI, yI, zI)
```

(4: automatically printed; the raw variant that exposes all sharing) rev of foo with sharing represented using Tensor. This can be compiled to GPU straight away (once we implement MLIR). Note that the curried domain is morally DomainsOf r, but it doesn't have to be represented so, because it's easy to automatically map the flattened sequence of arguments to the DomainsOf vector.

```hs
let revFooHordeRaw :: TensorOf 1 r  -- scalar arguments packed in tensor s0
                   -> TensorOf 0 r  -- dret
                   -> TensorOf 0 r -> TensorOf 0 r -> TensorOf 0 r  -- tensor arguments
                   -> DomainsOf r
    revFooHordeRaw =
      -- automatically pretty-printed code starts here
      \s0 dret x y z ->
        dlet (sin y) (\x6 ->
          dlet (x * x6) (\x7 ->
            dlet (recip (z * z + x7 * x7)) (\x8 ->
              dlet (sin y) (\x9 ->
                dlet (x * x9) (\x10 ->
                  dlet (z * dret) (\x11 ->
                    dlet (negate (z * x8) * dret) (\x12 ->
                      dmkDomains
                        (fromList
                           [ dfromR (tfromList [])
                           , dfromR (x6 * x12 + x9 * x11)
                           , dfromR (cos y * (x * x12) + cos y * (x * x11))
                           , dfromR ((x7 * x8) * dret + x10 * dret)
                           ]))))))))
```

This is foo with explicitly expressed sharing that survives the interpretations and passes, but requires instantiating foo to Tensor from the very start.

fooLet :: forall r n. (RealFloat (TensorOf n r), Tensor r, KnownNat n)
       => (TensorOf n r, (TensorOf n r, TensorOf n r)) -> TensorOf n r
fooLet (x, (y, z)) =
  let w0 = x * sin y
  in tlet w0 $ \w ->
     atan2 z w + z * w

rev of fooLet (instantiated to TensorOf 0 r) printed with sharing represented with Haskell lets

```hs
\s0 dret x y z ->
  let x7 = sin y
      x8 = x * x7
      x9 = recip (z * z + x8 * x8)
      x10 = negate (z * x9) * dret + z * dret
  in ( tfromList []
     , x7 * x10
     , cos y * (x * x10)
     , (x8 * x9) * dret + x8 * dret )
```

Relu. [Edit: greatly simplified just as Tom guessed it should be.]

relu :: forall n r. (ADReady r, KnownNat n, Num (TensorOf n r))
     => TensorOf n r -> TensorOf n r
relu v =
  let oneIfGtZero = tmap0N (\x -> ifB (x <=* 0) 0.0 1.0) v
  in oneIfGtZero * v

https://github.com/Mikolaj/horde-ad/blob/cc5c6198197ca72cb92234a21ff79d6709684070/simplified/HordeAd/Core/TensorClass.hs#L296-L298

After we apply it to a variable of shape [3,4] (which probably can't be expressed in the user language without introducing special functions that take arguments only to compute their shapes, etc.), vectorization kicks in, first transforming the nested `tbuild1`, then the outer one.

```hs
-- START of vectorization for term
tbuild1 4 (\x5 -> tconstant (tfromList [tconst 0.0, tconst 1.0]
                             ! [ifB (m3 ! [i4, i5] <=* tconst 0.0) 0 1]))
-- END of vectorization yields
tconstant (tgather [4] (tconst (fromList [2] [0.0, 1.0]))
                   (\[i5] -> [ifB (m3 ! [i4, i5] <=* tconst 0.0) 0 1]))

-- START of vectorization for term
tbuild1 3 (\v4 -> tconstant (tgather [4] (tconst (fromList [2] [0.0, 1.0]))
                                     (\[i5] -> [ifB (m3 ! [i4, i5] <=* tconst 0.0) 0 1])))
-- END of vectorization yields
tconstant (tgather [3, 4] (tconst (fromList [2] [0.0, 1.0]))
                   (\[i6, i5] -> [ifB (m3 ! [i6, i5] <=* tconst 0.0) 0 1]))
```

The primal part function, after applying it to a variable of shape [3,4], vectorization and the forward pass. (Apparently the variable name counter had a different state than when the vectorization log was captured.)

```hs
\s0 m3 ->
  let m9 = tgather [3,4] (tconst (fromList [2] [0.0,1.0]))
                   (\[i7, i8] -> [ifB (m3 ! [i7, i8] <=* tconst 0.0) 0 1])
  in m9 * m3
```

Its dual part, where `AstVarId 9` is `m9`.

```hs
LetR 3 (ScaleR (AstVar [3,4] (AstVarId 9)) (InputR (InputId 0)))
```

Its gradient function.

```hs
\s0 dret m3 ->
  let m9 = tgather [3,4] (tconst (fromList [2] [0.0,1.0]))
                   (\[i7, i8] -> [ifB (m3 ! [i7, i8] <=* tconst 0.0) 0 1])
  in (tfromList [], m9 * dret)
```

Relu of multiplication by a Double scalar.

reluT2 :: (TensorOf 1 (Ast0 Double), Ast0 Double)
       -> TensorOf 1 (Ast0 Double)
reluT2 (t, r) = relu (t * tkonst 5 (tscalar r))
After we apply it to a variable of shape `[5]` (for simplicity, since it generates a gradient Ast 40 characters shorter than for `[3,4]`), vectorization kicks in.

```hs
-- START of vectorization for term
tbuild1 5 (\x4 -> tconstant (tfromList [tconst 0.0, tconst 1.0]
                             ! [ifB ((v3 * tkonst 5 (s0 ! [0])) ! [i4] <=* tconst 0.0) 0 1]))
-- END of vectorization yields
tconstant (tgather [5] (tconst (fromList [2] [0.0, 1.0]))
                   (\[i4] -> [ifB (v3 ! [i4] * s0 ! [0] <=* tconst 0.0) 0 1]))
```

The primal part function, after applying it to a variable of shape [5], vectorization and the forward pass.

```hs
\s0 v3 ->
  let v6 = tkonst 5 (s0 ! [0])
      v7 = tgather [5] (tconst (fromList [2] [0.0, 1.0]))
                   (\[i5] -> [ ifB ((let x11 = v3 ! [i5]
                                         x12 = s0 ! [0]
                                     in x11 * x12) <=* tconst 0.0) 0 1 ])
      v8 = v3 * v6
  in v7 * v8
```

Its dual part.

```hs
LetR 10
  (ScaleR (AstVar [5] (AstVarId 7))
     (LetR 9
        (AddR (ScaleR (AstVar [5] (AstVarId 6)) (InputR (InputId 0)))
              (ScaleR (AstVar [5] (AstVarId 3))
                 (LetR 8
                    (KonstR 5
                       (LetR 7
                          (IndexZ (LetR 6 (FromVectorR [ScalarR (Input0 (InputId 0))]))
                                  [AstIntConst 0] [1]))))))))
```

Its gradient function, which looks misleading, because the `tscatter` does not arise as a transpose of `tgather`, as would be expected, but is the transpose of indexing (which in turn is automatically inserted while packing all scalars into a single tensor) and is doubly trivial in this case and should be eliminated, leaving in the final line just `(tsum (x3 * x9), x6 * x9)`. BTW, note that some sharing (and its variables) comes from the forward pass (these appear in the primal part function as well), while other sharing appears as late as the transpose pass (`v9`, `v10`).

```hs
\s0 dret v3 ->
  let v6 = tkonst 5 (s0 ! [0])
      v7 = tgather [5] (tconst (fromList [2] [0.0, 1.0]))
                   (\[i5] -> [ ifB ((let x11 = v3 ! [i5]
                                         x12 = s0 ! [0]
                                     in x11 * x12) <=* tconst 0.0) 0 1 ])
      v8 = v3 * v6
      v9 = v7 * dret
      v10 = tscatter [1] (tsum (v3 * v9)) (\[] -> [0])
  in (tfromList [tconst 0.0 + v10 ! [0]], v6 * v9)
```

Fortunately our Ast simplification handles this misleading trivial `tscatter` case, producing the following.

```hs
\s0 dret v3 ->
  let v9 = tconstant (tgather [5] (tconst (fromList [2] [0.0, 1.0]))
                              (\[i5] -> [ifB (v3 ! [i5] * s0 ! [0] <=* tconst 0.0) 0 1]))
           * dret
  in (tkonst 1 (tconst 0.0 + tsum (v3 * v9)), tkonst 5 (s0 ! [0]) * v9)
```

The same with relu implemented using maxB. I haven't benchmarked, but it looks like a complete disaster at least allocation-wise. If we had an Ast term for maxB, it would probably behave better, at the cost of extra vectorization, simplification, forward pass and transpose rules.

```hs
reluMax :: forall n r. (ADReady r, KnownNat n)
        => TensorOf n r -> TensorOf n r
reluMax v = tmap0N (maxB 0) v
```

https://github.com/Mikolaj/horde-ad/blob/cc5c6198197ca72cb92234a21ff79d6709684070/simplified/HordeAd/Core/TensorClass.hs#L296-L298

https://github.com/conal/Boolean/blob/f218d5ecee207d04e096cfa0e39276f5c07d7299/src/Data/Boolean.hs#L123-L125
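For reference, the `maxB` behind that link boils down to one conditional per application, roughly as below (a from-memory sketch of the Data.Boolean definition; check the linked source for the authoritative version). This is why `tmap0N (maxB 0)` elaborates to the per-element `tfromList`-and-index terms in the logs that follow.

```hs
import Data.Boolean (IfB, OrdB, ifB, (>=*))

-- Sketch of Data.Boolean's maxB: a max built from an ifB conditional,
-- so mapping it over a tensor yields one conditional per element.
myMaxB :: (IfB a, OrdB a) => a -> a -> a
myMaxB u v = ifB (u >=* v) u v
```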

After we apply it to a variable of shape [3,4], vectorization kicks in, first transforming the nested `tbuild1`, then the outer one.

```hs
-- START of vectorization for term
tbuild1 4 (\x5 -> tfromList [tconst 0.0, m3 ! [i4, i5]]
                  ! [ifB (tconst 0.0 >=* m3 ! [i4, i5]) 0 1])
-- END of vectorization yields
tgather [4] (tfromList [tconstant (tkonst 4 (tconst 0.0)), m3 ! [i4]])
            (\[i6] -> [ifB (tconst 0.0 >=* m3 ! [i4, i6]) 0 1, i6])
-- START of vectorization for term
tbuild1 3 (\v4 -> tgather [4] (tfromList [tconstant (tkonst 4 (tconst 0.0)), m3 ! [i4]])
                          (\[i6] -> [ifB (tconst 0.0 >=* m3 ! [i4, i6]) 0 1, i6]))
-- END of vectorization yields
tgather [3, 4] (tfromList [tconstant (tkonst 3 (tkonst 4 (tconst 0.0))), m3])
               (\[i7, i6] -> [ifB (tconst 0.0 >=* m3 ! [i7, i6]) 0 1, i7, i6])
```

primal

```hs
\s0 m3 ->
  tgather [3, 4] (tfromList [tkonst 3 (tkonst 4 (tconst 0.0)), m3])
                 (\[i8, i9] -> [ifB (tconst 0.0 >=* m3 ! [i8, i9]) 0 1, i8, i9])
```

dual

```hs
LetR 5 (GatherZ [3,4] (LetR 4 (FromListR [ZeroR, InputR (InputId 0)])) [2,3,4])
```

gradient

```hs
\s0 dret m3 ->
  let t12 = tscatter [2, 3, 4] dret
                     (\[i10, i11] -> [ifB (tconst 0.0 >=* m3 ! [i10, i11]) 0 1, i10, i11])
  in (tfromList [], t12 ! [1])
```

For comparison, here's the gradient Ast for the fast implementation of `relu` copied from above. It uses the relatively fast `tgather` operation materializing a tensor of shape [3, 4] instead of `tscatter` materializing a tensor of shape [2, 3, 4] (which gets promptly projected, leaving only its second column).

```hs
\s0 dret m3 ->
  let m9 = tgather [3,4] (tconst (fromList [2] [0.0,1.0]))
                         (\[i7, i8] -> [ifB (m3 ! [i7, i8] <=* tconst 0.0) 0 1])
  in (tfromList [], m9 * dret)
```

Relu (using maxB) of multiplication by a Double scalar, as before.

```hs
reluMaxT2 :: (TensorOf 1 (Ast0 Double), Ast0 Double)
          -> TensorOf 1 (Ast0 Double)
reluMaxT2 (t, r) = reluMax (t * tkonst 5 (tscalar r))
```
After we apply it to a variable of shape `[5]`, vectorization kicks in.

```hs
-- START of vectorization for term
tbuild1 5 (\x4 -> tfromList [tconst 0.0, (v3 * tkonst 5 (s0 ! [0])) ! [i4]]
                  ! [ifB (tconst 0.0 >=* (v3 * tkonst 5 (s0 ! [0])) ! [i4]) 0 1])
-- END of vectorization yields
tgather [5] (tfromList [tconstant (tkonst 5 (tconst 0.0)), v3 * tkonst 5 (s0 ! [0])])
            (\[i5] -> [ifB (tconst 0.0 >=* v3 ! [i5] * s0 ! [0]) 0 1, i5])
```

primal

```hs
\s0 v3 ->
  let v6 = tkonst 5 (s0 ! [0])
  in tgather [5] (tfromList [tkonst 5 (tconst 0.0), v3 * v6])
                 (\[i7] -> [ ifB (tconst 0.0 >=* (let x14 = v3 ! [i7]
                                                      x15 = s0 ! [0]
                                                  in x14 * x15)) 0 1
                           , i7 ])
```

dual

```hs
LetR 14
  (GatherZ [5]
     (LetR 13
        (FromListR
           [ ZeroR
           , LetR 12
               (AddR (ScaleR (AstVar [5] (AstVarId 6)) (InputR (InputId 0)))
                     (ScaleR (AstVar [5] (AstVarId 3))
                        (LetR 11
                           (KonstR 5
                              (LetR 10
                                 (IndexZ (LetR 9 (FromVectorR [ScalarR (Input0 (InputId 0))]))
                                         [AstIntConst 0] [1])))))) ]))
     [2,5])
```

gradient

```hs
\s0 dret v3 ->
  let v6 = tkonst 5 (s0 ! [0])
      m11 = tscatter [2, 5] dret
                     (\[i8] -> [ ifB (tconst 0.0 >=* (let x9 = v3 ! [i8]
                                                          x10 = s0 ! [0]
                                                      in x9 * x10)) 0 1
                               , i8 ])
      v12 = m11 ! [1]
      v13 = tscatter [1] (tsum (v3 * v12)) (\[] -> [0])
  in (tfromList [tconst 0.0 + v13 ! [0]], v6 * v12)
```

Our Ast simplification pass can deal with the trivial `tscatter`, but not the non-trivial one.

```hs
\s0 dret v3 ->
  let v12 = tscatter [2, 5] dret
                     (\[i8] -> [ifB (tconst 0.0 >=* v3 ! [i8] * s0 ! [0]) 0 1, i8])
            ! [1]
  in (tkonst 1 (tconst 0.0 + tsum (v3 * v12)), tkonst 5 (s0 ! [0]) * v12)
```

For comparison, here's the gradient Ast for the fast implementation of `relu` copied from above. Again, it uses the relatively fast `tgather` operation materializing a tensor of shape [5] instead of `tscatter` materializing a tensor of shape [2, 5] (which gets promptly projected, leaving only its second column).

```hs
\s0 dret v3 ->
  let v9 = tconstant (tgather [5] (tconst (fromList [2] [0.0, 1.0]))
                                  (\[i5] -> [ifB (v3 ! [i5] * s0 ! [0] <=* tconst 0.0) 0 1]))
           * dret
  in (tkonst 1 (tconst 0.0 + tsum (v3 * v9)), tkonst 5 (s0 ! [0]) * v9)
```
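As an aside, the "trivial" `tscatter` above can be pictured with a toy rank-1 model in plain Haskell (illustrative names and semantics, not horde-ad's API): scattering a single value through the constant index map `\[] -> [0]` merely wraps it in a one-element tensor, which is why the simplifier can rewrite it to `tkonst 1`, while a data-dependent index map like the one in `m11` admits no such local rewrite.

```hs
import Data.List (foldl')

-- Toy rank-1 scatter: add each element of xs into the output slot chosen
-- by the index map f. (Out-of-range writes are dropped here; the exact
-- convention doesn't matter for the trivial case.)
scatter1 :: Num a => Int -> [a] -> (Int -> Int) -> [a]
scatter1 n xs f = foldl' upd (replicate n 0) (zip [0 ..] xs)
  where
    upd acc (i, x)
      | j < 0 || j >= n = acc
      | otherwise = [if k == j then v + x else v | (k, v) <- zip [0 ..] acc]
      where j = f i

-- The trivial case: one value scattered to index 0 of a one-element
-- vector is just that value, wrapped.
-- >>> scatter1 1 [42] (const 0)
-- [42]
```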
tomsmeding commented 1 year ago

As far as I know, any valid global sharing can be expressed as local sharing, although shared expressions might move a long way up the tree if they're used in textually distant parts of the code.

I guess this moving up the tree involves lambda abstraction?

I'm not sure what you mean with this. If you have lambda abstractions in your trees, the moving upwards might cross lambda abstractions, but it's still just moving the expression higher up the tree. E.g.

```hs
bla (bla (n + tletR1 e) bla) foo (foo (tletR1 e + bar) foo)
~>
let x = e
in bla (bla (n + x) bla) foo (foo (x + bar) foo)
```

in which e has moved up through multiple AST nodes to settle in its final place. But I believe you already understand this, given your notes about putting the bindings up all the way in D, which is what you need to do when you want to express sharing between two separate trees. Which is apparently not great.

eval is cayley-transformed

TODO think about this

I hoped you wouldn't mention this ;). Our Tensor & Co are not that far from being able to express that, because we already have tD and tScale. But this is no longer dual numbers, you sneaky CHAD, you.

Yes lol, you're right. Err, yeah. ¯\_(ツ)_/¯

All global lets are ripped out, but this method is messy, currently 20% slower (80% slower in the Mnist fully connected nn test) and is asymptotically slower (probably quadratic instead of linear or log-linear; I hope not exponential this time) than global sharing in some cases that my large tests apparently don't trigger yet.

:D :(


About your pretty-printed story:

(1: human version) rev of foo from above.

My hand-written, systematic version looks like this:

```hs
revFooTom :: r -> (r, (r, r)) -> (r, (r, r))
revFooTom dt (x, (y, z)) =
  let s = sin y
      w = x * s
      a = atan2 z w
      b = z * w
      c = a + b
      dc = dt
      da = dc
      db = dc
      dz1 = da * w / (z * z + w * w)
      dw1 = da * -z / (z * z + w * w)
      dz2 = db * w
      dw2 = db * z
      dw = dw1 + dw2
      dz = dz1 + dz2
      dx = dw * s
      ds = dw * x
      dy = ds * cos y
  in (dx, (dy, dz))
```

which simplifies to this, if you inline definitions that either just rebind a name or are used at most once:

```hs
revFooTom :: r -> (r, (r, r)) -> (r, (r, r))
revFooTom dt (x, (y, z)) =
  let s = sin y
      w = x * s
      dw = dt * -z / (z * z + w * w) + dt * z
  in ( dw * s
     , ( (dw * x) * cos y
       , dt * w / (z * z + w * w) + dt * w
       )
     )
```

Why do you compute sin y and w twice in your hand-written code? Oh, it's because the user did not annotate sharing, right? Blegh. This would just work in Accelerate. [oh, you discuss this further below as well]

Also, your (4) version includes the snippet negate (z * (tconst 1.0 / (z * z + (z * z + x7 * x7)))), which is surely not correct? Where does the 2z^2 come from?

Re your relu examples, what are the lessons we should draw from this? It's good to have examples, but the only thing I'm lifting from it is that 1. scatter with invertible indexing function needs to be simplified (perhaps to a gather), and 2. it would really be nice if we can do some fusion after AD again so that snippets like

```hs
tgather [5] (tconst (fromList [2] [0.0,1.0]))
            (\[i5] -> [ifB ((let x12 = s0 ! [0] in
                             let x13 = x3 ! [i5]
                             in x13 * x12) <=* tconst 0.0) 0 1])
```

don't persist into code generation but instead become something like this:

```hs
ifB (((x3 ! [i5]) * (s0 ! [0])) <=* tconst 0.0)
  (tkonst 5 (tscalar 0.0))
  (tkonst 5 (tscalar 1.0))
```

and subsequently this:

```hs
tkonst 5 (tscalar
  (ifB (((x3 ! [i5]) * (s0 ! [0])) <=* tconst 0.0)
     0.0 1.0))
```

or something.

Mikolaj commented 1 year ago

I guess this moving up the tree involves lambda abstraction?

I'm not sure what you mean with this.

I think we've already discussed that on the call a week ago? I've shown you these snippets:

```hs
tfromList [tgather sh v (\i -> tLetInt n0 (i + e)), tgather sh u (\i -> tLetInt n0 (i + e))]
let x i = i + e in tfromList [tgather sh v (\i -> tLetInt n0 (x i)), tgather sh u (\i -> tLetInt n0 (x i))]
```

and we agreed the real problem here was too little exposed sharing and that with enough sharing lambda-abstraction is not necessary to lift sharing up the tree.

If you have lambda abstractions in your trees, the moving upwards might cross lambda abstractions, but it's still just moving the expression higher up the tree. [...]

Well, sure, this is simple as long as the shared expression does not contain the lambda-abstracted variables, right? But I think we talked about that.
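To spell out the distinction with a toy in plain Haskell (not Ast terms; `expensive` stands for any shared subcomputation):

```hs
-- Illustrative only.
expensive :: Double -> Double
expensive y = y * y

-- Shared term closed w.r.t. the lambda's variable i: lift it above the
-- lambda as a plain let.
liftedClosed :: Double -> [Double]
liftedClosed e = let x = expensive e
                 in map (\i -> x + i) [1, 2, 3]

-- Shared term mentions the bound variable i: lifting it out forces a
-- function binding (a lambda), since a plain let-bound x would be out of
-- scope of i.
liftedOpen :: [Double]
liftedOpen = let h i = expensive i
             in map (\i -> h i + h i) [1, 2, 3]
```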

eval is cayley-transformed

TODO think about this

💯

OTOH, there is merit in leaving delta evaluation as it is for now and mentioning in the paper "it's the same as in the POPL paper, only extended and expressed in the language of Tensor". Which is almost true and actually a nice property of the symbolic execution trick: we really use the same delta handling as for the concrete floats.

In the long term, surely reworking this would give us new insight and the (relatively small) new sharing introduced by delta evaluation could perhaps be inserted locally, not at the top level, where the large amount of sharing coming from the forward pass has to reside (if we stick to simple dual number representation).

which simplifies to this, if you inline definitions that either just rebind a name or are used at most once:

```hs
revFooTom :: r -> (r, (r, r)) -> (r, (r, r))
revFooTom dt (x, (y, z)) =
  let s = sin y
      w = x * s
      dw = dt * -z / (z * z + w * w) + dt * z
  in ( dw * s
     , ( (dw * x) * cos y
       , dt * w / (z * z + w * w) + dt * w
       )
     )
```

Perfect. This is equal to the snippet shown below "rev of fooLet (instantiated to TensorOf 0 r) printed with loss of sharing."

Why do you compute sin y and w twice in your hand-written code? Oh, it's because the user did not annotate sharing, right? Blegh. This would just work in Accelerate. [oh, you discuss this further below as well]

This is interesting. I'd like it to work without tlet in horde-ad as well. How does Accelerate do it?

Also, your (4) version includes the snippet negate (z * (tconst 1.0 / (z * z + (z * z + x7 * x7)))), which is surely not correct? Where does the 2z^2 come from?

Well spotted. I've corrected it now. My mistake when manually formatting and, at the same time, tweaking pretty-printing and updating the results.

Re your relu examples, what are the lessons we should draw from this?

I'm sure @awf will draw some lessons. Myself, I'm curious what the most efficient way to define relu is. I'm demonstrating only two ways and I think the first (multiplication by a 0/1 vector) is clearly superior to the second (mapping max), but I'm a newbie.

It's good to have examples, but the only thing I'm lifting from it is that 1. scatter with invertible indexing function needs to be simplified (perhaps to a gather),

Me, similarly, but I wanted to introduce tupdate for that purpose in #101 and basically copy how tgather is fused and simplified ATM (with tindex playing a role similar to the one that, I'm guessing, tupdate would play).

and 2. it would really be nice if we can do some fusion after AD again

I do apply the full simplification pass and show the result below the remark "Unfortunately, at this time we don't have any scatter simplification rules", but the result is very unsatisfying.

so that snippets like

```hs
tgather [5] (tconst (fromList [2] [0.0,1.0]))
            (\[i5] -> [ifB ((let x12 = s0 ! [0] in
                             let x13 = x3 ! [i5]
                             in x13 * x12) <=* tconst 0.0) 0 1])
```

don't persist into code generation but instead become something like this:

```hs
ifB (((x3 ! [i5]) * (s0 ! [0])) <=* tconst 0.0)
  (tkonst 5 (tscalar 0.0))
  (tkonst 5 (tscalar 1.0))
```

Unfortunately we don't rewrite gathers into conditionals, if only because we don't have (float) conditionals in the syntax (also, sadly, the i5 variable is dangling, so this particular term can't be simplified that way, but there are cases where it can and I can only hope the current tgather simplification is complete and handles them --- do you have an example I could check?).

and subsequently this:

```hs
tkonst 5 (tscalar
  (ifB (((x3 ! [i5]) * (s0 ! [0])) <=* tconst 0.0)
     0.0 1.0))
```

or something.

This is a pretty cool simplification, but it's non-local, which makes it costly --- you have to look at the two branches and ensure they have the same tree prefix (or that they can be rewritten, perhaps complicating them, to have the same prefix).

tomsmeding commented 1 year ago

Well, sure, this is simple as long as the shared expression does not contain the lambda-abstracted variables, right? But I think we talked about that.

Oh that's what you meant. Yes, no please don't introduce lambdas while lifting. Indeed we talked about this, and the answer was that things become really annoying with scoped global sharing identifiers. :D (At least until you eliminate them and make them local sharing, if at all)

Why do you compute sin y and w twice in your hand-written code? Oh, it's because the user did not annotate sharing, right? Blegh. This would just work in Accelerate. [oh, you discuss this further below as well]

This is interesting. I'd like it to work without tlet in horde-ad as well. How does Accelerate do it?

Stable names to get global sharing, then the typed sharing recovery algorithm to convert to local sharing. Accelerate literature calls this whole procedure "sharing recovery" instead of just the second half.
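For reference, a minimal sketch of the stable-names half of that pipeline, over a hypothetical `Exp` type. Real implementations force nodes before naming them, handle `hashStableName` collisions, and still need the typed recovery pass that turns these counts into local lets, which is the larger half.

```hs
import Control.Monad (foldM)
import qualified Data.IntMap.Strict as IM
import System.Mem.StableName (hashStableName, makeStableName)

-- A hypothetical embedded AST; only the technique matters here.
data Exp = Lit Double | Add Exp Exp | Mul Exp Exp

children :: Exp -> [Exp]
children (Lit _)   = []
children (Add a b) = [a, b]
children (Mul a b) = [a, b]

-- Count occurrences of each heap node, descending into children only on
-- the first visit. Nodes counted more than once are the sharing points
-- where recovery would insert a let.
countNodes :: Exp -> IO (IM.IntMap Int)
countNodes = go IM.empty
  where
    go acc e = do
      h <- hashStableName <$> makeStableName e
      if h `IM.member` acc
        then pure (IM.adjust (+ 1) h acc)
        else foldM go (IM.insert h 1 acc) (children e)
```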

Unfortunately we don't rewrite gathers into conditionals, if only because we don't have (float) conditionals in the syntax (also, sadly, the i5 variable is dangling, so this particular term can't be simplified that way, but there are cases where it can and I can only hope the current tgather simplification is complete and handles them --- do you have an example I could check?).

Oh oops you're right, that was an invalid transformation. Still the intent was not to turn gather into conditionals but to commute the ifB and the gather, assuming (incorrectly) that the condition was independent of the index variable.

This is a pretty cool simplification, but it's non-local, which makes it costly --- you have to look at the two branches and ensure they have the same tree prefix (or that they can be rewritten, perhaps complicating them, to have the same prefix).

Yeah, same commutation operation, just the other way round.

In any case the same snippet would be more neatly written as:

```hs
tbuild [5] (\[i5] -> ifB (x3![i5] * s0![0] <=* tconst 0.0) 0.0 1.0)
```

which is indeed a build, but its body doesn't really have branching (interpreting the ifB as a select, as I think you already do, is perfectly valid and even the optimal choice) so it's vectorisable. You just don't want to vectorise it to materialised tkonst 5 0.0 and tkonst 5 1.0 vectors, you want some sort of non-materialised intermediate array. I'm not sure how we want to design this. Accelerate has a delayed combinator in its post-fusion AST with the same signature as build (generate in Accelerate lingo) that has the same semantics as build, but operationally recomputes its body whenever it is indexed into. Such a delayed combinator is, I believe, not admissible as an argument to every other combinator, but it works with most.
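One way to picture such a delayed combinator, as a toy in plain Haskell (illustrative names; Accelerate's actual representation is typed and fusion-oriented quite differently):

```hs
-- A toy delayed-vs-manifest array: Delayed has build's signature and
-- semantics but recomputes its body at every index; Manifest is storage.
data Arr a
  = Manifest [a]
  | Delayed Int (Int -> a)

index :: Arr a -> Int -> a
index (Manifest xs) i = xs !! i
index (Delayed _ f) i = f i

size :: Arr a -> Int
size (Manifest xs) = length xs
size (Delayed n _) = n

-- Consumers that index each slot once can accept Delayed directly and
-- never pay for materialisation:
sumArr :: Num a => Arr a -> a
sumArr a = sum [index a i | i <- [0 .. size a - 1]]

-- Other consumers (or repeated ones) force it first:
materialise :: Arr a -> Arr a
materialise (Delayed n f) = Manifest (map f [0 .. n - 1])
materialise a = a
```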

Mikolaj commented 1 year ago

All global lets are ripped out, but this method is messy, currently 20% slower (80% slower in the Mnist fully connected nn test) and is asymptotically slower (probably quadratic instead of linear or log-linear

I've eliminated the 20% and 80% slowdown at the cost of some extra hacks (and one more Ast constructor). The quadratic complexity is still there in pathological cases, e.g., when the API user takes Ast primal parts of dual numbers and embeds them in new dual numbers an enormous number of times. So perhaps the terrible top-level-local sharing may stay for now, while we are still taking over the world and so our resources are limited.

Yes, no please don't introduce lambdas while lifting. Indeed we talked about this, and the answer was that things become really annoying with scoped global sharing identifiers. :D (At least until you eliminate them and make them local sharing, if at all)

Done. :)

Stable names to get global sharing, then the typed sharing recovery algorithm to convert to local sharing. Accelerate literature calls this whole procedure "sharing recovery" instead of just the second half.

Got it. Now we only need a volunteer for #102. This is pretty standalone, I think.

In any case the same snippet would be more neatly written as:

```hs
tbuild [5] (\[i5] -> ifB (x3![i5] * s0![0] <=* tconst 0.0) 0.0 1.0)
```

which is indeed a build, but its body doesn't really have branching (interpreting the ifB as a select, as I think you already do, is perfectly valid and even the optimal choice) so it's vectorisable. You just don't want to vectorise it to materialised tkonst 5 0.0 and tkonst 5 1.0 vectors, you want some sort of non-materialised intermediate array. I'm not sure how we want to design this. Accelerate has a delayed combinator in its post-fusion AST with the same signature as build (generate in Accelerate lingo) that has the same semantics as build, but operationally recomputes its body whenever it is indexed into. Such a delayed combinator is, I believe, not admissible as an argument to every other combinator, but it works with most.

This is complex, but makes sense. I've seen that massiv also has such intermediate arrays (but hmatrix doesn't, IIRC). We could start by naming this idiom in a universally understandable way, vectorizing build to this new form when appropriate, but rewriting to or interpreting via a silly gather for now. We'd add a comment that it can be optimized via delayed arrays in the future.