Mikolaj / horde-ad

Higher Order Reverse Derivatives Efficiently - Automatic Differentiation library based on the paper "Provably correct, asymptotically efficient, higher-order reverse-mode automatic differentiation"
BSD 3-Clause "New" or "Revised" License

[Outdated] Discussion about the wiki page Extending Delta to primitive Vector type #4

Closed Mikolaj closed 1 year ago

Mikolaj commented 2 years ago

Let's discuss https://github.com/Mikolaj/horde-ad/wiki/Extending-Delta-to-primitive-Vector-type here.

Other discussions and the meta discussion about how to discuss are at https://github.com/Mikolaj/mostly-harmless/discussions

Mikolaj commented 2 years ago

Let's move the discussion from the wiki discussion area here. First my long monologue:

Mikolaj says: as a user of Delta that implements a simple MNIST neural network, I need to be able to express (the gradients of) the following machinery (I mention only the operations on vectors; there are a couple more on matrices or on matrices and vectors).

Cross-entropy function that takes vectors res and targ to

-(log res <.> targ)

where (<.>) is dot product.

Soft-max activation function that takes vector x to

konst (1 / sumElements (exp x)) k * (exp x)

where konst r k produces a vector of k elements r and sumElements is exactly what you'd expect (the names are from the hmatrix library).
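For orientation, here are the two functions written plainly (no dual numbers) in hmatrix terms; a sketch of the non-differentiated versions only:

import Numeric.LinearAlgebra (Vector, (<.>), konst, size, sumElements)

-- Plain, non-differentiated versions of the two functions above.
crossEntropy :: Vector Double -> Vector Double -> Double
crossEntropy res targ = negate (log res <.> targ)

softMax :: Vector Double -> Vector Double
softMax x = konst (1 / sumElements expX) (size x) * expX
  where expX = exp x  -- exp and (*) act elementwise via the hmatrix instances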

The gradient of konst is something close to sumElements and vice versa, and the gradient of (<.>) may be a sum of scaled gradients of its arguments, but I'm not clear on what we are scaling by (e.g., by the sum of elements of the argument vectors?).

Mikolaj says: Oh, actually the derivation at the top of this page clearly shows the gradients of the vectors are scaled by the very argument vectors, as in my abandoned implementation on branch hTensor:

(<.>!) :: Coord r
       => DualDelta (Array r) -> DualDelta (Array r) -> DualDelta (Array r)
(<.>!) (D u u') (D v v') =
  let uV = asVector u
      vV = asVector v
  in D (Util.scalar $ uV <.> vV) (Add (Scale v u')  -- probably too naive
                                      (Scale u v'))

Mikolaj says: Since sumElements can be expressed as a dot product with a vector containing only ones and konst can be expressed by scaling such a vector with a scalar and since we can provide such a vector (probably one for each length? that makes it problematic) even just as a dummy variable, we only need one more operation to cover our vector needs: scaling a vector by a scalar. That's the simplest example of an operation between entities with different rank. However, in this case we can avoid any such operation by providing konst primitive instead. Is there any trick to avoid providing both the operation and konst?
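The identities in question, written out plainly in hmatrix terms (a sketch only, no dual numbers involved):

import Numeric.LinearAlgebra (Vector, (<.>), konst, scale, size)

-- sumElements as a dot product with a vector of ones,
-- and konst as scaling such a vector of ones by a scalar.
sumElementsViaDot :: Vector Double -> Double
sumElementsViaDot v = konst 1 (size v) <.> v

konstViaScale :: Double -> Int -> Vector Double
konstViaScale r n = scale r (konst 1 n)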

Mikolaj says: Actually, Dot as defined above wouldn't typecheck with my code from branch hTensor, because it operates on a "vector of deltas", while what we require for performance is "delta of vectors", that is, delta expression that represents the gradient of an operation on vector(s). Citing section 8.3: "another alternative might be to treat Vector R (and perhaps multi-dimensional versions) as new, primitive, differentiable data types alongside R".

Mikolaj says: After switching from

  | Dot (Data.Vector.Vector r) (Data.Vector.Vector (Delta r))

to

  | Dot (Data.Vector.Vector r) (Delta (Data.Vector.Vector r))

I've got the following to typecheck

(<.>!) :: DualDelta (Data.Vector.Vector r)
       -> DualDelta (Data.Vector.Vector r)
       -> DualDelta r
(<.>!) (D u u') (D v v') = D (u <.> v) (Add (Dot v u') (Dot u v'))

But I haven't changed the sketch at the top, because perhaps we want to finish defining the easier (?) "vector of deltas" first.

Mikolaj says: And with "delta of vectors", defining evalV is suspiciously easy (and the whole library typechecks fine). What remains is to decide where to put, in the array resulting from the evaluation of delta, the parameters of vector type (and of vector of vectors type, etc.):

https://github.com/Mikolaj/horde-ad/blob/77fb8dc5dc121bc8390bd59d0f3a3de55943f42e/src/HordeAd/Delta.hs#L70-L86
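For intuition only, here is a rough sketch (not the linked evalV code) of the contribution the new Dot constructor makes during evaluation: the incoming scalar cotangent is spread over the stored vector.

import Numeric.LinearAlgebra (Vector, scale)

-- Evaluating Dot v d with incoming cotangent dt contributes this vector
-- to the cotangent accumulated for the subterm d.
dotContribution :: Double -> Vector Double -> Vector Double
dotContribution dt v = scale dt v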

Mikolaj says: I think vectors of vectors of vectors are not a good idea. They are neither convenient (matrices are), nor fast (such general vectors can't be unboxed), nor really general (what about sums, tuples, lists, Traversable, other kinds of vectors, sparse matrices, etc.?). So I've settled on only scalars and unboxed vectors of scalars, which simplified things a lot. The bad outcome is that Dot on the level of vectors has to error out (but it can't be expressed by the user without abusing Unbox instances, so this is safe enough). I will work out implementation details, typecheck, test, benchmark and see if we missed anything.

If needed, instead of a parameterized Delta, we can have two copies of monomorphic Delta, one for scalars, one for vectors, which would permit omission of the nonsensical Dot. But that would also require two copies of the dual numbers type and of the Num and other operations on it, etc. Not worth it, I guess.
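Very roughly, the rejected alternative would look something like this (illustrative constructor names only, not a worked-out design), the cost being that every dual-number definition would then have to be duplicated as well:

import qualified Data.Vector.Storable as VS

-- A scalar-level and a vector-level Delta as two separate types,
-- so a vector-level Dot simply does not exist.
data DeltaS r
  = AddS (DeltaS r) (DeltaS r)
  | ScaleS r (DeltaS r)
  | DotS (VS.Vector r) (DeltaV r)   -- scalar result of a vector operation

data DeltaV r
  = AddV (DeltaV r) (DeltaV r)
  | ScaleV (VS.Vector r) (DeltaV r)
  | KonstV (DeltaS r) Int           -- vector result of a scalar operation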

Mikolaj says: I now think what I've written above

(<.>!) :: DualDelta (Data.Vector.Vector r)
       -> DualDelta (Data.Vector.Vector r)
       -> DualDelta r
(<.>!) (D u u') (D v v') = D (u <.> v) (Add (Dot v u') (Dot u v'))

is wrong, because types don't match. This would be fine for a derivative, but we need gradients and the result type of the gradient should be a vector, not a scalar, because the domain of the function is not scalar. And Dot produces scalars, so it can't work.

The u' component of the dual number should be a scalar infinitesimal, because that's what we get from the result of the function we differentiate, and we calculate by how much we should wiggle the vectors to perturb the scalar result by as much as what we got in u'. And the answer is some kind of multiplication of our input vector u by the scalar u', resulting in a vector. I don't think we have enough tools in Delta to do that multiplication yet. Let me point that out in the main text.

Mikolaj says: No, I think I panicked. We define forward derivatives, but then compute through them backwards. So it's fine that types of individual derivatives defined for operators, such as dot product, are not right for a gradient. Here's the text I removed from the derivation section, which I now hope is wrong:

However, this is a forward derivative and we need a gradient (the result type should be a vector, because the domain of dot product is not scalar, see discussion below). For the gradient, we need something else than Dot, some other kind of scaling that produces a vector (in the Delta component of the dual number) from a scalar (in the Delta realm) and a vector (in the constants/scalars realm). But Dot will be useful as well later on, at least to compute the sum of elements of a vector in the Delta realm, for the derivative of konst, if I'm not mistaken.

Mikolaj says: I've implemented two more vector operations needed for MNIST (now probably only matrix-vector multiplication remains, the one that implements application of a linear transformation represented by the matrix to the vector):

konst' :: Numeric r
       => DualDelta r -> Int -> DualDelta (Data.Vector.Storable.Vector r)
konst' (D u u') n = D (konst u n) (Konst u' n)

sumElements' :: Numeric r
             => DualDelta (Data.Vector.Storable.Vector r) -> DualDelta r
sumElements' (D u u') = D (sumElements u) (Dot (konst 1 (V.length u)) u')

These are quite likely wrong, because I didn't derive them in any principled way and also because I operate under the premise that the user defines forward derivatives (not gradients) in the second field of dual numbers, which may be contrived. To define the derivatives, I needed one more Delta constructor called Konst, which I'm appending to the top of the document together with its signature, for future work.

Now our functions take a pair --- an unboxed vector of scalar parameters and a normal vector of vector parameters. Similarly, the returned gradient is of this form. I haven't tested the code and I will probably need the matrix operation (and the engine extended to a third tuple component, a vector of matrices) to test it properly. Performance is down 10%, 5% of which may be recoverable (but I count on 30 times speedup once we are done extending Delta).
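A sketch of the parameter (and gradient) shape just described, with illustrative type names (not the repository's):

import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS

-- An unboxed (storable) vector of scalar parameters paired with
-- a boxed vector of vector parameters; the gradient has the same shape.
type Domain r   = (VS.Vector r, V.Vector (VS.Vector r))
type Gradient r = Domain r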

Mikolaj says: Preliminary tests of the delta-vector functionality: I've introduced a single dot product to one of the three layers of the MNIST neural network (the first and widest hidden layer, so most of the parameters went from the scalar component to the vector component), keeping everything else the same (I can't introduce more without adding matrix deltas or a Delta operation that transforms a vector of deltas into a delta of a vector). It runs, and a preliminary benchmark is 4 times faster.

Mikolaj says: I've implemented one more vector operation needed for MNIST and to do that I needed one more Delta constructor called Seq, which I'm appending to the top of the document together with its signature, for future work. Here's the operation:

deltaSeq :: Numeric r
         => Data.Vector.Vector (DualDelta r)
         -> DualDelta (Data.Vector.Storable.Vector r)
deltaSeq v = D (V.convert $ V.map (\(D u _) -> u) v)  -- I hope this fuses
               (Seq $ V.map (\(D _ u') -> u') v)

I didn't derive it in any principled way, but I've rewritten the fully connected MNIST neural net with it, avoiding even a single operation on scalars, and it trains well, so I think it's correct. It also puts me at ease that the premise that the user defines forward derivatives (and not gradients) in the second field of dual numbers is in fact correct.

Preliminary benchmarks of the fully vector-based MNIST neural network indicate it's 10 times faster than the same network operating exclusively on scalars (in fact, we can benchmark both, side by side, using the same engine and a common toolbox, or even construct the sum of the scalar-based and vector-based networks and train the amalgam together). Once we introduce matrix operations to Delta, I think we will land somewhere between 20 times faster (the backprop library result; we use the same hmatrix bindings to C libraries to make the comparison meaningful) and 30 times faster (a completely handwritten gradient). At the same time, we maintain competitive performance for scalars (the degradation on the worst examples (prod) is yet to be measured and perhaps mitigated), and once we get to matrices and gradients, they should be close to zero-cost when not used (e.g., when only vectors are used).

Mikolaj commented 2 years ago

And here are Simon's remarks:

Simon PJ. I'm very lost. You can write faster than I can read :-).

Here is what I understand.

  1. Answer part 1. Change the D[] translation on types, as shown in Section 8.3 of the paper, specifically

      D[Vector R] = Vector R * Vector Delta
      D[Vector A] = Vector D[A]     if A /= R

    That does not, by itself, imply a change in Delta, does it?

  2. Answer part 2. For each operation on vectors we need to write their D versions. E.g

    index_A = (Vector A * N) -> A

    (Remember we imagine having a type-indexed family of index operations, one for each type A.) So we have

    D[ index_R ] :: D[ (Vector R * N) -> R ]
               :: ((Vector R * Vector Delta) * N) -> M (R * Delta)

    So I guess we can define D[index_R] thus:

    D[index_R] ((arr, darr), i) = return (index arr i, index darr i)
  3. Answer part 3. Now we have to do build. We have

    D[ build_R ] :: D[ (N * (N -> R)) -> Vector R ]
               :: (N * (N -> M (R * Delta))) -> M (Vector R * Vector Delta)

    So I guess we define it like this

    D[ build_R ] (n, f)
    = ... hmm...

    We want to build two vectors at once, and do so monadically. Maybe we need a new primitive

    buildM2 :: (N * (N -> M (a,b))) -> M (Vector a, Vector b)
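    (For concreteness, a minimal sketch of such a buildM2, in curried form and using boxed vectors from the vector package; illustrative only, and whether it is efficient enough is a separate question.)

    import qualified Data.Vector as V

    -- Build the paired vector monadically, then split the vector of pairs.
    buildM2 :: Monad m => Int -> (Int -> m (a, b)) -> m (V.Vector a, V.Vector b)
    buildM2 n f = V.unzip <$> V.generateM n f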

Until these questions are settled, I can't begin to understand the implementation details.

Mikolaj commented 2 years ago
  • We already have an implementation that works with vectors. The purpose of this discussion is to make vectors more efficient.

Yes. However, we don't have an implementation that works on matrices, and one outcome of extending Delta, etc., will also be that a matrix implementation becomes possible. That should also pave the way to arbitrary tensors and frame the work on UX (which is currently rather lacking, even for scalars).

  • Question: what changes would we make to the paper (final POPL version here), to account for more-efficient vectors?

I will answer in detail in a later comment, but the crucial point is what I cited above from section 8.3: "another alternative might be to treat Vector R (and perhaps multi-dimensional versions) as new, primitive, differentiable data types alongside R". This is what lets us reduce allocation and consequently obtain a huge speedup in deriving gradients (compared to a minor speedup in gradient and value computation from being able to use native vector or matrix operations).

  • Question: is that enough to improve efficiency? I suspect not: we'll still generate a Delta with element-wise operations. Maybe we need to add new primitives for dot-product, vector addition, etc, and use them instead of build? Is build inherently inimical to efficiency here? (Sad if so; going from one primitive to dozens is unattractive.)

I think we need large scale primitive types, not primitive operations. The biggest overhead comes from storing 10^8 delta terms instead of a single Delta term with a vector or matrix (phantom) type. This is true even if the only operation we use in our program is, e.g., addition (working as well on scalars as on matrices and not requiring an extension of Delta). Sadly, once we add primitive types, we also require some operations to handle them, but there's hope we only need a few. Citing from the wiki document (please ignore the technical details):

Similarly, a vector of vectors is not the same as a matrix and, especially, a vector of deltas of vectors is terribly more costly than a single delta of a matrix. We have no guarantees that matrices are close to square, so the vector of vectors representation of a matrix may sometimes have O(n) elements (not O(sqrt(n))), in which case, assuming that a program consists of a constant number of matrix (of size n) operations, a vector of deltas of vectors using Dot incurs O(n) space and time overhead, instead of O(1) for a delta of a matrix using DotL.

simonpj commented 2 years ago

However, we don't have an implementation that works on matrices

But if we have an implementation that works on vectors then we have one that works on matrices -- just a vector of vectors. No? (It might not be very efficient.)

another alternative might be to treat Vector R (and perhaps multi-dimensional versions) as new, primitive, differentiable data types alongside R".

Aha. So we have two plans:

  1. Vector is parametric in its argument, so we have Vector Float, Vector (Vector Float) etc; but we have special type-translation rules for Vector R, as in 8.3.
  2. Vector-of-real is a new primitive type, call it VectorF, in addition to Vector a.

I think you are advocating for (2) but I don't think we have any evidence for (1) vs (2) yet. Let's work out the consequences of both.

For (2) we should set out:

Mikolaj commented 2 years ago

Aha. So we have two plans:

  1. Vector is parametric in its argument, so we have Vector Float, Vector (Vector Float) etc; but we have special type-translation rules for Vector R, as in 8.3.

  2. Vector-of-real is a new primitive type, call it VectorF, in addition to Vector a.

Yes, we do have these two plans. BTW, in plan 2 the vector doesn't have to be monomorphic. It's enough that its argument is a primitive enough type representing real numbers (e.g., Double, Float, CFloat, etc.). But for a paper we can fix a real number type, and that also simplifies the code a bit. Let us also fix the type to Float in this discussion. However, the point is, there is no performance gain at all from fixing Float --- the whole gain is from fixing Vector and making it a primitive that Delta is aware of.

However, we don't have an implementation that works on matrices

But if we have an implementation that works on vectors then we have one that works on matrices -- just a vector of vectors. No? (It might not be very efficient.)

Yes, we could do that in plan 1, but it's going to be less efficient. Moreover, it precludes plan 2, because we will need a primitive type MatrixF.

I think you are advocating for (2) but I don't think we have any evidence for (1) vs (2) yet. Let's work out the consequences of both.

For (2) we should set out:

The type translation for `VectorF`. Something like

    D[[VectorF]] = VectorF * ??

  What is ??. It could be `Vector Delta`.

That would defeat the purpose. We'd have a 10^8-element vector of deltas (which is an order of magnitude larger than a 10^8-element vector of Double).

Or maybe just Delta,

Yes, that's what I have in my dirty sketch. Delta has a type parameter and we have

  D[[VectorF]] = VectorF * Delta VectorF

if Delta has some new constructors? What new constructors, precisely?

Yes, it does, see the wiki page, but they are only required once you want to use some extra operations on vectors. The existing Delta r constructors work on Delta VectorF out of the box, e.g., the whole Num is at your disposal.
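To make "out of the box" concrete, here is a minimal sketch with stand-in types (not the repository code): the dual-number operations are written once, parametrically, so they instantiate at the scalar type and at VectorF alike.

-- Minimal stand-ins for the thread's types, for illustration only.
data Delta a = Zero | Scale a (Delta a) | Add (Delta a) (Delta a)
data DualDelta a = D a (Delta a)

-- Written once for any Num a, hence usable both at a = Float and at
-- a = VectorF (hmatrix gives storable vectors a Num instance).
addDual :: Num a => DualDelta a -> DualDelta a -> DualDelta a
addDual (D u u') (D v v') = D (u + v) (Add u' v')

multDual :: Num a => DualDelta a -> DualDelta a -> DualDelta a
multDual (D u u') (D v v') = D (u * v) (Add (Scale v u') (Scale u v'))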

The operations we have for VectorF. Presumably there are dozens and dozens of them: as well as dot product, zipWith, we might want rotate, shift, concatenate, interleave and lots more. Let's list them, if only to force ourselves to confront how many we need.

That's a rather mad idea, given that we want operations not only from Data.Vector.Generic, but also from hmatrix, vector-algorithms and more. Most of them don't require Delta extensions, because they are defined in terms of other operations. Some of them can be implemented efficiently enough once we have Delta constructors translating both ways between Vector (Delta) and Delta (Vector). Some will require genuine Delta extensions. For MNIST done with vectors, I needed 3 such extensions (2 would suffice, but would be a bit slower; not benchmarked yet). For MNIST via arrays I needed 2 more.

Answering to the previous comment:

1. Answer part 1.  Change the D[] translation on types, as shown in Section 8.3 of the paper, specifically
   ```
     D[Vector R] = Vector R * Vector Delta
     D[Vector A] = Vector D[A]     if A /= R
   ``` 

   That does not, by itself, imply a change in `Delta`, does it?

Yes, I think so. However, Plan 1 is hard for other reasons. Doing Vector R * Vector Delta is, in general, very hard in a way that realises the promised efficiency gain. When I implemented plan 1, in most of the places I only had Vector (R * Delta), which is inefficient, but easy. (However, in one place where I had Vector R * Vector Delta it increased performance dramatically).

2. Answer part 2.  For each operation on vectors we need to write their D versions.  E.g
index_A = (Vector A * N) -> A

Yes, though I didn't need index yet --- for MNIST it wasn't necessary, nor was build.

I'm quite lost in the technical details of implementing index and build and I can't think straight without a typechecker, so let me revisit this once we have a significant test code that uses index and/or build (though perhaps Seq from the wiki is build or close?). Also, once we settle on Plan 1 or Plan 2 it may be easier to focus on details.

Mikolaj commented 2 years ago

In summary, I'd argue for

Mikolaj commented 2 years ago

@simonpj, you may be right that vectors of vectors ought to be enough for anybody. I've hit a performance wall with my Delta MatrixF implementation of MNIST. Although it allocates an order of magnitude fewer delta expressions than a Delta VectorF implementation and moves a much higher percentage of allocations and computations over to the C library BLAS, resulting in almost 3 times lower average heap size and GC copied bytes, it is significantly less performant than a Delta VectorF implementation. The cause is allocating ephemeral matrices, but the technical details are not particularly important (they are explained in the code snippet below). The point is, the number of allocated delta expressions is not always crucial, and the representation forced by Matrix sometimes (often?) makes the much less numerous delta expressions larger in total, due to them carrying inside matrix coefficients fully computed from sparse data.

https://github.com/Mikolaj/horde-ad/blob/db66d2c4fe6f238a9f83eafe912b5067c7ebf88f/src/HordeAd/Delta.hs#L99-L106

So, perhaps we should do Delta VectorF, Delta (Vector VectorF), Delta (Vector x) and just define the matrix API (and tensors) on top of that. This complicates other parts of the implementation, but I hope it would simplify delta expression syntax and certainly make it more regular. I've just moved to a strict boxed vector implementation, so one of the reasons to move to Matrix (and to move to Vector R * Vector Delta) is gone. I also suspect hmatrix uses BLAS vectors, not matrices (if there are any and if there is any distinction), so perhaps hmatrix Matrix operations don't give us any speedup over using equivalent sequences of vector operations. There is also the package massiv, and also some raw bindings to BLAS, and there may be other C libraries, so we have options and we can reconsider the primitives in the future. I'm insisting on hmatrix for the initial round, to meaningfully benchmark against backprop and the manual gradient expressed in hmatrix.

P.S. I looked up LAPACK, which was assigned 20% of the runtime in the profiling reports, and it's a library called by BLAS. It's written in FORTRAN.

P.P.S. Some dirty optimizations brought matrix-based MNIST down to the vector-based level

https://github.com/Mikolaj/horde-ad/blob/cb7b1842dfb4d63e66679ee02de7422c2db852ad/src/HordeAd/Delta.hs#L103-L119

so perhaps on larger and matrix-oriented examples (e.g., convolutional nns; fully-connected nns are really computationally vector-oriented, even if more concisely expressed with matrices) it would prove faster than vector-based approaches, which may justify keeping the separate Matrix primitive in the engine, as opposed to defining user-accessible matrices using vectors only.

simonpj commented 2 years ago

Wow you work so fast!

But I'm still completely lost. Could you write down:

  1. What new data type(s) you have (beyond Float, pairs, arrow, etc Fig 1). I think you just have Vector A
  2. What operations you offer over that type. index? build? zipWith? ...
  3. What type translation you are using (Fig 9).
  4. What Delta looks like.
  5. What are the derivatives of the operations in (2)
  6. What eval looks like

Simply writing these things down, together, in one place, would be super helpful. If you are working with several variants, you can always copy/paste to give N blocks of answers; or give a main block plus diffs.

Mikolaj commented 2 years ago

What new data type(s) you have

Just a reminder that I don't implement the surface language nor the translation. I implement combinators in the codomain of the translation (which is Haskell) that the surface language constructs would map to (each syntactic combinator maps to a semantic combinator; let's gloss over terms with free variables), and then I compose the combinators in Haskell to get programs. Moreover, the user of the library is expected to do just that and, in particular, implement her own extra combinators (operations on dual numbers), which includes defining derivatives.

Having said that, yes, I can explain the design in terms of the hypothetical surface language. Let me do that.

  1. What new data type(s) you have (beyond Float, pairs, arrow, etc Fig 1). I think you just have Vector A

Yes, let's focus on the case where we only have Vector (no Matrix), as in section 8.3, Plan 1a together with Plan 2. So, this is both Vector A, where A is any type of the surface language, and VectorF, which is a primitive type that can only hold values of type Float.

  2. What operations you offer over that type. index? build? zipWith? ...

For now, I offer mostly whatever is necessary for MNIST (and the README example solicited by @awf, see https://github.com/Mikolaj/mostly-harmless/discussions/9), because that's what I can test and/or get reviewed. There are the following operations on Vector A (not all methods of these are necessary nor implemented, but they could be): Num, Fractional, Floating, RealFloat, dot product <.>, konst (a vector of a given length with one scalar repeated), sumElements (of a vector), deltaSeq :: Vector Float -> VectorF (that's the signature of the syntactic combinator in the surface language; the signature of the semantic combinator in the target language is given 2 screens above). These operations I had to define myself (either at the level of scalars, in which case they were automatically lifted to vectors elementwise, or at the level of vectors explicitly).

In addition, I use in MNIST some operations I get for free from Haskell package vector: Data.Vector.generate and Data.Vector.length. The signature in the surface language of the former would be generate :: Int -> (Int -> a) -> Vector a.

That's it. I don't have vector indexing nor build (though, of course, I could have --- but I couldn't test them, so I would be deluding myself).

3. What type translation you are using (Fig 9).

Vector A goes to Data.Vector.Vector applied to the translation of A. VectorF goes to the dual numbers type where the primal component is Data.Vector.Storable.Vector Float, where Data.Vector.Storable.Vector is a type constructor of unboxed vectors (abbreviated to Vector in the code, not to be confused with the boxed vector type constructor Data.Vector.Vector). You would denote the translated dual numbers type as Vector Float * Delta in the paper.

4. What Delta looks like.

In the following, please substitute VectorF for Vector r, then Float for r. Type variable a is either Float or VectorF, because these are the only two primitive differentiable types we have (future work: allow nested vectors there?).

https://github.com/Mikolaj/horde-ad/blob/0521a69caff8e4da302c9710a6f66b2295e8dab4/src/HordeAd/Delta.hs#L36-L44
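For readers without the link at hand, here is a hedged reconstruction assembled from the constructors mentioned in this thread, with those substitutions already performed (Zero, Scale, Add and Var are assumed from the paper; DeltaId is a placeholder; the linked file is authoritative):

{-# LANGUAGE GADTs #-}
import qualified Data.Vector
import Data.Vector.Storable (Vector)

type DeltaId = Int  -- placeholder for the variable-numbering scheme

data Delta a where
  Zero  :: Delta a
  Scale :: a -> Delta a -> Delta a
  Add   :: Delta a -> Delta a -> Delta a
  Var   :: DeltaId -> Delta a
  Dot   :: Vector Float -> Delta (Vector Float) -> Delta Float
  Konst :: Delta Float -> Int -> Delta (Vector Float)
  Seq   :: Data.Vector.Vector (Delta Float) -> Delta (Vector Float)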

5. What are the derivatives of the operations in (2)

Here are the derivatives of all the operations I had to write myself (and then some)

https://github.com/Mikolaj/horde-ad/blob/master/src/HordeAd/DualNumber.hs

And for Data.Vector.generate, the derivative is elementwise and we get it for free; Data.Vector.length doesn't have a derivative (integer codomain).

6. What eval looks like

https://github.com/Mikolaj/horde-ad/blob/0521a69caff8e4da302c9710a6f66b2295e8dab4/src/HordeAd/Delta.hs#L103-L127

Please disregard the operations *L, which are for MatrixF, which we may choose not to implement (as a primitive type). Also please disregard the errors, which would disappear if the r was replaced by Float in the definition of Delta (the cause of the errors is that in the Delta type definition we conflate parameterization by the choice of Float/Double/CFloat/etc. and parameterization by the computation level (scalar, vector, matrix); if we agree on the design and direction, and both parametricity kinds stay in, I'm going to ask @goldfirere to have a look, untangle this mess and make the GADT live up to its promise by automatically ruling out these patterns as redundant).

P.S. We should quickly sort out the confusion of Vector in the surface language being translated to Data.Vector.Vector in the target language and not to Vector (unboxed, storable vectors), to which VectorF in the surface language is translated in turn. Perhaps we should start by deciding the final name for VectorF in the surface language, and whether it should be parameterized and nested (this is a messy argument, out of scope of this preliminary discussion, though) or just a fixed instance for Float, in which case we don't care that it doesn't match the library, which obviously also needs Double and could also use nesting if we decide against a primitive MatrixF.

Mikolaj commented 2 years ago

@simonpj: does that make sense or is it still not readable? Would it help if I updated the wiki page with a snapshot of the current 1-vector operations, with all the substitutions of Float for r, etc., already performed and whatever else from the answers to your 6 questions I can cheaply snapshot? Or is there a different communication problem?

@fkrawiec, @neel-krishnaswami: any feedback so far? meta-feedback?

Mikolaj commented 1 year ago

This is sooo outdated at this point that I'm going to remove the wiki pages and just close. Thanks a lot for the discussion!