JuliaDiff / ForwardDiff.jl

Forward Mode Automatic Differentiation for Julia

Finalizing basics of forward mode with operator overloading and of package structuring #4

Closed papamarkou closed 11 years ago

papamarkou commented 11 years ago

I opened this issue so that we can conclude the ongoing discussion on a possible partial rewrite of the forward AD code.

  1. Tom, I understand your last message re the gradient. My problem with the output is that the gradient doesn't need to be an nxn square matrix; all we need is its diagonal. This is more consistent with the mathematical definition of the gradient, and it takes less space to store (n numbers instead of n^2, since the matrix is diagonal).
  2. Are we dealing with f:R^n->R smooth functions only or with the more general case f:R^n->R^m?
  3. Kevin, as long as we won't need a separate type for reverse mode, Dual is not a bad choice. autodiff() is my favourite name for our AD functions; nevertheless, we need to address the issue of having several AD modes, such as forward, reverse or hybrid. This could be done by using separate function names, which we would need to devise, or by having a function argument mode in autodiff(), i.e. autodiff(x, ..., mode="forward"), so as to choose the mode of differentiation. To generalize further, we may need another function argument, say method, to choose between operator overloading (oo) and source code transformation (sct), for example autodiff(x, ..., mode="forward", method="oo").
  4. If the n parameter of the ADForward type does not yield any performance improvement, I suggest we remove it.
  5. Kevin did a very tidy job on the modularization so far. We could do a tiny bit more structuring and file naming once we address the previous questions.
papamarkou commented 11 years ago

It is slowly taking shape in my mind. Here is how we could restructure the code. I suggest we abide by the performance tip in Julia's manual according to which we ought to break functions into multiple definitions.

The minimal building block of automatic differentiation is the type of a Dual number, whose first element is the value of the function and whose second element is the derivative rather than the gradient. This, among other things, validates the choice of Dual as the name for our type.

We can then proceed by defining our algebra on Dual. This apparently suffices for univariate functions. At the same time, it allows partial differentiation of multivariable functions f:R^n->R: if (x_1, ..., x_n) is an element of f's domain, we only need to define one more member inside Dual, namely the index i. The partial derivative f_i := df/dx_i is then computed using Dual arithmetic. In the univariate case, i defaults to 1.

The gradient will then be calculated by a function wrapper that invokes the more elementary function computing the partial derivative f_i a total of n times, for i = 1, ..., n.
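
A minimal sketch of this building block and wrapper, in present-day Julia syntax (all names here are hypothetical illustrations, not the package API; the seeding index is passed in rather than stored inside the type):

struct Dual{T<:Real}
    value::T   # value of the function
    deriv::T   # derivative with respect to the seeded coordinate
end

# Dual arithmetic via the chain rule, for a few representative operations
Base.:+(a::Dual, b::Dual) = Dual(a.value + b.value, a.deriv + b.deriv)
Base.:*(a::Dual, b::Dual) = Dual(a.value * b.value, a.deriv * b.value + a.value * b.deriv)
Base.sin(a::Dual) = Dual(sin(a.value), cos(a.value) * a.deriv)

# Partial derivative df/dx_i: seed coordinate i with derivative 1, the others with 0
function partial(f, x::Vector{T}, i::Integer) where {T<:Real}
    duals = [Dual(x[j], j == i ? one(T) : zero(T)) for j in eachindex(x)]
    return f(duals).deriv
end

# The gradient wrapper: n calls to the elementary partial-derivative function
grad(f, x::Vector) = [partial(f, x, i) for i in eachindex(x)]

# e.g. grad(x -> sin(x[1]) + x[1] * x[2], [4.0, 5.0]) ≈ [cos(4.0) + 5.0, 4.0]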

As for the mode and method arguments of autodiff(), we can apply the same principle: autodiff() will wrap all the other, more elementary functions for forward or reverse mode and for oo or sct, using control flow.
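
For instance, a hedged sketch of such a wrapper (forward_oo, reverse_oo and the other elementary routines are hypothetical placeholders):

function autodiff(f, x; mode::String = "forward", method::String = "oo")
    if mode == "forward" && method == "oo"
        return forward_oo(f, x)   # hypothetical elementary routine
    elseif mode == "reverse" && method == "oo"
        return reverse_oo(f, x)   # hypothetical elementary routine
    else
        error("unsupported combination: mode=$mode, method=$method")
    end
end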

tshort commented 11 years ago

I'd like to support the more general case of f:R^n->R^m. In this more general case, I think that the gradient is not just the diagonal. Here's an example:

julia> a = ad([4., 5.])
2-element ADForward{Float64,2} Array:
 ADForward(4.0,[1.0, 0.0])
 ADForward(5.0,[0.0, 1.0])

julia> b = similar(a)
2-element ADForward{Float64,2} Array:
 #undef
 #undef

julia> b[1] = sin(a[1]) + exp(a[2])
ADForward(147.65635660726866,[-0.653644, 148.413])

julia> b[2] = cos(a[1])
ADForward(-0.6536436208636119,[0.756802, 0.0])

julia> b
2-element ADForward{Float64,2} Array:
 ADForward(147.65635660726866,[-0.653644, 148.413])
     ADForward(-0.6536436208636119,[0.756802, 0.0])

julia> gradient(b)
2x2 Float64 Array:
 -0.653644  148.413
  0.756802    0.0  

This case complicates the API. Storage-wise, it'd be nice to just store the gradient matrix, but I haven't come up with a way to specify that.

papamarkou commented 11 years ago

Hi Tom, this is true: the gradient is defined for multivariable functions f:R^n->R^m too, and I agree that we should support them. It is really a matter of jargon; the matrix of first derivatives of f:R^n->R^m is an mxn matrix conventionally called the Jacobian matrix, while the term gradient is more commonly reserved for the n-dimensional vector arising in the case of f:R^n->R. My view is that if we start from the minimal building block of dual numbers, as described above, we can then derive gradients, Jacobians, Hessians and third-order derivatives. For instance, starting with dual numbers, we can run two nested loops to iterate over the rows and columns of the Jacobian J, calculate each of the mxn dual partials df_i/dx_j, and store it in the (i,j)-th position of J. I am not sure whether this "devectorization" speeds up or slows down the Julia runtime compared to more "vectorized" software engineering. All I know at this stage is that it seems a more natural approach to start by defining the algebra on dual numbers and to build on it to define gradients as Float64 vectors, Jacobians as matrices, and so on. We could try the approach I propose and compare it to the existing implementation, if you agree.
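
For concreteness, a hedged sketch of that nested-loop construction, reusing the hypothetical Dual type and seeding from the sketch earlier in the thread; J is laid out mxn, with rows indexing outputs, to match the 2x2 matrix printed in Tom's example:

# Two further overloads in the style of the earlier sketch, for the example below
Base.exp(a::Dual) = Dual(exp(a.value), exp(a.value) * a.deriv)
Base.cos(a::Dual) = Dual(cos(a.value), -sin(a.value) * a.deriv)

function jacobian(f, x::Vector{T}) where {T<:Real}
    m = length(f(x))                  # number of output ordinates f_1, ..., f_m
    J = Matrix{T}(undef, m, length(x))
    for j in eachindex(x)             # seed one input coordinate at a time
        duals = [Dual(x[k], k == j ? one(T) : zero(T)) for k in eachindex(x)]
        fx = f(duals)
        for i in 1:m                  # store the partial df_i/dx_j at position (i, j)
            J[i, j] = fx[i].deriv
        end
    end
    return J
end

# e.g. jacobian(x -> [sin(x[1]) + exp(x[2]), cos(x[1])], [4.0, 5.0])
# reproduces the 2x2 matrix from Tom's example above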

tshort commented 11 years ago

Sounds good, Theo.


papamarkou commented 11 years ago

Great, Tom, I will do this just to compare the two approaches in terms of speed (thanks, by the way, for the tests that you and Kevin wrote; they will serve as good benchmarks).

tshort commented 11 years ago

The existing examples are mainly just to try out the API. We probably need something more complex for benchmarks.


kmsquire commented 11 years ago

Okay, just catching up on this discussion. Some additional thoughts on Theo's original items:

  3. Kevin, as long as we won't need a separate type for reverse mode, Dual is not a bad choice. autodiff() is my favourite name for our AD functions; nevertheless, we need to address the issue of having several AD modes, such as forward, reverse or hybrid. This could be done by using separate function names, which we would need to devise, or by having a function argument mode in autodiff(), i.e. autodiff(x, ..., mode="forward"), so as to choose the mode of differentiation. To generalize further, we may need another function argument, say method, to choose between operator overloading (oo) and source code transformation (sct), for example autodiff(x, ..., mode="forward", method="oo").

Regarding Dual as a name: since a dual number is a mathematical entity with a specific definition, I'd prefer not to stray too far from it, and the current implementation of ADForward doesn't really fit, since we are carrying forward one full row of the Jacobian for each element (or, equivalently, we need to carry forward the whole Jacobian).

I'll also note that, even for f:R^n->R, we need the full Jacobian for anything beyond simple functions, as the initial elements may be combined in arbitrary ways before the end of the computation.

The problem with actually using dual numbers is that you need to keep track of the meaning of the derivative/epsilon term. If we only want the partial derivative with respect to, say, x_1, there's no problem: the derivative always means dx_1, and the derivative of

[x_1, x_2, x_3, ..., x_n]'

with respect to x_1 is obviously

[1, 0, 0 ..., 0]'

However, the current code is more general and, as was pointed out by you and Tom, produces the full Jacobian, so in ADForward, the dx term means [dx_1, dx_2, ... , dx_n].

(My first thought on reading the Wikipedia pages on dual numbers and automatic differentiation was to try just this, but I quickly realized that I could only take one partial derivative at a time.)
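
For example, reusing the scalar-epsilon Dual from the hypothetical sketch earlier in the thread, each evaluation yields exactly one partial derivative:

f(x1, x2) = x1 * x2

f(Dual(4.0, 1.0), Dual(5.0, 0.0)).deriv   # df/dx_1 = 5.0 (x_1 seeded)
f(Dual(4.0, 0.0), Dual(5.0, 1.0)).deriv   # df/dx_2 = 4.0 requires a second evaluation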

  4. If the n parameter of the ADForward type does not yield any performance improvement, I suggest we remove it.

Let's test it and see. ;-)

  5. Kevin did a very tidy job on the modularization so far. We could do a tiny bit more structuring and file naming once we address the previous questions.

Agree that it could still use some restructuring.

papamarkou commented 11 years ago

It is very helpful and pleasing that we share thoughts. So, yes, if we want to be (mathematically) strict, we are not using dual numbers (I used the term loosely, which may have caused confusion). What I meant is that we can define our algebra on ordered pairs (f_i, df_i/dx_j), where the first element is the i-th ordinate of the value of a function f:R^n->R^m and df_i/dx_j is the derivative of the i-th ordinate of f with respect to the j-th ordinate of (x_1, ..., x_n) in the domain D(f). In other words, and to make the connection with my previous comments, the second member of the ordered pair is the (i,j)-th element of the Jacobian, not the full Jacobian. So I agree that we take one partial derivative at a time this way. I guess you stress this, Kevin, because this approach apparently implies storing the value of f multiple times in the first members of the pairs, which is inefficient. If that's your point, I agree. I found Griewank's book today; let me see what his approach to forward AD is, and I will get back to you.

kmsquire commented 11 years ago

Hi Theo, thanks for the clarification. I hadn't read your previous notes closely enough to understand that that was your proposal--sorry about that. But if I had understood better, my response probably would have been that it wasn't space efficient. ;-)

However, thinking about it more, it may be worthwhile to explore that representation. In particular, storing numbers that way would allow them to be treated as complex numbers, which, with appropriate scaling, might allow the use of complex BLAS functions to calculate the derivative directly; see http://dl.acm.org/citation.cfm?id=838251 for an example of this.

This actually could address a concern that I had with the operator overloading approach: for functions written entirely within the language, it works well, but as soon as you want to call out to an external library (such as BLAS), you're out of luck (or at least you have more work to do).

Cheers, Kevin


papamarkou commented 11 years ago

Hi Kevin, this is a very interesting article. Nevertheless, after reading it I am rather convinced that my suggested approach is inefficient. The paper suggests two different approaches. The first makes use of the Cauchy-Riemann equations and involves a truncation error, whereas I think we prefer AD that is exact (to the computer's precision). The second approach uses a Taylor series expansion on the complex plane, which is equivalent to the forward mode of AD. This so-called complex-step method requires one additional operation in comparison to AD and is therefore less efficient in general.
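
For reference, the complex-step method amounts to the following sketch: for real-analytic f, f'(x) ≈ imag(f(x + im*h))/h, and since there is no subtractive cancellation, h can be taken extremely small:

# Complex-step derivative of a univariate real function
complex_step(f, x::Real; h = 1e-20) = imag(f(x + im * h)) / h

complex_step(x -> sin(x) + exp(x), 2.0)   # ≈ cos(2.0) + exp(2.0)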

papamarkou commented 11 years ago

Kevin, Tom, I wanted to thank you both for the code improvements you made. Give me a couple more days as I am still thinking over the API and I will get back to you.

papamarkou commented 11 years ago

As you will see, I created the Dual type in a generic way. It can be seen as an alternative to complex numbers (e^2 = 0 instead of i^2 = -1) and can be used for performing dual algebra independently of the AutoDiff package. It can be used directly for forward AD on univariate functions without needing to create any AD-specific types.

I will carry on with this complex step differentiation implementation by defining one more type soon, so as to extend AD to multivariate functions.
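
A brief usage sketch in the style of such a DualNumbers package (the accessor names realpart and dualpart are assumptions and may differ from the actual API):

using DualNumbers

d = sin(Dual(2.0, 1.0))   # seed the dual (epsilon) part with 1
realpart(d)               # sin(2.0), the value
dualpart(d)               # cos(2.0), the derivative: e^2 = 0 gives sin(x + e) = sin(x) + cos(x)*e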

papamarkou commented 11 years ago

P.S. The Dual type is better off as a separate DualNumbers package, due to its general applicability in implementing dual algebra (the package name was suggested by Stefan). We can then invoke it via using DualNumbers. As for the additional dual type that I need to define specifically for performing complex-step differentiation on multivariate functions, it is better to make it part of our AutoDiff package.