Closed papamarkou closed 11 years ago
It is slowly taking shape in my mind. Here is how we could restructure the code. I suggest we abide by the performance tip in Julia's manual, according to which we ought to split functions into multiple definitions.
The minimal building block of automatic differentiation is the `Dual` number, whose first element is the value of the function and whose second element is the derivative rather than the gradient. This, among other things, validates the choice of `Dual` as a number for our type.
We can then proceed by defining our algebra based on `Dual`. This apparently suffices for univariate functions. At the same time, it allows performing partial differentiation on multivariable functions f:R^n->R. So if (x_1, ..., x_n) is an element of f's domain, we only need to define one more member inside `Dual`, which will be the index i. Then the partial derivative f_i := df/dx_i is computed using `Dual` arithmetic. In the univariate case, i defaults to 1.
The gradient will then be calculated by a function wrapper that invokes the more elementary function which calculates the partial derivative f_i a total of n times for i=1, ..., n.
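A minimal sketch of this idea in current Julia syntax (the names `Dual`, `partial` and `gradient` here are placeholders for illustration, not the package's actual API): dual arithmetic propagates one derivative component, and the gradient wrapper calls the partial-derivative routine n times.

```julia
# A dual number carries a value and a single derivative component.
struct Dual{T<:Real} <: Number
    val::T   # f(x)
    der::T   # df/dx_i for the seeded index i
end

# Core algebra: the derivative component follows the usual rules.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Partial derivative df/dx_i: seed the i-th coordinate with derivative 1.
function partial(f, x::Vector{T}, i::Int) where {T<:Real}
    duals = [Dual(x[j], j == i ? one(T) : zero(T)) for j in eachindex(x)]
    return f(duals).der
end

# Gradient wrapper: n calls to partial, one per coordinate.
gradient(f, x) = [partial(f, x, i) for i in eachindex(x)]
```

For example, `gradient(v -> sin(v[1]) * v[2], [0.0, 2.0])` yields `[2.0, 0.0]`, since d/dx1 of sin(x1)*x2 at (0, 2) is 2*cos(0) = 2 and d/dx2 is sin(0) = 0.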
As for the `mode` and `method` arguments of `autodiff()`, we can apply the same principle; `autodiff()` will wrap all the other more elementary functions for forward or reverse mode, oo or sct, using control flow.
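A hypothetical sketch of that dispatching wrapper. The keyword argument names and their string values are the ones suggested above, but everything else is an illustrative assumption; in particular, `forward_oo` stands in for the real operator-overloading forward-mode routine and is implemented here with a central difference purely so the control flow is runnable.

```julia
# Stand-in for the real forward-mode routine (central difference here,
# only so the dispatch below can be exercised).
forward_oo(f, x::Real) = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6

# Wrapper selecting the differentiation mode and method via control flow.
function autodiff(f, x; mode::String = "forward", method::String = "oo")
    if mode == "forward" && method == "oo"
        return forward_oo(f, x)
    elseif mode == "reverse"
        error("reverse mode not implemented yet")
    elseif method == "sct"
        error("source code transformation not implemented yet")
    else
        error("unknown mode/method combination: $mode/$method")
    end
end
```

A call such as `autodiff(sin, 0.0)` then routes to the forward/oo branch and returns an approximation of cos(0) = 1.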
I'd like to support the more general case of f:R^n->R^m. In this more general case, I think that the gradient is not just the diagonal. Here's an example:
```julia
julia> a = ad([4., 5.])
2-element ADForward{Float64,2} Array:
 ADForward(4.0,[1.0, 0.0])
 ADForward(5.0,[0.0, 1.0])

julia> b = similar(a)
2-element ADForward{Float64,2} Array:
 #undef
 #undef

julia> b[1] = sin(a[1]) + exp(a[2])
ADForward(147.65635660726866,[-0.653644, 148.413])

julia> b[2] = cos(a[1])
ADForward(-0.6536436208636119,[0.756802, 0.0])

julia> b
2-element ADForward{Float64,2} Array:
 ADForward(147.65635660726866,[-0.653644, 148.413])
 ADForward(-0.6536436208636119,[0.756802, 0.0])

julia> gradient(b)
2x2 Float64 Array:
 -0.653644  148.413
  0.756802    0.0
```
This case complicates the API. Storage-wise, it'd be nice to just store the gradient matrix, but I haven't come up with a way to specify that.
Hi Tom, this is true, the derivative is defined for multivariable functions f:R^n->R^m too, and I agree that we should support them. It is really a matter of jargon; the derivative of f:R^n->R^m is an m x n matrix conventionally called the Jacobian matrix, whereas the term gradient is more commonly reserved for the n-dimensional vector in the case of f:R^n->R.
My view is that if we start from the minimal building block of dual numbers, as described above, we can then derive gradients, Jacobians, Hessians and third-order derivatives. For instance, if we start with dual numbers, we can have two nested loops that iterate over the rows and columns of the Jacobian J, calculate each of the m x n duals df_i/dx_j and store it in the (i,j)-th position of J. I am not sure whether this "devectorization" speeds up or slows down the Julia runtime, when compared to more "vectorized" software engineering. All I know at this stage is that it seems a more natural approach to start by defining the algebra on dual numbers and to build on it to define gradients as `Float64` vectors, Jacobians as matrices etc. We could try the approach I propose and compare it to the existing implementation, if you agree.
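The nested-loop ("devectorized") Jacobian construction described above could be sketched like this. Here `partial_fd` stands in for the dual-arithmetic partial-derivative routine (it uses a central difference only so the loop structure is self-contained and runnable); the loop itself fills one (i,j) entry per evaluation.

```julia
# Stand-in for a dual-based partial derivative df/dx_j of a scalar-valued f.
function partial_fd(f, x::Vector{Float64}, j::Int)
    h = 1e-6
    xp = copy(x); xp[j] += h
    xm = copy(x); xm[j] -= h
    return (f(xp) - f(xm)) / (2h)
end

# J[i, j] = df_i/dx_j: two nested loops, one partial derivative per entry.
function jacobian(f, x::Vector{Float64}, m::Int)
    n = length(x)
    J = Array{Float64}(undef, m, n)
    for i in 1:m, j in 1:n
        J[i, j] = partial_fd(v -> f(v)[i], x, j)
    end
    return J
end
```

With `f(v) = [sin(v[1]) + exp(v[2]), cos(v[1])]` and `x = [4.0, 5.0]`, `jacobian(f, x, 2)` approximately reproduces the 2x2 matrix from the REPL session earlier in this thread.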
Sounds good, Theo.
Great Tom, I will do this just to compare the two approaches in terms of speed (thanks, by the way, for the tests that you and Kevin wrote; they will serve as good benchmarks).
The existing examples are mainly just to try out the API. We probably need something more complex for benchmarks.
Okay, just catching up on this discussion. Some additional thoughts on Theo's original items:
- Kevin, as long as we won't need a separate type for reverse mode, `Dual` is not such a bad choice. `autodiff()` is my favourite name for our AD functions; nevertheless, we need to address the issue of having several AD modes, such as forward, reverse or hybrid. This could be done by using separate function names, which we need to devise, or by having a function argument `mode` in `autodiff()`, i.e. `autodiff(x, ..., mode="forward")`, so as to choose the mode of differentiation. To generalize further, we may need another function argument, say `method`, to choose between operator overloading (oo) and source code transformation (sct), for example `autodiff(x, ..., mode="forward", method="oo")`.
Regarding `Dual` as a name: since a dual number is a mathematical entity with a specific definition, I'd like not to stray too far from it, and the current implementation of `ADForward` doesn't really fit, since we are carrying forward one column of the Jacobian for each element (or, equivalently, we need to carry forward the Jacobian).
I'll also note that, even for f:R^n->R, we need the full Jacobian for anything beyond simple functions, as the initial elements may be combined in arbitrary ways before the end of the computation.
The problem with actually using dual numbers is that you need to keep track of the meaning of the derivative/epsilon term. If we only want the partial derivative with respect to, say, `x_1`, there's no problem: the derivative always means `dx_1`, and the derivative of `[x_1, x_2, x_3, ..., x_n]'` with respect to `x_1` is obviously `[1, 0, 0, ..., 0]'`. However, the current code is more general and, as was pointed out by you and Tom, produces the full Jacobian, so in `ADForward`, the `dx` term means `[dx_1, dx_2, ..., dx_n]`.
(My first thought on reading the Wikipedia pages on dual numbers and automatic differentiation was to try just this, but I quickly realized that I could only take one partial derivative at a time.)
- If the `n` parameter of the `ADForward` type does not exhibit any improvement in performance, I suggest we remove it.
Let's test it and see. ;-)
- Kevin did a very tidy job on modularization so far. We could do a tiny bit more of structuring and file naming once we address the previous questions.
Agree that it could still use some restructuring.
It is very helpful and pleasing that we share thoughts. So, yes, if we want to be (mathematically) strict, we are not using dual numbers (I used the term loosely, which may have brought confusion). What I meant is that we can define our algebra on ordered pairs (f_i, df_i/dx_j), where the first element is the i-th ordinate of the value of a function f:R^n->R^m and df_i/dx_j is the derivative of the i-th ordinate of f with respect to the j-th ordinate of (x_1, ..., x_n) in D(f). In other words, and to make the connection to my previous comments, the second member of the ordered pair is the (i, j)-th element of the Jacobian, not the full Jacobian. So I agree that we take one partial derivative at a time this way. I guess you stress this, Kevin, because this approach apparently implies storing the value of f multiple times in the first members of the pairs, which is inefficient. If that's your point, I agree. I found Griewank's book today; let me see what his approach to forward AD is and I will get back to you.
Hi Theo, thanks for the clarification. I hadn't read your previous notes closely enough to understand that that was your proposal--sorry about that. But if I had understood better, my response probably would have been that it wasn't space efficient. ;-)
However, thinking about it more, it may be worthwhile to explore that representation. In particular, storing numbers that way would allow them to be treated as complex numbers, which, with appropriate scaling, might allow the use of complex BLAS functions to calculate the derivative directly; see http://dl.acm.org/citation.cfm?id=838251 for an example of this.
This actually could address a concern that I had with the operator overloading approach: for functions written entirely within the language, it works well, but as soon as you want to call out to an external library (such as BLAS), you're out of luck (or at least you have more work to do).
Cheers, Kevin
Hi Kevin, this is a very interesting article. Nevertheless, after reading it I am rather convinced that my suggested approach is inefficient. They suggest two different approaches in the paper. The first one makes use of the Cauchy-Riemann equations and involves a truncation error, whereas I think we prefer exact (to the computer's precision) AD. The second approach uses a Taylor series expansion on the complex plane, which is equivalent to the forward mode of AD. This so-called complex-step method requires one additional operation in comparison to AD, and it is therefore less efficient in general.
Kevin, Tom, I wanted to thank you both for the code improvements you made. Give me a couple more days as I am still thinking over the API and I will get back to you.
As you will see, I created the `Dual` type in a generic way. It can be seen as an alternative to complex numbers (ε^2 = 0 instead of i^2 = -1) and can be used for performing dual algebra independently of the AutoDiff package. It can be used directly for performing forward AD on univariate functions without needing to create any AD-specific types.
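The defining relation ε^2 = 0 gives dual multiplication the form (a + bε)(c + dε) = ac + (ad + bc)ε, which is exactly the product rule. A minimal sketch of just this rule (the name `DualNum` is a placeholder for illustration, not the actual `Dual` type's API):

```julia
# A dual number a + b*epsilon, where epsilon^2 = 0 (cf. i^2 = -1 for
# complex numbers).
struct DualNum
    re::Float64  # "real" part a
    ep::Float64  # epsilon part b
end

# (a + b eps)(c + d eps) = ac + (ad + bc) eps, since the eps^2 term vanishes.
Base.:*(x::DualNum, y::DualNum) =
    DualNum(x.re * y.re, x.re * y.ep + x.ep * y.re)
```

Seeding the epsilon part with 1 then differentiates through multiplication: `DualNum(3.0, 1.0) * DualNum(3.0, 1.0)` gives `DualNum(9.0, 6.0)`, i.e. x^2 and its derivative 2x at x = 3.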
I will carry on with this complex step differentiation implementation by defining one more type soon, so as to extend AD to multivariate functions.
P.S. The `Dual` type is better off as a package `DualNumbers`, due to its general applicability in the implementation of dual algebra (the package name was suggested by Stefan). We can then invoke it via `using DualNumbers`. As for the additional dual type that I need to define specifically for performing complex step differentiation on multivariate functions, it is better to make it part of our `AutoDiff` package.
I opened an issue so that we can conclude the ongoing discussion on a possible partial rewrite of the forward AD code.