JuliaDiff / ForwardDiff.jl

Forward Mode Automatic Differentiation for Julia
Other
887 stars 141 forks source link

Are second derivatives (and higher) inefficient? #149

Open dbeach24 opened 7 years ago

dbeach24 commented 7 years ago

I'm using the Duals implementation in ForwardDiff and have a question about taking higher derivatives.

Consider this simple example:

using ForwardDiff
f(x) = sin(x) / x

@show y0 = f(10.0) # function
@show y1 = f(Dual(10.0, 1.0))  # function + 1st derivative
@show y2 = f(Dual(Dual(10.0, 1.0), Dual(1.0, 1.0))) # function + 1st,2nd derivatives

The 2nd derivative case seems unnecessarily inefficient because I must substitute a Dual number for both the value and partial components of the 1st (i.e. outer) Dual. If you run the code, you can see that y2.value.partials[1] == y2.partials[1].value, and that both give the value of 1st derivative. Must we compute this value twice? Obviously, this problem is more serious when one has a vector of derivatives to be evaluated, etc.

A modest proposal: Could we allow the types of the value and partials be different? That would seem to allow code like the following:

y2 = f(Dual(10.0, Dual(1.0, 1.0)))

Is there already some support for this, and I'm just missing it?

jrevels commented 7 years ago

They're set up this way so that nested dual numbers correspond algebraically to hyper-dual numbers; otherwise, you'd be throwing out a cross term. What you're suggesting could still be useful for scalar derivatives, but...

Obviously, this problem is more serious when one has a vector of derivatives to be evaluated, etc.

This is actually where that extra cross term is necessary for the sake of correctness. For example, to take a 3x3 Hessian with Duals in a single function call, the seed structure looks like this (where x = [2,3,4]):

3-element Array{ForwardDiff.Dual{3,ForwardDiff.Dual{3,Int64}},1}:
 Dual(Dual(2,1,0,0),Dual(1,0,0,0),Dual(0,0,0,0),Dual(0,0,0,0))
 Dual(Dual(3,0,1,0),Dual(0,0,0,0),Dual(1,0,0,0),Dual(0,0,0,0))
 Dual(Dual(4,0,0,1),Dual(0,0,0,0),Dual(0,0,0,0),Dual(1,0,0,0))

Anyway, I'd definitely be open to a PR which implements your suggestion, if we can prove the compile-time performance doesn't suffer because of it.

dbeach24 commented 7 years ago

@jrevels Thanks for the background. Please allow me to build up to my use case:

I start with a small vector-valued function (let's assume the example of cartesian to spherical coordinates in 3-space). What I want to do is to take multiple derivatives of this function. For example:

  1. Evaluate with a position+velocity (6-state) vector:

    This, of course, is extremely straightforward:

    result = f([Dual(x, vx), Dual(y, vy), Dual(z, vz)])
  2. Evaluate the Jacobian at a position+velocity vector:

    This is also not hard to do, but we now see some "mixing" of different dual structures.

    idx(i) = ntuple(x->(x==i ? 1.0 : 0.0), 6)
    result = f([
       Dual(Dual(x, idx(1)), Dual(vx, idx(4))),
       Dual(Dual(y, idx(2)), Dual(vy, idx(5))),
       Dual(Dual(z, idx(3)), Dual(vz, idx(6))),
    ])

    This is what I found to work for taking both a "time derivative" and a Jacobian simultaneously. Is there a better way?

  3. Evaluate with a position+velocity+acceleration (9-state) vector:

    result = f([
       Dual(Dual(x, vx), Dual(vx, ax)),
       Dual(Dual(y, vy), Dual(vy, ay)),
       Dual(Dual(z, vz), Dual(vz, az)),
    ])

    In the above example, we nest the Duals, but the structure is still consistent. The transformed velocity components are computed twice, a minor inefficiency which could be solved by my proposal.

  4. Evaluate the Jacobian at a position+velocity+acceleration (9-state) vector:

    idx(i) = ntuple(x->(x==i ? 1.0 : 0.0), 9)
    result = f([
       Dual(Dual(Dual(x, idx(1)), Dual(vx, idx(4))), Dual(Dual(vx, idx(4)), Dual(ax, idx(7)))),
       Dual(Dual(Dual(y, idx(2)), Dual(vy, idx(5))), Dual(Dual(vy, idx(5)), Dual(ay, idx(8)))),
       Dual(Dual(Dual(z, idx(3)), Dual(vz, idx(6))), Dual(Dual(vz, idx(6)), Dual(az, idx(9)))),
    ])

    (Notice how this mirrors the structure of example 3 above, except that each component (e.g. vx) is now replaced with a dual number that includes its position in the state vector, i.e. vx => Dual(vx, idx(4)).)

    Here, the problem becomes more pronounced. I want to nest Duals as with the cases above, but the portion for the velocity values (and the corresponding terms in the Jacobian) is redundant. In this case, 3 values + 27 partials are computed twice, and this is what motivates my earlier comment.

I would like to write:

result = f([
    Dual(Dual(x, idx(1)), Dual(Dual(vx, idx(4)), Dual(ax, idx(7)))),
    Dual(Dual(y, idx(2)), Dual(Dual(vy, idx(5)), Dual(ay, idx(8)))),
    Dual(Dual(z, idx(3)), Dual(Dual(vz, idx(6)), Dual(az, idx(9)))),
])

I think this would solve my problem, but perhaps I am misusing the structure of Dual numbers. If that's so, please enlighten me.

jrevels commented 7 years ago

Evaluate with a position+velocity (6-state) vector:

This, of course, is extremely straightforward:

result = f([Dual(x, vx), Dual(y, vy), Dual(z, vz)])

Hmm. I don't understand this part, so maybe we should start there. The vector you're constructing here is:

[x + ϵ * v_x,
 y + ϵ * v_y,
 z + ϵ * v_z]

but it seems like you should rather be constructing:

[x + ϵ_x * v_x,
 y + ϵ_y * v_y,
 z + ϵ_z * v_z]

which, using Dual, looks like:

[Dual(x, v_x, 0.0, 0.0),
 Dual(y, 0.0, v_y, 0.0),
 Dual(z, 0.0, 0.0, v_z)]

Alternatively, if your function is f(x::Vector, t::Real)::Vector, and you'd like to take time derivatives, then you'd want to input f(x, t + ϵ_t) --> f(x, Dual(t, 1.0))

You can only get away with a single perturbation "shared" between all input dimensions if the function doesn't express any coupling/interdependence between the input dimensions, which is not generally true (but might be true for some vector-valued kinematics cases).

dbeach24 commented 7 years ago

No, it's more like I'm constructing:

[x + ϵ_t * vx,
 y + ϵ_t * vy,
 z + ϵ_t * vz]

where ϵ_t is the time-derivative.

For example, the function for converting spherical to cartesian for a combined position + velocity (6) state is:

function sphere_to_cart_6(sphere)
    (r, θ, ϕ, ∂r, ∂θ, ∂ϕ) = sphere

    x = r * sin(θ) * cos(ϕ)
    y = r * sin(θ) * sin(ϕ)
    z = r * cos(θ)

    ∂x = ∂r*sin(θ)*cos(ϕ) + r*(cos(θ)*cos(ϕ)*∂θ - sin(θ)*sin(ϕ)*∂ϕ)
    ∂y = ∂r*sinθsinϕ + r*(cos(θ)*sin(ϕ)*∂θ + sin(θ)*cos(ϕ)*∂ϕ)
    ∂z = ∂r*cos(θ) - r*sin(θ)*∂θ

    [x, y, z, ∂x, ∂y, ∂z]
end

But you could write just the 3-D version of this function, and let ForwardDiff compute the derivatives for you by constructing the Dual numbers I proposed earlier. And clearly, the components in the function mix, but that's okay because you actually want them to. Remember, vx is not a derivative w.r.t. x, it's a time derivative. Since vx, vy, and vz are all time-derivatives, only one partial component is needed, and the duals "mix" as intended.

ADDED: The next step is to talk about taking the Jacobian of the 6-D function, but first let's agree on what I'm doing here before we move on. For now, this is just about using Dual numbers to take a time derivative.

MORE: I think this is merely me using the "Differential Geometry" interpretation of dual numbers -- the simple kind with only one partial, like those available in the DualNumbers.jl package.

jrevels commented 7 years ago

Ah, I see where my confusion was - thanks for the clarification. I thought you were either trying to calculate positional derivatives (hence my suggestion about orthogonal perturbations) or trying to calculate velocities/accelerations from a time-dependent position function (hence my suggestion of f(x, t + ϵ_t)). Now I see that you're actually using the dual numbers to apply coordinate transformations to the input vector + its time derivatives simultaneously.

This makes sense now, I just hadn't thought of this use case before. Now the rest of your other comment makes sense to me.

Like I said before, I'd definitely be open to a PR which implements your suggestion to use two type parameters. We'll have to be careful about causing compile-time regressions, though, which I fear might be quite bad as we'll be needlessly exposing more type info to the compiler in many common cases.

dbeach24 commented 7 years ago

Giving this more thought, I'm convinced that while allowing the value and partial types to differ could still be interesting for some applications, it won't solve the problem I had in mind.

For example, consider the following multiplication (for which I have expanded two derivatives):

z = x * y
dz = y*dx + x*dy
ddz = x*ddy + y*ddx + 2*dx*dy

Representing x and y using dual numbers with different value and partial types, one might be tempted to write:

X = Dual(x, Dual(dx, ddx))
Y = Dual(y, Dual(dy, ddy))

Calculating Z, one gets:

Z = Dual(x*y, Dual(y*dx + x*dy, x*ddy + y*ddx))

Instead of (using Dual numbers correctly):

X = Dual(Dual(x, dx), Dual(dx, ddx))
Y = Dual(Dual(y, dy), Dual(dy, ddy))
Z = X * Y = Dual(Dual(x*y, dx*dy), Dual(dx*dy, x*ddy + y*ddx + 2*dx*dy))

So, the first is clearly wrong because it is missing the 2*dx*dy term from the 2nd derivative value, and (@jrevels) your earlier comment regarding missing cross-terms is spot-on. So while there is an inefficiency in evaluating the 2nd derivative using Dual numbers (i.e. computing the first derivative value twice), merely allowing the value and partials types to differ fails to properly solve that problem.

Perhaps what I want is an entirely different kind of hyper-dual number which is able to represent a chain of derivative values (i.e. DerivChain(x, dx, ddx, ...)). This type would need to respect the chain rule and product rule, for example, but should avoid redundant computations (such as computing dx*dy twice). I don't know how difficult it would be to code a generic implementation of this idea, but it's something I hope to think about more in my not-so-copious spare time.

jrevels commented 7 years ago

This type would need to respect the chain rule and product rule, for example, but should avoid redundant computations (such as computing dx*dy twice)

I think the best way forward (without implementing a new type) would be to simply implement this optimization for nested Duals by dispatching on Base.:*{N,M,A,B}(::Dual{N,Dual{M,A}}, ::Dual{N,Dual{M,B}}). In general, I'd love to see more optimized methods for nested Dual types.

dbeach24 commented 7 years ago

Wouldn't we need the guarantee that, for example:

x.value.partials[1] == x.partials[1].value

?

Is this true in all cases? How could this be known at compile-time dispatch?

EDIT: Also, wouldn't this still have the inefficiency that the first derivative values are being stored twice?

jrevels commented 7 years ago

Ah, you're right for multiplication. I'm pretty sure there are other higher order methods that could be further optimized w.r.t. nested Dual numbers, though LLVM might be smart enough to unroll and do the CSE for us in that case (especially since we inline scalar arithmetic aggressively).