HIPS / autograd

Efficiently computes derivatives of NumPy code.
MIT License

Store Hessian/Jacobian Expression graph? #492

Open trautman opened 5 years ago

trautman commented 5 years ago

Hi all!

Thank you for the wonderful package! I'm using autograd to do Newton optimization of a convex function (so the Hessian is positive definite). However, the computation of the Hessian at each iteration of Newton's method is prohibitively expensive. For example, in the following code:

def newton(ll, z0):
    grad_llambda = ag.grad(ll)
    hess_llambda = ag.hessian(ll)

    z = z0
    for _ in range(max_iter):

        t0 = time.time()
        gll = grad_llambda(z)
        hll = hess_llambda(z)
        print('hess and grad time', time.time()-t0)

        t0 = time.time()
        delta = np.linalg.solve(hll, -gll)
        print('solve time', time.time()-t0)

        z = z + delta
        if np.linalg.norm(delta[0:T]) < tol:
            break
        elif _ >= (max_iter-4):
            print('ERROR: NEWTON FAILED', _)    
    return z

I see "Hess and grad times" at each iteration step around 1.14 seconds and "solve times" of around 0.001 seconds.

If I were to hand-code my Hessian (highly undesirable), I can't imagine it taking 1 second to evaluate---the Hessian is only 325x325, and the objective function is not that complicated.

I'm guessing the computation graph of the Hessian is being recomputed each time I call hess_llambda(z), but I'm not sure, and I don't really know how to store the expression graph. Is there a way to do this?
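
For reference, a quick way to separate the cost of constructing hess_llambda from the cost of evaluating it (using the same ll and z0 as above) would be something like:

import time
import autograd as ag

# Time the construction of the Hessian function separately from its evaluation.
t0 = time.time()
hess_llambda = ag.hessian(ll)
print('construction time', time.time() - t0)

t0 = time.time()
hll = hess_llambda(z0)
print('evaluation time', time.time() - t0)

If the construction step is essentially free, then all of the time is going into the evaluation call itself.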

Also, I've tried placing

grad_llambda = ag.grad(ll)
hess_llambda = ag.hessian(ll)

outside of the function definition, and this doesn't change anything.

Thank you!

CamDavidsonPilon commented 5 years ago

A matrix of 325 x 325 is not trivial, and could take some time to compute - though 1 second+ does seem a bit high. Can you post your log-likelihood as well?

A parallel solution is to use a Hessian-free optimization routine, like scipy's (L-)BFGS methods. These methods approximate the Hessian, and though they take more steps, they are typically as fast or faster. For example:

from autograd import value_and_grad
from scipy.optimize import minimize

minimize(value_and_grad(ll), z0, jac=True, method='BFGS')

trautman commented 5 years ago

Hi Cameron!

Thank you for your feedback! Here is the objective function; for context, T = 25 and ess = 13 (i.e., 13*25 = 325).

In the meantime, I'll give BFGS a try.

def log_lambda_diag_slice_x(x):

    T = np.size(robot_mu_x)

    quad_robot_mu_x = np.dot((x[:T]-robot_mu_x).T, np.dot(inv_robot_cov_x, \
                              x[:T]-robot_mu_x))

    llambda = -0.5*quad_robot_mu_x 

    n = 1
    for ped in range(math.ceil(ess)):
        top = top_Z_indices[ped]

        quad_ped_mu_x[ped] = np.dot((x[n*T:(n+1)*T] - \
                             mu_linear_conditioned_x[top][buffer:]).T, \
                             np.dot(inv_ped_cov_x[top], x[n*T:(n+1)*T] - \
                             mu_linear_conditioned_x[top][buffer:]))
        n = n + 1
        llambda = llambda - 0.5*quad_ped_mu_x[ped] 

    n = 1
    for ped in range(math.ceil(ess)):
        top = top_Z_indices[ped]
        log_znot_x[ped] = 0.

        quad_robot_ped_x[ped] = np.multiply(np.power(\
                             x[:T]-x[n*T:(n+1)*T], 2), inv_var_sum_x[top])
        log_znot_x[ped] = log_znot_x[ped] + np.sum(np.log1p(\
                                      -np.exp(-0.5*quad_robot_ped_x[ped])))
        n = n + 1
        llambda = llambda + log_znot_x[ped]

    return -1.*llambda

trautman commented 5 years ago

It's also worth noting that this Hessian is sparse---all the mixed derivatives with respect to ped_i and ped_j are 0 (only the rows corresponding to the robot's second derivatives are non-zero). So the number of non-zero entries is 25*13*25 + 13*25 = 8450, while the full Hessian has 325**2 = 105625 entries.

Does autograd exploit sparsity at all? I would expect the computation graph to naturally pick up on this, but I'm not sure.
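
For what it's worth, one way to exploit this block structure by hand would be to ask autograd for one 25x25 diagonal block at a time instead of the full 325x325 matrix. A rough sketch (block_diag_hessian is a made-up helper; ll, z, T, and the block count are assumed to be as above):

import autograd.numpy as np
from autograd import hessian

def block_diag_hessian(ll, z, T, n_blocks):
    # Fill in only the diagonal T x T blocks of the Hessian of ll at z,
    # differentiating with respect to one block of variables at a time.
    H = np.zeros((n_blocks * T, n_blocks * T))
    for b in range(n_blocks):
        sl = slice(b * T, (b + 1) * T)

        def ll_block(xb, sl=sl):
            # Only this block varies; the rest of z is held fixed.
            z_full = np.concatenate([z[:sl.start], xb, z[sl.stop:]])
            return ll(z_full)

        H[sl, sl] = hessian(ll_block)(z[sl])
    # The robot/pedestrian coupling blocks (the dense block row and column)
    # are still missing and would have to be filled in separately.
    return H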

CamDavidsonPilon commented 5 years ago

I don't believe autograd detects or cares whether the matrix is sparse. (I could be wrong; I'm not a core developer of autograd.)

Your likelihood has some for loops and vector assignments. I would try to iron out both of these, as they are likely the cause of any slowdown.
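
For instance, the last loop (the log_znot term) could probably be written without the Python loop or the indexed assignments, along these lines. A rough sketch, assuming n_ped = math.ceil(ess), the same globals top_Z_indices and inv_var_sum_x as in your objective, and that each inv_var_sum_x[top] has length T:

import autograd.numpy as np

def log_znot_vectorized(x, T, n_ped, top_Z_indices, inv_var_sum_x):
    # Stack all pedestrian blocks x[T:2T], ..., x[n_ped*T:(n_ped+1)*T] at once.
    peds = x[T:(n_ped + 1) * T].reshape(n_ped, T)
    inv_var = np.stack([inv_var_sum_x[top] for top in top_Z_indices[:n_ped]])
    quad = (x[:T] - peds) ** 2 * inv_var          # shape (n_ped, T)
    return np.sum(np.log1p(-np.exp(-0.5 * quad)))

A fused array expression like this gives autograd a handful of traced operations in total instead of a handful per pedestrian.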

Did the BFGS solution lead anywhere?

trautman commented 5 years ago

Hi Cameron!

Thank you for your response! BFGS did perform better (both in solution quality and in speed), although the CPU usage went way up. This algorithm is for a power-constrained robot, so using 3 high-power cores is just not in the cards. In any case, it's a great backup and alternative! I learned something useful, which I appreciate.

I agree that the for loops and vector assignments are likely causing a slowdown, and I do have some ironing out to do.

Thank you!

CamDavidsonPilon commented 5 years ago

Heck, if you think your likelihood has no local solutions (I'm not smart enough to identify if it does), you can even try gradient descent, which will be pretty fast (though test it) and very memory cheap. This won't use the second derivative at all, or any approximation of it.
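
Something as bare-bones as the following would do it. A minimal sketch; the step size lr, max_iter, and tol are placeholders rather than tuned values:

import autograd.numpy as np
from autograd import grad

def gradient_descent(ll, z0, lr=1e-2, max_iter=1000, tol=1e-6):
    g = grad(ll)                  # build the gradient function once
    z = z0
    for _ in range(max_iter):
        step = lr * g(z)          # plain first-order step; no Hessian anywhere
        z = z - step
        if np.linalg.norm(step) < tol:
            break
    return z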

trautman commented 5 years ago

Actually, the likelihood is convex, so not only is the local solution the global solution, but the global optimum is guaranteed to be very close to the initialization (this is a separate proof).

This is part of the reason I'm confused that the Newton method doesn't take something like 0.0000001 seconds. I can usually get the delta correction to within an extremely small tolerance in 1 or 2 Newton steps. I've derived the Hessian and it's pretty benign.

I'm going to hand-code the Hessian in the meantime, because the solution should then be essentially instantaneous (this also lets me skip a large number of computations, e.g., the zeros). There just aren't many computations happening, and certainly not more than a second's worth.

But autograd is so convenient! It's just not working well for my application, or I'm not using it correctly. Since I know autograd gives the correct answer, I can at least use it to verify my hand-coded Hessian (which will surely have sign errors and other mistakes).
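
Something along these lines, say, where hand_hessian is a placeholder for my hand-derived implementation:

import autograd.numpy as np
from autograd import hessian

def check_hessian(ll, hand_hessian, z, atol=1e-8):
    # Compare the hand-coded Hessian against autograd's at a given point.
    H_auto = hessian(ll)(z)
    H_hand = hand_hessian(z)
    return np.allclose(H_auto, H_hand, atol=atol)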