HIPS / autograd

Efficiently computes derivatives of NumPy code.
MIT License

Hessian diagonal computation #445

Open MaxHalford opened 5 years ago

MaxHalford commented 5 years ago

Hello,

I've been playing around with autograd and I'm having a blast. However, I'm having some difficulty extracting the diagonal of the Hessian.

This is my current code:

from autograd import hessian
import autograd.numpy as np 

y_pred = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1]
], dtype=float)

weights = np.array([1, 1, 1, 1, 1], dtype=float)

def softmax(x, axis=1):
    z = np.exp(x)
    return z / np.sum(z, axis=axis, keepdims=True)

def loss(y_pred):

    y_true = np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]
    ], dtype=float)

    ys = np.sum(y_true, axis=0)
    y_true = y_true / ys
    ln_p = np.log(softmax(y_pred))
    wll = np.sum(y_true * ln_p, axis=0)
    loss = -np.dot(weights, wll)
    return loss

hess = hessian(loss)(y_pred)

I understand that hessian is simply jacobian applied twice and that hess is an n × p × n × p array. I can extract the diagonal manually and obtain my expected output, which is:

[[0.24090069 0.12669198 0.12669198 0.12669198 0.12669198]
 [0.12669198 0.24090069 0.12669198 0.12669198 0.12669198]
 [0.12669198 0.12669198 0.12669198 0.24090069 0.12669198]
 [0.12669198 0.12669198 0.24090069 0.12669198 0.12669198]
 [0.04223066 0.04223066 0.04223066 0.04223066 0.08030023]
 [0.04223066 0.04223066 0.04223066 0.04223066 0.08030023]
 [0.04223066 0.04223066 0.04223066 0.04223066 0.08030023]]

I've checked this numerically and it's fine. The problem is that this still requires computing the full Hessian before accessing the diagonal part, which is really expensive. Is there any better way to proceed? I think this is a common enough use case in machine learning optimization that it deserves a dedicated convenience function.

neonwatty commented 5 years ago

EDIT: THIS PROPOSED SOLUTION DOES NOT WORK IN GENERAL - see dougalm's answer below for explanation

Just provided a solution for this over at #417 - small world.

Below is an example showing how to compute the gradient and just the diagonal of the Hessian (using elementwise_grad twice).

from autograd import numpy as np
from autograd import elementwise_grad

# an example function
g = lambda w: 5*w[0]**2 + 7*w[1]**2

# define a test point
w = np.array([2.0,2.0])

# compute gradient function, then evaluate
nabla_g = elementwise_grad(g)
print ('gradient at test point = ' + str(nabla_g(w)))

# just compute pure second derivatives - the diagonal of the hessian - without computing full hessian first
diag_hess_g = elementwise_grad(nabla_g)
print ('diagonal of hessian at test point = ' + str(diag_hess_g(w)))
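
For reference, with the values above this should print something like the following (and it is correct here, since this particular g is separable, so the diagonal-Jacobian caveat discussed further down doesn't bite):

`gradient at test point = [20. 28.]`
`diagonal of hessian at test point = [10. 14.]`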

PS: you can check out the entire list of built-in differential operators here.

MaxHalford commented 5 years ago

Hey, thanks a lot for the answer.

I thought about doing that, but I find the results surprising.

Here is the updated example using the approach you're proposing.

from autograd import elementwise_grad
import autograd.numpy as np 

y_true = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1]
], dtype=float)

y_pred = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1]
], dtype=float)

weights = np.array([1, 1, 1, 1, 1], dtype=float)

def softmax(x, axis=1):
    z = np.exp(x)
    return z / np.sum(z, axis=axis, keepdims=True)

def loss(y_pred):

    y_true = np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]
    ], dtype=float)

    ys = np.sum(y_true, axis=0)
    y_true = np.divide(y_true, ys)
    ln_p = np.log(softmax(y_pred))
    wll = np.sum(y_true * ln_p, axis=0)
    loss = -np.dot(weights, wll)
    return loss

nabla_g = elementwise_grad(loss)

diag_hess_g = elementwise_grad(nabla_g)

print(diag_hess_g(y_pred))

This returns:

[[7.54474768e-17 2.77555756e-17 2.77555756e-17 2.77555756e-17
  2.77555756e-17]
 [2.77555756e-17 7.54474768e-17 2.77555756e-17 2.77555756e-17
  2.77555756e-17]
 [2.77555756e-17 2.77555756e-17 2.77555756e-17 7.54474768e-17
  2.77555756e-17]
 [2.77555756e-17 2.77555756e-17 7.54474768e-17 2.77555756e-17
  2.77555756e-17]
 [6.93889390e-18 6.93889390e-18 6.93889390e-18 6.93889390e-18
  1.88618692e-17]
 [6.93889390e-18 6.93889390e-18 6.93889390e-18 6.93889390e-18
  1.88618692e-17]
 [6.93889390e-18 6.93889390e-18 6.93889390e-18 6.93889390e-18
  1.88618692e-17]]

which isn't the result I'm expecting (I provided the expected output in my initial post). Instead it's basically a matrix full of zeros. I'm not saying you're wrong, but I'm pretty sure of my expected output. First of all, I did the same thing with TensorFlow and it worked fine. Secondly, when I used it in my machine learning routine it worked very well. On the other hand, the output obtained by applying elementwise_grad twice yields very poor results, so it "seems" wrong.

MaxHalford commented 5 years ago

Here is the code I'm using to extract the diagonal part.

hess = hessian(loss)

H = hess(y_pred)

# pull out the pure second derivatives H[i, j, i, j]
diag = np.array([
    H[i, j, i, j]
    for j in range(H.shape[1])
    for i in range(H.shape[0])
])

diag = diag.reshape(y_pred.shape, order='F')
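
As a side note, the same entries can be pulled out without the Python loop using einsum's repeated-index notation; a small sketch, still assuming the full H from above has already been computed (so it saves the loop, not the Hessian cost):

diag = np.einsum('ijij->ij', H)   # H[i, j, i, j] for all i, j; shape (7, 5)
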
neonwatty commented 5 years ago

Side-question (from looking a bit closer at your original code): are you trying to differentiate the function loss with respect to weights? Right now, the way you have implemented loss, autograd will by default differentiate it with respect to the first input argument, y_pred. This is true regardless of the differential operator (e.g., grad or hessian): autograd differentiates a function with respect to its first argument by default.

If you want to differentiate with respect to weights you can write your function like shown below

def loss(weights, y_pred):

    y_true = np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]
    ], dtype=float)

    ys = np.sum(y_true, axis=0)
    y_true = y_true / ys
    ln_p = np.log(softmax(y_pred))    
    wll = np.sum(y_true * ln_p, axis=0)
    loss = -np.dot(weights, wll)

    return loss

Since your weights have length 5, your gradient will also have length 5 and your Hessian will be a 5 × 5 matrix. Then (I assume you're computing the Hessian in order to run Newton's method on the weights?) you're good to go, i.e. you won't need to take the diagonal of the Hessian.
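
As an aside, autograd's differential operators also accept an argnum, so an alternative to reordering the arguments is to keep a two-argument loss and select which argument to differentiate. A small sketch, assuming the loss(weights, y_pred) defined above (and assuming hessian handles argnum the same way grad does, as used later in this thread):

from autograd import grad, hessian

grad_wrt_weights = grad(loss, 0)     # differentiate w.r.t. the first argument (the default)
hess_wrt_y_pred = hessian(loss, 1)   # differentiate w.r.t. the second argument instead

print(grad_wrt_weights(weights, y_pred).shape)   # (5,)
print(hess_wrt_y_pred(weights, y_pred).shape)    # (7, 5, 7, 5)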

MaxHalford commented 5 years ago

Hey,

Nope, I'm trying to differentiate w.r.t. y_pred. And no, the idea is that I only need the diagonal of the Hessian for gradient boosting; computing the full Hessian is too costly.

dougalm commented 5 years ago

Unfortunately, I don't think it's possible to compute the diagonal of the Hessian other than by taking N separate Hessian-vector products, equivalent to instantiating the full Hessian and then taking the diagonal. People resort to all sorts of tricks to estimate the trace of the Hessian (e.g. https://arxiv.org/abs/1802.03451) precisely because it's expensive to evaluate the diagonal.
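
For completeness, here is a minimal sketch of that N-separate-Hessian-vector-products approach applied to the loss from the first post, built only out of grad (hvp and hess_diag are illustrative names, not autograd API; autograd also exposes a hessian_vector_product operator that could replace the hand-rolled hvp):

from autograd import grad
import autograd.numpy as np

grad_loss = grad(loss)   # loss and y_pred as defined in the first post

def hvp(x, v):
    # Hessian-vector product: gradient of (grad_loss(x) . v), no full Hessian built
    return grad(lambda x_: np.sum(grad_loss(x_) * v))(x)

def hess_diag(x):
    # one Hessian-vector product per entry of x, keeping only the matching entry
    diag = np.zeros(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            v = np.zeros(x.shape)
            v[i, j] = 1.0
            diag[i, j] = hvp(x, v)[i, j]
    return diag

print(hess_diag(y_pred))   # should reproduce the expected matrix from the first post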

Autograd's elementwise_grad has a very important caveat: it only applies to functions for which the Jacobian is diagonal. All it does is a vector-Jacobian product with a vector of ones, which gives you the sum of each row of the Jacobian. If the Jacobian is diagonal, then that's the same thing as the diagonal of the Jacobian.

That caveat was in the docstring of an earlier version of elementwise_grad. But at some point we deleted the function (because it can be misleading!) and then later we reinstated it, without the docstring. I just added the caveat back in. Sorry for the confusion.

dougalm commented 5 years ago

Sorry, I meant to say "sum of each column". I was thinking of forward mode (which we should probably be using in elementwise_grad anyway).
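
To make the caveat concrete, here is a small check with a function whose Jacobian is not diagonal (cumsum, whose Jacobian is lower-triangular); elementwise_grad returns the column sums of the Jacobian, not its diagonal:

from autograd import elementwise_grad, jacobian
import autograd.numpy as np

f = lambda x: np.cumsum(x)      # Jacobian is lower-triangular, not diagonal
x = np.array([1.0, 2.0, 3.0])

J = jacobian(f)(x)              # [[1. 0. 0.] [1. 1. 0.] [1. 1. 1.]]
print(elementwise_grad(f)(x))   # [3. 2. 1.]  = ones @ J, the column sums
print(np.diag(J))               # [1. 1. 1.]  the actual diagonal

In the original example, I believe each per-row 5 × 5 block of the Hessian is a scalar multiple of diag(p) - p pᵀ (with p the softmax probabilities for that row), and the columns of that matrix sum to zero, which is why composing elementwise_grad twice returns floating-point noise near zero instead of the expected diagonal.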

neonwatty commented 5 years ago

How stupid of a work-around (not invoking jacobian or elementwise_grad, i.e. trying to avoid both the diagonal-Jacobian restriction and computing the second cross-partials) would it be to loop over the input arguments and build the pure second partials one at a time using grad? Supposing the input arguments are unpacked, and computing the j-th pure second partial as grad(grad(g, j), j)?

I'm assuming pretty stupid, but for example:

from autograd import grad

# a test function of two scalar arguments
g = lambda w0, w1: 5*w0**2 + 7*w1**2 + w0*w1

# a test point
w0 = 2.0; w1 = 2.0

# construct the pure second partials one at a time
num_partials = 2  # one pure second partial per argument of g
second_pure_partials = [grad(grad(g, j), j) for j in range(num_partials)]

# evaluate the pure second partials one at a time
diag_hess = lambda w0, w1: [part(w0, w1) for part in second_pure_partials]
print('pure second derivatives at test point = ' + str(diag_hess(w0, w1)))

`pure second derivatives at test point = [10.0, 14.0]`

Compare this to the "incorrect" answer obtained by composing elementwise_grad in an attempt to get at the pure partials:


from autograd import elementwise_grad

bad_hess_diag = elementwise_grad(elementwise_grad(g, (0, 1)), (0, 1))
print('"incorrect" pure second derivatives at test point = ' + str(bad_hess_diag(w0, w1)))

`"incorrect" pure second derivatives at test point = (array(11.), array(15.))`