Is your feature request related to a problem? Please describe.

When using the first-order-influence-koh-liang branch, I run into trouble when computing the exact inverse-Hessian-vector product on a semantic segmentation model. Here is a minimal example and the corresponding output logs:

import tensorflow as tf

from influenciae.common.model_wrappers import InfluenceModel
from influenciae.influence.inverse_hessian_vector_product import ExactIHVP

IMG_SIZE = 768

inp = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))

# A conv block
x = tf.keras.layers.Conv2D(filters=32, kernel_size=1, strides=(1, 1))(inp)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Activation('relu')(x)
# FCN block
x = tf.keras.layers.UpSampling2D(
    size=(IMG_SIZE // x.shape[1], IMG_SIZE // x.shape[2]),
)(x)
model_output = tf.keras.layers.Conv2D(NUM_CLASSES, kernel_size=(1, 1), padding="same")(x)

# define model
model = tf.keras.Model(inputs=inp, outputs=model_output)
# freeze all layers except last one
for layer in model.layers:
    layer.trainable = False
for layer in model.layers[-1:]:
    layer.trainable = True
# define a semantic segmentation loss with reduction set to NONE
class CustomLoss2(tf.keras.losses.Loss):

    def __init__(self, num_classes, ignore_label):
        super(CustomLoss2, self).__init__(name='CustomLoss2', reduction=tf.keras.losses.Reduction.NONE)

        self.num_classes = num_classes
        self.ignore_label = ignore_label

    def call(self, y_true, y_pred):

        sample_weights = tf.cast(tf.not_equal(y_true, self.ignore_label), dtype=tf.float32)
        one_hot_gt = tf.stop_gradient(tf.one_hot(y_true, self.num_classes))

        loss = tf.nn.softmax_cross_entropy_with_logits(one_hot_gt, y_pred)
        weighted_loss = tf.multiply(loss, tf.squeeze(sample_weights))

        # Compute mean loss over spatial dimension.
        num_non_zero = tf.reduce_sum(
            tf.cast(tf.not_equal(weighted_loss, 0.0), tf.float32), 1)
        loss_sum_per_sample = tf.reduce_sum(weighted_loss, 1)
        return tf.reduce_sum(tf.math.divide_no_nan(loss_sum_per_sample, num_non_zero), 1)

if __name__ == "__main__":
    random_input = tf.random.normal(shape=(4, IMG_SIZE, IMG_SIZE, 3))
    random_target = tf.random.uniform(shape=(4, IMG_SIZE, IMG_SIZE), minval=0, maxval=NUM_CLASSES-1, dtype=tf.int32)

    random_dataset = tf.data.Dataset.from_tensor_slices((random_input, random_target))

    # define InfluenceModel
    influence_model = InfluenceModel(model, target_layer=-1, loss_function=CustomLoss2(NUM_CLASSES, ignore_label=255))
    # freeze all layers except last one
    for layer in influence_model.layers:
        layer.trainable = False
    for layer in influence_model.layers[-1:]:
        layer.trainable = True
    ihvp_calculator = ExactIHVP(influence_model, random_dataset.take(1).batch(1))


As you can see, I face an OOM issue when trying to allocate a tensor with shape [640, 1, 768, 768, 32]. Here 640 is the number of weights (so basically the size of the gradient vector), 1 is the number of inputs, and [768, 768, 32] is the size of the input once it has passed through all the layers except the last one. As you might notice, this tensor is allocated when we try to do:

hess = tf.squeeze(tape_hess.jacobian(grads, weights))

In the function _compute_inv_hessian in the file.
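To put a number on that allocation, here is a rough back-of-the-envelope estimate. It assumes the tensor is dense and stored as float32 (4 bytes per element); `tensor_gib` is a hypothetical helper written for this illustration, not part of influenciae.

```python
# Rough size estimate of the intermediate tensor from the OOM message,
# assuming a dense float32 tensor (4 bytes per element).
from math import prod

shape = (640, 1, 768, 768, 32)  # shape reported in the OOM message
BYTES_PER_FLOAT32 = 4


def tensor_gib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in GiB of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element / 2**30


full_gib = tensor_gib(shape)          # the whole tensor at once
per_row_gib = tensor_gib(shape[1:])   # a single slice along the gradient dimension

print(f"full tensor: {full_gib:.1f} GiB")
print(f"one gradient slice: {per_row_gib * 1024:.0f} MiB")
```

Under these assumptions the full tensor needs about 45 GiB, while a single slice along the gradient dimension is about 72 MiB, which is why splitting along that dimension looks attractive.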

Describe the solution you'd like

I know that computing the Hessian requires this tensor, but I was wondering whether it could be split along the gradient dimension. My colleague @dv-ai found a workaround that only requires a few small changes to the _compute_inv_hessian function:


  # Current implementation
  def _compute_inv_hessian(self, dataset: tf.data.Dataset) -> tf.Tensor:
      """
      Compute the (pseudo)-inverse of the hessian matrix with respect to the model's parameters using backward-mode AD.

      Disclaimer: this implementation trades memory usage for speed, so it can be quite memory intensive, especially
      when dealing with big models.

      Args:
          dataset:
              A TF dataset containing the whole or part of the training dataset for the computation of the inverse
              of the mean hessian matrix.

      Returns:
          A tf.Tensor with the resulting inverse hessian matrix
      """
      weights = self.model.weights
      with tf.GradientTape(persistent=False, watch_accessed_variables=False) as tape_hess:
          # the tape does not watch variables automatically, so watch the weights explicitly
          tape_hess.watch(weights)
          grads = self.model.batch_gradient(dataset) if dataset._batch_size == 1 \
              else self.model.batch_jacobian(dataset)

      # materializes the full jacobian of the gradients at once (this is where the OOM occurs)
      hess = tf.squeeze(tape_hess.jacobian(grads, weights))
      hessian = tf.reduce_mean(tf.reshape(hess, (-1, int(tf.reduce_prod(weights.shape)), int(tf.reduce_prod(weights.shape)))), axis=0)

      return tf.linalg.pinv(hessian)


  # Modified implementation (workaround)
  def _compute_inv_hessian(self, dataset: tf.data.Dataset) -> tf.Tensor:
      """
      Compute the (pseudo)-inverse of the hessian matrix with respect to the model's parameters using
      backward-mode AD.

      Disclaimer: this implementation trades memory usage for speed, so it can be quite
      memory intensive, especially when dealing with big models.

      Args:
          dataset:
              A TF dataset containing the whole or part of the training dataset for the
              computation of the inverse of the mean hessian matrix.

      Returns:
          A tf.Tensor with the resulting inverse hessian matrix
      """
      weights = self.model.weights
      # change 1: persistent=True
      with tf.GradientTape(persistent=True, watch_accessed_variables=False) as tape_hess:
          # the tape does not watch variables automatically, so watch the weights explicitly
          tape_hess.watch(weights)
          grads = self.model.batch_gradient(dataset) if dataset._batch_size == 1 \
              else self.model.batch_jacobian(dataset) # pylint: disable=W0212

      # change 2: parallel_iterations=10 and experimental_use_pfor=False
      hess = tf.squeeze(tape_hess.jacobian(grads, weights, parallel_iterations=10, experimental_use_pfor=False))

      hessian = tf.reduce_mean(tf.reshape(hess,
                                          (-1, int(tf.reduce_prod(weights.shape)),
                                           int(tf.reduce_prod(weights.shape)))), axis=0)

      return tf.linalg.pinv(hessian)

By changing persistent to True and passing parallel_iterations=10 and experimental_use_pfor=False to the .jacobian call, the computation completes.

N.B.: the value 10 itself is not important, as long as it is a divisor of the length of the gradient vector (unfortunate for prime lengths, though).

See if I add to my script:


I got:

[[ 2.3457441  -0.07872738 -0.11368337 ...  0.02131678  0.02238739
 [-0.07837234  2.576137   -0.12778574 ...  0.02324321  0.02715976
 [-0.11375846 -0.12770845  2.8135462  ...  0.02000072  0.0255051
 [ 0.02132007  0.02319163  0.01998054 ...  0.7005969  -0.01072854
  -0.0270703 ]
 [ 0.02241289  0.02717561  0.02547131 ... -0.0106203   0.87094194
 [ 0.04103031  0.03757853  0.03215647 ... -0.02701988 -0.0346096
   0.77158403]], shape=(640, 640), dtype=float32)

The computation still takes some time, but that makes sense since there are a lot of parameters. Is there any way to set those parameters in the constructor, or at least when calling _compute_inv_hessian? Or otherwise, to automatically split the computation over the different gradients?
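To illustrate what an automatic split could look like, here is a minimal sketch of the index bookkeeping only, not of the Hessian computation itself. `chunk_ranges` is a hypothetical helper, not an existing influenciae function; it lets the last chunk be shorter, so the chunk size would not need to divide the gradient length.

```python
def chunk_ranges(n_items, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order.

    The last chunk may be shorter than chunk_size, so chunk_size does not
    have to divide n_items.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# 640 gradient entries in chunks of 10: 64 full chunks
ranges_640 = list(chunk_ranges(640, 10))

# 641 is prime: 64 full chunks plus one final chunk of length 1
ranges_641 = list(chunk_ranges(641, 10))

print(len(ranges_640), ranges_640[-1])
print(len(ranges_641), ranges_641[-1])
```

Each (start, stop) pair would then select the slice of the gradient vector whose Jacobian is computed in one go, with the resulting blocks concatenated along the first axis to form the full Hessian.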

Additional remarks

While running these experiments, I also noticed a few things: