Q: Applicability to object detection?

philferriere commented 3 years ago

Hi Frederick & Co,

Thank you for sharing your awesome work with us! I was wondering if you've already put some thought into how your approach can be extended to analyzing the quality of training data for object detection.

Your technique is fairly straightforward to understand in the context of image classification, where there is one classification head and one can easily find a layer close to the output (e.g., use the weights that connect the layer before the logit layer to the logit layer, as mentioned in the FAQ doc), allowing for the computation of a TracInCP score per frame. But how does one do this in the case of an object detector where there are typically two heads (one for classification, one for bounding box regression) and a varying number of objects per frame? Do you believe that object-level gradients for the objects in a training frame can be aggregated in a meaningful way (e.g., mean or max of gradients across objects for a frame?) such that one can still compute a meaningful dot product between aggregated loss gradients for a training frame and aggregated loss gradients for a test frame? Would this have to be done separately for classification and regression (e.g. a classification proponent may turn out to be a regression opponent, and that would be useful information to surface)?

Are you aware of similar attempts at extending your work to object detection?

Thank you for taking the time to share your thoughts on this, @frederick0329 !

frederick0329 commented 3 years ago

Thank you for the great question! Here are some conclusions from my exploration on COCO + SSD -

Calculate the loss for only two kinds of boxes per image: predicted boxes and the proposal boxes closest to the ground truth boxes. (Including other proposal boxes makes the results very hard to interpret.) I didn't treat each positive/negative box as an example because I don't see how that scales.
Split the problem into two: compute the classification gradient and the bounding box gradient, and calculate proponents/opponents/high self-influence for each. (Again, it's really hard to draw conclusions when the two are considered together.)
Results are not as clear as for the classification task. For example, high self-influence can come not only from a mislabeled class but also from bad proposal boxes (the predicted box does not have a positive label). There are still some examples whose high self-influence I can't explain.
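
A minimal sketch of the "split the problem into two" idea above, assuming each image has already been reduced to one gradient vector per head (how to build those vectors is exactly the open question in this thread). The names `rank_by_influence` and `self_influence` are hypothetical, not part of the TracIn repository, and a single checkpoint is used for brevity.

```python
import numpy as np

def rank_by_influence(train_grads, test_grad):
    """Rank training images by gradient dot product with one test image.

    train_grads: (N, D) one gradient vector per training image (assumed given)
    test_grad:   (D,)   gradient vector of the test image
    Returns (proponents, opponents): indices sorted by most positive
    and by most negative score, respectively.
    """
    scores = train_grads @ test_grad
    order = np.argsort(scores)  # ascending
    return order[::-1], order

def self_influence(train_grads):
    """Self-influence of each training image: squared gradient norm."""
    return np.sum(train_grads * train_grads, axis=1)

# Toy data: 4 training images, with separate gradient vectors per head.
rng = np.random.default_rng(0)
cls_train, reg_train = rng.normal(size=(4, 8)), rng.normal(size=(4, 3))
cls_test, reg_test = rng.normal(size=8), rng.normal(size=3)

cls_proponents, cls_opponents = rank_by_influence(cls_train, cls_test)
reg_proponents, reg_opponents = rank_by_influence(reg_train, reg_test)
```

Because the two heads are ranked independently, the same image can show up as a classification proponent and a regression opponent.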

In summary, this is an open area. I am not aware of any work in this direction, but I do think it would be high-impact work. One reason I didn't continue is that object detectors come in many forms, and I don't have a good idea of how to adapt TracIn to fit all the variants.

Hope this helps and please do share if you make any progress on this.

philferriere commented 3 years ago

Hi again @frederick0329 ,

These are good tips! I'm still struggling a bit to come up with a meaningful implementation/pseudo-code for a single-stage SSD object detector. I'll try to write the pseudo-code over the next couple of days and hopefully you can guide me in the right direction if my approach doesn't really make sense.

Thank you again for all your help so far!

philferriere commented 3 years ago

Hi @frederick0329 !

All right... so, please bear with me as I try to capture why I'm failing to see a straight path to a working implementation for object detection using SSD. If anything below doesn't make sense, please don't hold back and feel free to correct me. Again, the context for this discussion is training data valuation, where I want to surface the frames that contribute the most to reducing validation/test loss.

For a quick refresher on SSD, if needed, here are two short, to-the-point posts that do a good job at nailing the essentials:

To kinda keep things tidy, let's assume we use an SSD model with num_maps feature maps, num_anchors fixed anchors per grid cell, box_encoding box encodings, and num_classes classes. Typically, num_maps would be between 4 and 6, num_anchors per grid cell between 3 and 6, box_encoding would be typically 4 (dx, dy, dw, dh, or some slight variation), and num_classes could be 10 or significantly more. Also, let's assume that our features are the multi-scale feature maps of a vanilla multi-res CNN backbone. They are upsampled to the largest scale feature maps and concatenated together into a nice [batch_size, H, W, num_channels] tensor (num_channels being typically between 256 and 1024).

Again, to keep things simple, let's assume we have two separate heads (one for bounding box regression, one for classification), each made of a simple Conv2D layer. The final model will issue predictions for each grid cell, hence giving us a regression tensor of size [batch_size, H, W, num_anchors, box_encoding] and a classification tensor of shape [batch_size, H, W, num_anchors, num_classes] with logits (no softmax for now).
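
To make these shapes concrete, here is a small NumPy sketch. It assumes each head is a 1x1 convolution (the heads described above could just as well use a 3x3 kernel), in which case the head is a per-cell matrix multiply followed by a reshape that exposes the `num_anchors` axis. The helper name `conv1x1_head` and all sizes are illustrative only.

```python
import numpy as np

batch_size, H, W, num_channels = 2, 4, 4, 16
num_anchors, box_encoding, num_classes = 3, 4, 5

rng = np.random.default_rng(0)
features = rng.normal(size=(batch_size, H, W, num_channels))

def conv1x1_head(x, weights, num_anchors, per_anchor):
    """Apply a 1x1 conv (per-cell matmul) and split out the anchor axis."""
    out = x @ weights  # (B, H, W, num_anchors * per_anchor)
    return out.reshape(*x.shape[:3], num_anchors, per_anchor)

w_reg = rng.normal(size=(num_channels, num_anchors * box_encoding))
w_cls = rng.normal(size=(num_channels, num_anchors * num_classes))

pred_reg = conv1x1_head(features, w_reg, num_anchors, box_encoding)
pred_cls = conv1x1_head(features, w_cls, num_anchors, num_classes)
print(pred_reg.shape)  # (2, 4, 4, 3, 4)
print(pred_cls.shape)  # (2, 4, 4, 3, 5)
```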

For loss calculations, as you pointed out, we only pay attention to the num_matched positive anchor box matches (at some arbitrary IoU between ground truth boxes and our set of anchors), and compute two different losses, a regression loss L_reg and a classification loss L_cls, between predicted boxes and ground truth boxes for the positive matches. The final loss is basically (L_cls + alpha * L_reg) / num_matched, but let's ignore alpha for now. Note that num_matched is almost never the same across individual input images.
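
As a rough sketch of that loss for a single image, assuming smooth L1 for regression and softmax cross-entropy for classification, both restricted to positive matches as described above (a full SSD loss would typically also include hard-mined negatives for classification). The function names and the flattened per-anchor layout are assumptions made for illustration.

```python
import numpy as np

def smooth_l1(diff):
    """Elementwise smooth L1 (Huber with delta = 1)."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < 1.0, 0.5 * diff ** 2, abs_diff - 0.5)

def softmax_cross_entropy(logits, class_ids):
    """Per-row cross-entropy from raw logits and integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(class_ids)), class_ids]

def detection_loss(pred_reg, gt_reg, pred_cls, gt_cls, positive, alpha=1.0):
    """(L_cls + alpha * L_reg) / num_matched over positive anchors only.

    pred_reg, gt_reg: (A, box_encoding); pred_cls: (A, num_classes)
    gt_cls: (A,) integer class ids; positive: (A,) boolean match mask
    """
    num_matched = int(positive.sum())
    if num_matched == 0:
        return 0.0  # no matched anchors: this image contributes no loss
    l_reg = smooth_l1(pred_reg[positive] - gt_reg[positive]).sum()
    l_cls = softmax_cross_entropy(pred_cls[positive], gt_cls[positive]).sum()
    return (l_cls + alpha * l_reg) / num_matched
```

The explicit `num_matched == 0` branch matters for influence scores, since an image with no matches would otherwise produce a division by zero.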

At the risk of stating the obvious, this setup is more complicated than the image classification scenario because:

the number of matched default boxes can change dramatically from image to image
we combine two different losses with different scales which might make disentangling 'influence' harder

Now, finding the most valuable training samples is actually trivial -- once one has found a good representation for loss gradients and activations in our scenario. Here's a straightforward implementation that (on purpose) borrows heavily from your own notebook naming conventions:

import numpy as np

def find_high_value_samples_by_influence(tracin_train, tracin_eval):
  """
  Find the training samples with the most positive influence on a given validation/test set.
  Args:
    tracin_train: dict with loss gradients and activations for the training set
      tracin_train["reg_loss_grads"]: stacked regression loss gradients (N, 1, num_ckpts) floats
      tracin_train["cls_loss_grads"]: stacked classification loss gradients (N, num_classes, num_ckpts) floats
      tracin_train["activations"]: activations (N, H, W, num_channels, num_ckpts) floats
    tracin_eval: dict with loss gradients and activations for the validation set
      tracin_eval["reg_loss_grads"]: stacked regression loss gradients (M, 1, num_ckpts) floats
      tracin_eval["cls_loss_grads"]: stacked classification loss gradients (M, num_classes, num_ckpts) floats
      tracin_eval["activations"]: activations (M, H, W, num_channels, num_ckpts) floats
  Returns:
    regression_proponents: training sample indices, most valuable first (np.ndarray)
    classification_proponents: training sample indices, most valuable first (np.ndarray)
  Notes:
    N is the number of training samples, M is the number of eval samples
    H, W are the height and width of the largest feature map at the end of the CNN backbone
  """
  N, M = len(tracin_train["reg_loss_grads"]), len(tracin_eval["reg_loss_grads"])
  # To find the most valuable training samples, sum each training sample's TracInCP score over all eval samples.
  # Do this separately for regression and classification.
  reg_scores, cls_scores = [], []
  for n in range(N):
    reg_score, cls_score = 0, 0
    for m in range(M):
      # Accumulate the scores of this training sample
      reg_score += tracin_score(tracin_train['reg_loss_grads'][n], tracin_train['activations'][n], tracin_eval['reg_loss_grads'][m], tracin_eval['activations'][m])
      cls_score += tracin_score(tracin_train['cls_loss_grads'][n], tracin_train['activations'][n], tracin_eval['cls_loss_grads'][m], tracin_eval['activations'][m])
    # Store the accumulated scores of this training sample
    reg_scores.append(reg_score)
    cls_scores.append(cls_score)
  # Rank all the training samples by their accumulated scores (most valuable first)
  regression_proponents = np.argsort(reg_scores)[::-1]  
  classification_proponents = np.argsort(cls_scores)[::-1]  
  return regression_proponents, classification_proponents
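
The function above calls `tracin_score`, which is not defined anywhere in this thread. One possible sketch is below. It assumes the last-layer factorization used for dense layers, where the weight gradient is the outer product of the layer input and the loss gradient, so the dot product of two weight gradients factors into an activation similarity times a loss-gradient similarity. For a convolutional head applied at many grid cells this factorization is not exact, so treat it as an approximation. Per-checkpoint learning-rate weights are omitted.

```python
import numpy as np

def tracin_score(loss_grad_a, activation_a, loss_grad_b, activation_b):
    """Hypothetical helper: TracInCP-style score between two samples.

    loss_grad_*:  (K, num_ckpts) loss gradient w.r.t. the head output
    activation_*: (..., num_ckpts) head input; leading axes are flattened
    Assumes grad_W = outer(activation, loss_grad), hence
    <grad_W_a, grad_W_b> = <act_a, act_b> * <loss_grad_a, loss_grad_b>.
    """
    num_ckpts = loss_grad_a.shape[-1]
    lg_sim = np.sum(loss_grad_a * loss_grad_b, axis=0)   # (num_ckpts,)
    act_a = np.reshape(activation_a, (-1, num_ckpts))
    act_b = np.reshape(activation_b, (-1, num_ckpts))
    a_sim = np.sum(act_a * act_b, axis=0)                # (num_ckpts,)
    return float(np.sum(lg_sim * a_sim))                 # sum over checkpoints
```

With this signature, the double loop above could also be replaced by two matrix products per checkpoint, which matters once N and M get large.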

The challenge, obviously, isn't there. It's in trying to build our tracin_* vectors. Below is a first draft approach that, yes, would never work. However, I would like to see if we can use it to progressively get to a working implementation, by answering the four questions placed in the code comments below:

def run_obj_det(inputs):
  """
  Gets the loss gradients and activations for an `inputs` dataset
  Params:
    inputs: a shard of the input data
  Return:
    image ids (batch_size) int
    stacked regression loss gradients (batch_size, 1, num_ckpts) floats
    stacked classification loss gradients (batch_size, num_classes, num_ckpts) floats
    stacked activations (batch_size, H, W, num_channels, num_ckpts) floats
  """
  imageids, images, labels = inputs
  batch_size = len(images)
  # imageids: list of sample IDs (batch_size, int)
  # images: list of samples (batch_size, 224, 224, 3)
  # labels: list of labels (batch_size, list of box encodings and classes)

  # ckpt_reg_loss_grads: regression loss gradients for each checkpoint
  # ckpt_cls_loss_grads: classification loss gradients for each checkpoint
  # ckpt_activations: activations at the end of the CNN backbone for each checkpoint
  ckpt_reg_loss_grads, ckpt_cls_loss_grads, ckpt_activations = [], [], []

  # Loop as many times as there are checkpoints
  # For each model checkpoint, get the common input endpoint to the detection heads,
  # and the output endpoints of each detection head
  for mp, ml_reg, ml_cls in zip(models_penultimate, models_reg_last, models_cls_last):

    # mp: model endpoint for the concatenation layer at the end of the CNN backbone
    # ml_reg: model endpoint for the output of the Conv2D layer that is the regression head
    # ml_cls: model endpoint for the output of the Conv2D layer that is the classification head

    # For the batch of input images, get the common input to the detection heads (concatenated CNN features)
    h = mp(images)  # (batch_size, H, W, num_channels) floats

    # Get the predictions (regressed values and logits) at each grid cell
    pred_reg = ml_reg(h)  # (batch_size, H, W, num_anchors, box_encoding) floats
    pred_cls = ml_cls(h)  # (batch_size, H, W, num_anchors, num_classes) floats

    # Only use the positive matches for each input sample
    pred_reg_list, gt_reg_list, pred_cls_list, gt_cls_list = matched(labels, anchors, pred_reg, pred_cls)

    # pred_reg_list: list with batch_size entries, each predicted entry of the shape (num_matches, box_encoding), num_matches differs for each entry
    # gt_reg_list: list with batch_size entries, each gt entry of the shape (num_matches, box_encoding), num_matches differs for each entry
    # pred_cls_list: list with batch_size entries, each predicted entry of the shape (num_matches, num_classes), num_matches differs for each entry
    # gt_cls_list: list with batch_size entries, each gt entry of the shape (num_matches, num_classes), num_matches differs for each entry

    reg_loss_grads, cls_loss_grads = [], []
    for b in range(batch_size):
      # Get the regression loss gradient for the positive matches in this image
      y_true = gt_reg_list[b]    # (num_matches, box_encoding) floats
      y_pred = pred_reg_list[b]  # (num_matches, box_encoding) floats
      num_matches = len(y_true)  # get the number of matches int
      # In real life, we'd use a smooth L1, but you get the idea...
      # Note: this is the per-box L1 loss value; its gradient w.r.t. y_pred would be tf.sign(y_pred - y_true)
      loss_grad = tf.abs(y_true - y_pred)
      # Q1a: what's the most appropriate aggregation function for regression, here?
      # Q1b: is the below the best encoding for a frame-representative regression loss gradient?
      avg_loss_grad = tf.reshape(tf.reduce_sum(loss_grad) / num_matches, [1])  # (1,)
      reg_loss_grads.append(avg_loss_grad)

      # Get the classification loss gradient for the positive matches in this image
      # Note that we don't penalize wrong classifications for negative matches below, but we should...
      labels_gt = gt_cls_list[b]    # (num_matches, num_classes), already one-hot
      logits_pred = pred_cls_list[b]  # (num_matches, num_classes)
      probs = tf.nn.softmax(logits_pred)   # (num_matches, num_classes probas) floats
      # labels_gt is already one-hot per the shape above, so tf.one_hot is not applied again
      loss_grad = tf.cast(labels_gt, probs.dtype) - probs   # (num_matches, num_classes) floats
      # Q2a: what's the most appropriate aggregation function for classification, here?
      # Q2b: is the below the best encoding for a frame-representative classification loss gradient?
      avg_loss_grad = tf.reduce_sum(loss_grad, axis=0) / num_matches  # (num_classes,)
      cls_loss_grads.append(avg_loss_grad)

    # Save the (batch_size, H, W, num_channels) activations at the end of the CNN backbone
    # Save the (batch_size, 1) regression loss gradients
    # Save the (batch_size, num_classes) classification loss gradients
    ckpt_activations.append(h)  
    ckpt_reg_loss_grads.append(reg_loss_grads)
    ckpt_cls_loss_grads.append(cls_loss_grads)

  return imageids, tf.stack(ckpt_reg_loss_grads, axis=-1), tf.stack(ckpt_cls_loss_grads, axis=-1), tf.stack(ckpt_activations, axis=-1)
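
For questions Q1a/Q1b, here is a sketch of what an actual regression loss gradient could look like, as opposed to the loss value computed in the draft. It assumes smooth L1 with delta = 1 and uses the mean over matched boxes as the frame-level encoding. That choice is only one candidate, not a recommendation, and the function names are hypothetical.

```python
import numpy as np

def smooth_l1_grad(y_pred, y_true):
    """d(smooth L1)/d(y_pred), elementwise, with delta = 1.

    The derivative is diff where |diff| < 1 and sign(diff) elsewhere,
    which is the same as clipping diff to [-1, 1].
    """
    return np.clip(y_pred - y_true, -1.0, 1.0)

def frame_reg_loss_grad(y_pred, y_true):
    """Mean over matched boxes of the per-box gradient: (box_encoding,)."""
    if len(y_true) == 0:
        return np.zeros(y_pred.shape[-1])  # no matches: zero gradient
    return smooth_l1_grad(y_pred, y_true).mean(axis=0)

# Two boxes whose errors point in opposite directions
y_true = np.zeros((2, 2))
y_pred = np.array([[0.5, 3.0], [-0.5, -3.0]])
print(frame_reg_loss_grad(y_pred, y_true))  # [0. 0.]
```

The toy example shows a downside of averaging: two badly regressed boxes with opposite-signed errors cancel out, so the frame looks as if it had no regression gradient at all.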

This may seem like a lot of code, but most of it is actually wordy comments ;)

This is a lot to ask, Frederick, I realize it. Still, I'm hoping you'll be willing to take a quick look and share your take/thoughts on the above. Am I on the right track? Is this in line with what you had tried before? Or, did I completely miss the mark?

As always, many thanks for your time and support!

Cheers, -- Phil

frederick0329 commented 3 years ago

Disclaimer: I do not have the right solution either :) At a high level,

```
# Only use the positive matches for each input sample
pred_reg_list, gt_reg_list, pred_cls_list, gt_cls_list = matched(labels, anchors, pred_reg, pred_cls)
```
This is a good start, but we might be missing the negative influence examples here. I think we should try both positive and negative matches and see what the downstream task wants.
Q1, Q2: I guess you are trying to work at the example level here as opposed to the box level. I find the results extremely challenging to interpret when we aggregate all boxes. In terms of aggregation, the loss here should be aggregated the same way as during training. When I narrowed down to specific boxes, I just added masks to filter out the boxes I didn't want, but the aggregation logic stayed the same (the one used in training).
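
One way to read the masking suggestion above, sketched for the classification head only: compute the per-box gradient exactly as in training, keep the training normalizer, and zero out the boxes that are not of interest. The function name, the two-mask interface, and the choice to keep the training normalizer are assumptions for illustration, not a description of what was actually run.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def masked_cls_loss_grad(logits, class_ids, train_mask, keep_mask):
    """Gradient of the classification loss w.r.t. the logits, restricted
    to the boxes selected by keep_mask.

    logits: (A, num_classes); class_ids: (A,) integer labels
    train_mask: (A,) bool, anchors that contribute to the training loss
    keep_mask:  (A,) bool, anchors to keep for the influence analysis
    The normalizer is the training one (number of training anchors), so
    filtering boxes does not change how each remaining box is weighted.
    """
    num_classes = logits.shape[-1]
    one_hot = np.eye(num_classes)[class_ids]
    per_box = softmax(logits) - one_hot            # d(cross-entropy)/d(logits)
    normalizer = max(int(train_mask.sum()), 1)
    selected = (train_mask & keep_mask)[:, None]
    return np.where(selected, per_box, 0.0) / normalizer  # (A, num_classes)
```

With `keep_mask` set to all `True` this reduces to the full training gradient, and `train_mask` can include negative anchors if the training loss does.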

p.s. It might be easier next time if you can paste the code somewhere where we can comment and discuss.

frederick0329 / TracIn

Q: Applicability to object detection? #3