Open philferriere opened 3 years ago
Thank you for the great question! Here are some conclusions from my exploration on COCO + SSD -
In summary, this is an open area. I am not aware of any work in this direction, but I do think this is high-impact work. One reason I didn't continue is that object detectors come in many forms, and I don't have a good idea of how to adapt TracIn to fit all the variants.
Hope this helps and please do share if you make any progress on this.
Hi again @frederick0329 ,
These are good tips! I'm still struggling a little bit coming up with a meaningful implementation/pseudo-code for a single stage SSD object detector. I'll try to write the pseudo-code over the next couple of days and hopefully you can guide me in the right direction if my approach doesn't really make sense.
Thank you again for all your help so far!
Hi @frederick0329 !
All right... so, please bear with me as I try to capture why I'm failing to see a straight path to a working implementation for object detection using SSD. If anything below doesn't make sense, please don't hold back and feel free to correct me. Again, the context for this discussion is training data valuation, where I want to surface the frames that contribute the most to reducing validation/test loss.
For a quick refresher on SSD, if needed, here are two short, to-the-point posts that do a good job at nailing the essentials:
To kinda keep things tidy, let's assume we use an SSD model with num_maps feature maps, num_anchors fixed anchors per grid cell, box_encoding box encodings, and num_classes classes. Typically, num_maps would be between 4 and 6, num_anchors per grid cell between 3 and 6, box_encoding would typically be 4 (dx, dy, dw, dh, or some slight variation), and num_classes could be 10 or significantly more. Also, let's assume that our features are the multi-scale feature maps of a vanilla multi-res CNN backbone. They are upsampled to the largest scale feature maps and concatenated together into a nice [batch_size, H, W, num_channels] tensor (num_channels being typically between 256 and 1024).
Again, to keep things simple, let's assume we have two separate heads (one for bounding box regression, one for classification), each made of a simple Conv2D layer. The final model will issue predictions for each grid cell, hence giving us a regression tensor of shape [batch_size, H, W, num_anchors, box_encoding] and a classification tensor of shape [batch_size, H, W, num_anchors, num_classes] with logits (no softmax for now).
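As a quick sanity check on the shapes described above, here is a minimal NumPy sketch that stands in for the two heads. It assumes each head is a 1x1 convolution (which is just a per-cell matrix multiply); the concrete sizes and the `conv1x1_head` helper are made up for illustration, and a real SSD head would more likely be a 3x3 Conv2D.

```python
import numpy as np

# Illustrative sizes only (assumptions, not taken from a specific SSD config).
batch_size, H, W, num_channels = 2, 38, 38, 256
num_anchors, box_encoding, num_classes = 4, 4, 10

rng = np.random.default_rng(0)
# Stand-in for the concatenated backbone features.
features = rng.normal(size=(batch_size, H, W, num_channels))

def conv1x1_head(x, out_per_anchor):
    """Hypothetical stand-in for a Conv2D head: a 1x1 conv is a matmul over channels."""
    w = rng.normal(size=(x.shape[-1], num_anchors * out_per_anchor)) * 0.01
    out = x @ w  # (batch_size, H, W, num_anchors * out_per_anchor)
    # Split the channel axis into (num_anchors, out_per_anchor).
    return out.reshape(x.shape[0], x.shape[1], x.shape[2], num_anchors, out_per_anchor)

pred_reg = conv1x1_head(features, box_encoding)  # box offsets per anchor
pred_cls = conv1x1_head(features, num_classes)   # class logits per anchor

print(pred_reg.shape)
print(pred_cls.shape)
```

Every grid cell therefore carries num_anchors candidate boxes, which is why the number of predictions is fixed per image while the number of matched anchors is not.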
For loss calculations, as you pointed out, we only pay attention to the num_matched positive anchor box matches (at some arbitrary IoU between ground truth boxes and our set of anchors), and compute two different losses, a regression loss L_reg and a classification loss L_cls, between predicted boxes and ground truth boxes for the positive matches. The final loss is basically (L_cls + alpha * L_reg) / num_matched, but let's ignore alpha for now. Note that num_matched is almost never the same across individual input images.
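To make the per-image loss concrete, here is a small NumPy sketch of (L_cls + alpha * L_reg) / num_matched over positive matches only. The function names are hypothetical, smooth L1 and softmax cross-entropy are assumed as the two losses, and returning 0 when there are no matches is an assumed convention.

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    """Element-wise smooth L1 (Huber-like) loss."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a**2 / beta, a - 0.5 * beta)

def softmax_cross_entropy(logits, class_ids):
    """Per-row cross-entropy between softmax(logits) and integer class ids."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(class_ids)), class_ids]

def ssd_image_loss(pred_reg, gt_reg, pred_logits, gt_class_ids, alpha=1.0):
    """Loss for ONE image, using only its positive anchor matches.

    pred_reg, gt_reg: (num_matched, box_encoding)
    pred_logits:      (num_matched, num_classes)
    gt_class_ids:     (num_matched,) ints
    """
    num_matched = len(gt_class_ids)
    if num_matched == 0:
        return 0.0  # assumed convention: no matches, no loss
    l_reg = smooth_l1(pred_reg - gt_reg).sum()
    l_cls = softmax_cross_entropy(pred_logits, gt_class_ids).sum()
    return float((l_cls + alpha * l_reg) / num_matched)
```

Because num_matched differs per image, the same per-anchor error contributes more to the loss of an image with few matches than to one with many, which is worth keeping in mind when comparing scores across frames.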
At the risk of stating the obvious, this setup is more complicated than the image classification scenario because:
Now, finding the most valuable training samples is actually trivial -- once one has found a good representation for loss gradients and activations in our scenario. Here's a straightforward implementation that (on purpose) borrows heavily from your own notebook naming conventions:
import numpy as np

def find_high_value_samples_by_influence(tracin_train, tracin_eval):
    """
    Find the training samples with the most positive influence on a given validation/test set.
    Args:
        tracin_train: dict with loss gradients and activations for the training set
            tracin_train["reg_loss_grads"]: stacked regression loss gradients (N, 1, num_ckpts) floats
            tracin_train["cls_loss_grads"]: stacked classification loss gradients (N, num_classes, num_ckpts) floats
            tracin_train["activations"]: activations (N, H, W, num_channels, num_ckpts) floats
        tracin_eval: dict with loss gradients and activations for the validation set
            tracin_eval["reg_loss_grads"]: stacked regression loss gradients (M, 1, num_ckpts) floats
            tracin_eval["cls_loss_grads"]: stacked classification loss gradients (M, num_classes, num_ckpts) floats
            tracin_eval["activations"]: activations (M, H, W, num_channels, num_ckpts) floats
    Returns:
        regression_proponents: ordered training sample indices (array of ints)
        classification_proponents: ordered training sample indices (array of ints)
    Notes:
        N is the number of training samples, M is the number of eval samples
        H,W are the height and width of the largest feature map at the end of the CNN backbone
        tracin_score() is not defined in this snippet
    """
    N, M = len(tracin_train["reg_loss_grads"]), len(tracin_eval["reg_loss_grads"])
    # To find the most valuable training samples, aggregate the LHS of the TracInCP equation.
    # Do this for regression and classification
    reg_scores, cls_scores = [], []
    for n in range(N):
        reg_score, cls_score = 0, 0
        for m in range(M):
            # Accumulate the scores of this training sample
            reg_score += tracin_score(
                tracin_train["reg_loss_grads"][n], tracin_train["activations"][n],
                tracin_eval["reg_loss_grads"][m], tracin_eval["activations"][m])
            cls_score += tracin_score(
                tracin_train["cls_loss_grads"][n], tracin_train["activations"][n],
                tracin_eval["cls_loss_grads"][m], tracin_eval["activations"][m])
        # Store the accumulated scores of this training sample
        reg_scores.append(reg_score)
        cls_scores.append(cls_score)
    # Rank all the training samples by their accumulated scores (most valuable first)
    regression_proponents = np.argsort(reg_scores)[::-1]
    classification_proponents = np.argsort(cls_scores)[::-1]
    return regression_proponents, classification_proponents
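The ranking function above calls a tracin_score helper without showing it. A minimal sketch of what it could look like follows, assuming the last-layer shortcut used for classification: per checkpoint, the score is the dot product of the loss gradients times the dot product of the activations, summed over checkpoints. The learning rate is assumed constant and dropped, and for a convolutional head applied at every grid cell this factorization is only an approximation, so treat it as a starting point rather than a validated formula.

```python
import numpy as np

def tracin_score(train_loss_grad, train_activation, eval_loss_grad, eval_activation):
    """Hypothetical TracInCP-style score between one training and one eval sample.

    *_loss_grad:  (..., num_ckpts) loss gradients w.r.t. the head outputs
    *_activation: (..., num_ckpts) inputs to the head
    The last axis indexes checkpoints in every argument.
    """
    lg_t = np.asarray(train_loss_grad, dtype=float)
    lg_e = np.asarray(eval_loss_grad, dtype=float)
    a_t = np.asarray(train_activation, dtype=float)
    a_e = np.asarray(eval_activation, dtype=float)
    num_ckpts = lg_t.shape[-1]
    # Per-checkpoint dot products: flatten everything except the checkpoint axis.
    lg_sim = (lg_t * lg_e).reshape(-1, num_ckpts).sum(axis=0)  # (num_ckpts,)
    a_sim = (a_t * a_e).reshape(-1, num_ckpts).sum(axis=0)     # (num_ckpts,)
    # Sum the per-checkpoint products (constant learning rate assumed).
    return float(np.sum(lg_sim * a_sim))
```

A positive score marks the training sample as a proponent of the eval sample (a step on it would reduce the eval loss), a negative score as an opponent.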
The challenge, obviously, isn't there. It's in trying to build our tracin_* vectors. Below is a first draft approach that, yes, would never work. However, I would like to see if we can use it to progressively get to a working implementation, by answering the four questions placed in the code comments below:
import tensorflow as tf

def run_obj_det(inputs):
    """
    Gets the loss gradients and activations for an `inputs` dataset
    Params:
        inputs: a shard of the input data
    Return:
        image ids (batch_size) int
        stacked regression loss gradients (batch_size, 1, num_ckpts) floats
        stacked classification loss gradients (batch_size, num_classes, num_ckpts) floats
        stacked activations (batch_size, H, W, num_channels, num_ckpts) floats
    """
    # models_penultimate, models_reg_last, models_cls_last, anchors and matched()
    # are assumed to be defined elsewhere
    imageids, images, labels = inputs
    batch_size = len(images)
    # imageids: list of sample IDs (batch_size, int)
    # images: list of samples (batch_size, 224, 224, 3)
    # labels: list of labels (batch_size, list of box encodings and classes)
    # ckpt_reg_loss_grads: regression loss gradients for each checkpoint
    # ckpt_cls_loss_grads: classification loss gradients for each checkpoint
    # ckpt_activations: activations at the end of the CNN backbone for each checkpoint
    ckpt_reg_loss_grads, ckpt_cls_loss_grads, ckpt_activations = [], [], []
    # Loop as many times as there are checkpoints
    # For each model checkpoint, get the common input endpoint to the detection heads,
    # and the output endpoints of each detection head
    for mp, ml_reg, ml_cls in zip(models_penultimate, models_reg_last, models_cls_last):
        # mp: model endpoint for the concatenation layer at the end of the CNN backbone
        # ml_reg: model endpoint for the output of the Conv2D layer that is the regression head
        # ml_cls: model endpoint for the output of the Conv2D layer that is the classification head
        # For the batch of input images, get the common input to the detection heads (concatenated CNN features)
        h = mp(images)  # (batch_size, H, W, num_channels) floats
        # Get the predictions (regressed values and logits) at each grid cell
        pred_reg = ml_reg(h)  # (batch_size, H, W, num_anchors, box_encoding) floats
        pred_cls = ml_cls(h)  # (batch_size, H, W, num_anchors, num_classes) floats
        # Only use the positive matches for each input sample
        pred_reg_list, gt_reg_list, pred_cls_list, gt_cls_list = matched(labels, anchors, pred_reg, pred_cls)
        # pred_reg_list: list with batch_size entries, each predicted entry of the shape (num_matches, box_encoding), num_matches differs for each entry
        # gt_reg_list: list with batch_size entries, each gt entry of the shape (num_matches, box_encoding), num_matches differs for each entry
        # pred_cls_list: list with batch_size entries, each predicted entry of the shape (num_matches, num_classes), num_matches differs for each entry
        # gt_cls_list: list with batch_size entries, each gt entry of the shape (num_matches, num_classes), num_matches differs for each entry
        reg_loss_grads, cls_loss_grads = [], []
        for b in range(batch_size):
            # Get the regression loss gradient for the positive matches in this image
            y_true = gt_reg_list[b]  # (num_matches, box_encoding) floats
            y_pred = pred_reg_list[b]  # (num_matches, box_encoding) floats
            num_matches = len(y_true)  # get the number of matches int
            # In real life, we'd use a smooth L1, but you get the idea...
            loss_grad = tf.abs(y_true - y_pred)  # (num_matches, box_encoding) floats
            # Q1a: what's the most appropriate aggregation function for regression, here?
            # Q1b: is the below the best encoding for a frame-representative regression loss gradient?
            avg_loss_grad = tf.reshape(tf.reduce_sum(loss_grad) / num_matches, [1])  # (1,)
            reg_loss_grads.append(avg_loss_grad)
            # Get the classification loss gradient for the positive matches in this image
            # Note that we don't penalize wrong classifications for negative matches below, but we should...
            labels_gt = gt_cls_list[b]  # (num_matches, num_classes), one-hot
            logits_pred = pred_cls_list[b]  # (num_matches, num_classes)
            probs = tf.nn.softmax(logits_pred)  # (num_matches, num_classes probas) floats
            # labels_gt is already one-hot per the shapes above, so no tf.one_hot() is needed
            loss_grad = labels_gt - probs  # (num_matches, num_classes) floats
            # Q2a: what's the most appropriate aggregation function for classification, here?
            # Q2b: is the below the best encoding for a frame-representative classification loss gradient?
            avg_loss_grad = tf.reduce_sum(loss_grad, axis=0) / num_matches  # (num_classes,)
            cls_loss_grads.append(avg_loss_grad)
        # Save the (batch_size, H, W, num_channels) activations at the end of the CNN backbone
        # Save the (batch_size, 1) regression loss gradients
        # Save the (batch_size, num_classes) classification loss gradients
        ckpt_activations.append(h)
        ckpt_reg_loss_grads.append(tf.stack(reg_loss_grads))
        ckpt_cls_loss_grads.append(tf.stack(cls_loss_grads))
    return imageids, tf.stack(ckpt_reg_loss_grads, axis=-1), tf.stack(ckpt_cls_loss_grads, axis=-1), tf.stack(ckpt_activations, axis=-1)
This may seem like a lot of code, but most of it is actually wordy comments ;)
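On Q1 specifically, one thing worth spelling out: the absolute difference in the draft is the L1 loss value rather than its gradient. For smooth L1, the gradient with respect to the predictions is the clipped difference. The NumPy sketch below shows that gradient plus a few candidate aggregations over a variable number of matches. The helper names and the `how` options are hypothetical, and which aggregation (if any) gives meaningful TracIn scores is exactly the open question here.

```python
import numpy as np

def smooth_l1_grad(y_pred, y_true, beta=1.0):
    """Gradient of element-wise smooth L1 w.r.t. y_pred: clip((y_pred - y_true) / beta, -1, 1)."""
    return np.clip((np.asarray(y_pred, dtype=float) - y_true) / beta, -1.0, 1.0)

def aggregate_matches(per_match_grads, how="mean"):
    """Reduce (num_matches, D) per-match gradients to one fixed-size (D,) vector per frame."""
    g = np.asarray(per_match_grads, dtype=float)
    if g.shape[0] == 0:
        return np.zeros(g.shape[1:])  # frame without positive matches
    if how == "mean":
        return g.mean(axis=0)
    if how == "sum":
        return g.sum(axis=0)
    if how == "max_norm":
        # Keep the single match with the largest gradient norm.
        return g[np.argmax(np.linalg.norm(g, axis=1))]
    raise ValueError(f"unknown aggregation: {how}")

# Two matched boxes in one frame, ground truth at zero for simplicity.
y_pred = np.array([[0.5, 3.0, -3.0, 0.0],
                   [-0.5, 0.0, 0.0, 0.0]])
grads = smooth_l1_grad(y_pred, np.zeros_like(y_pred))
print(aggregate_matches(grads, "mean"))
print(aggregate_matches(grads, "max_norm"))
```

Note how the first component cancels out under the mean: two objects with opposite errors can make a frame look uninformative, which is one argument for also trying sum or max-style aggregations.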
This is a lot to ask, Frederick, I realize it. Still, I'm hoping you'll be willing to take a quick look and share your take/thoughts on the above. Am I on the right track? Is this in line with what you had tried before? Or, did I completely miss the mark?
As always, many thanks for your time and support!
Cheers, -- Phil
Disclaimer: I do not have the right solution either :)
At a high level,
# Only use the positive matches for each input sample
pred_reg_list, gt_reg_list, pred_cls_list, gt_cls_list = matched(labels, anchors, pred_reg, pred_cls)
This is a good start, but we might be missing the negative influence examples here. I think we should try both pos and neg and see what the downstream task wants.
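One possible reading of "try both pos and neg" is to include negative (background) anchors in the classification gradient as well. The NumPy sketch below does this with hard negative mining at an assumed 3:1 negative-to-positive ratio. It assumes class 0 is background and that every anchor has an integer label; the function name and return format are made up for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cls_loss_grads_with_negatives(logits, anchor_labels, neg_pos_ratio=3):
    """Per-anchor classification loss gradients over positives plus hard negatives.

    logits:        (num_anchors_total, num_classes), class 0 = background (assumption)
    anchor_labels: (num_anchors_total,) ints, 0 for unmatched (negative) anchors
    Returns (grads, kept_anchor_indices).
    """
    anchor_labels = np.asarray(anchor_labels)
    probs = softmax(np.asarray(logits, dtype=float))
    pos = np.flatnonzero(anchor_labels > 0)
    neg = np.flatnonzero(anchor_labels == 0)
    # Hard negatives: background anchors with the lowest background probability,
    # i.e. the highest classification loss.
    num_neg = min(len(neg), neg_pos_ratio * len(pos))
    hard_neg = neg[np.argsort(probs[neg, 0])[:num_neg]]
    keep = np.concatenate([pos, hard_neg])
    one_hot = np.eye(probs.shape[1])[anchor_labels[keep]]
    # Same sign convention as the draft above: one_hot - probs.
    return one_hot - probs[keep], keep
```

The resulting per-anchor gradients can then be aggregated per frame in the same way as the positive-only ones, which makes it easy to compare rankings with and without negatives.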
p.s. It might be easier next time if you can paste the code somewhere where we can comment and discuss.
Hi Frederick & Co,
Thank you for sharing your awesome work with us! I was wondering if you've already put some thought into how your approach can be extended to analyzing the quality of training data for object detection.
Your technique is fairly straightforward to understand in the context of image classification, where there is one classification head and one can easily find a layer close to the output (e.g., use the weights that connect the layer before the logit layer to the logit layer, as mentioned in the FAQ doc), allowing for the computation of a TracInCP score per frame. But how does one do this in the case of an object detector where there are typically two heads (one for classification, one for bounding box regression) and a varying number of objects per frame? Do you believe that object-level gradients for the objects in a training frame can be aggregated in a meaningful way (e.g., mean or max of gradients across objects for a frame?) such that one can still compute a meaningful dot product between aggregated loss gradients for a training frame and aggregated loss gradients for a test frame? Would this have to be done separately for classification and regression (e.g. a classification proponent may turn out to be a regression opponent, and that would be useful information to surface)?
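To illustrate the last point, here is a small NumPy sketch that surfaces frames whose classification and regression influences disagree. It assumes per-frame scores have already been computed separately for each head (positive = proponent, negative = opponent); the function name and dictionary keys are hypothetical.

```python
import numpy as np

def split_by_head_agreement(cls_scores, reg_scores):
    """Group training frame indices by the sign of their per-head influence scores."""
    cls_scores = np.asarray(cls_scores, dtype=float)
    reg_scores = np.asarray(reg_scores, dtype=float)
    return {
        "proponent_both": np.flatnonzero((cls_scores > 0) & (reg_scores > 0)),
        "cls_proponent_reg_opponent": np.flatnonzero((cls_scores > 0) & (reg_scores < 0)),
        "reg_proponent_cls_opponent": np.flatnonzero((cls_scores < 0) & (reg_scores > 0)),
        "opponent_both": np.flatnonzero((cls_scores < 0) & (reg_scores < 0)),
    }

groups = split_by_head_agreement([1.0, -1.0, 2.0, -3.0], [1.0, 1.0, -2.0, -1.0])
print(groups["cls_proponent_reg_opponent"])
```

Frames in the two mixed groups would be lost if the scores of both heads were summed into a single number, which is one reason to keep the two rankings separate.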
Are you aware of similar attempts at extending your work to object detection?
Thank you for taking the time to share your thoughts on this, @frederick0329 !