JoshVarty / pytorch-retinanet

Reproducing the Detectron implementation of RetinaNet

Minibatch loader codepath #2

Open JoshVarty opened 5 years ago

JoshVarty commented 5 years ago

The primary codepath starts a number of threads that load images from disk in minibatches.

The minibatch loader codepath is much smaller, but the individual functions are often more involved and not always immediately clear.

minibatch_loader_thread(self)
    get_next_minibatch()
        _get_next_minibatch_inds()
        _get_minibatch(roidb)
            get_mini_batch_blob_names()
                get_retinanet_blob_names(is_training=True)
            _get_image_blob(roidb)
                ❗️prep_im_for_blob(im, pixel_means, target_size, max_size)
                ❗️im_list_to_blob(ims)

         ❗️add_retinanet_blobs(blobs, im_scales, roidb, image_width, image_height)
           ❗️get_field_of_anchors(stride, anchor_sizes, anchor_aspect_ratios, octave=None, aspect=None)
                generate_anchors(stride, sizes, aspect_ratios)
                    ❗️_generate_anchors(base_size, scales, aspect_ratios)
                        _ratio_enum(anchor, ratios)
                        _scale_enum(anchor, scales)
                FieldOfAnchors()
           ❗️_get_retinanet_blobs(foas, all_anchors, gt_boxes, gt_classes, im_width, im_height)
                bbox_overlaps(anchors, gt_boxes)
                compute_targets(ex_rois, gt_rois, weights=(1.0, 1.0, 1.0, 1.0))
                   ❗️bbox_transform_inv(boxes, gt_boxes, weights=(1.0, 1.0, 1.0, 1.0))
                unmap(data, count, inds, fill=0)

    coordinated_put(coordinator, queue, element)
JoshVarty commented 5 years ago

prep_im_for_blob(im, pixel_means, target_size, max_size)

Source | Caller

Prepare an image for use as a network input blob. Specifically:

  • Subtract per-channel pixel mean
  • Convert to float32
  • Rescale to each of the specified target sizes (capped at max_size)

Returns a list of transformed images, one for each target size, along with the scale factors that were used to compute each returned image.

im = a single input image
pixel_means = np.array([[[102.9801, 115.9465, 122.7717]]]) (from config)
target_size = 500 (from config)
max_size = 833 (from config)
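
For intuition, here is a minimal sketch of what this preparation amounts to for a single target size (my own reconstruction, not the Detectron source; Detectron's version loops over a list of target sizes and returns lists):

import cv2
import numpy as np

def prep_im_for_blob_sketch(im, pixel_means, target_size, max_size):
    """Subtract the per-channel mean, then rescale so the shorter side
    is target_size, capping the longer side at max_size."""
    im = im.astype(np.float32, copy=False)
    im -= pixel_means
    im_size_min = np.min(im.shape[0:2])
    im_size_max = np.max(im.shape[0:2])
    im_scale = float(target_size) / float(im_size_min)
    # Prevent the longer side from exceeding max_size
    if np.round(im_scale * im_size_max) > max_size:
        im_scale = float(max_size) / float(im_size_max)
    im = cv2.resize(im, None, None, fx=im_scale, fy=im_scale,
                    interpolation=cv2.INTER_LINEAR)
    return im, im_scale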

JoshVarty commented 5 years ago

im_list_to_blob(ims)

Source | Caller

Convert a list of images into a network input. Assumes images were prepared using prep_im_for_blob or equivalent: i.e.

  • BGR channel order
  • pixel means subtracted
  • resized to the desired input size
  • float32 numpy ndarray format

Output is a 4D NCHW tensor of the images concatenated along axis 0.

  1. Get the largest width/height
  2. Pad the images so their dimensions are divisible by the stride defined in COARSEST_STRIDE (128 = 2^7, i.e. P7)
  3. Swap dimensions
    • e.g. (2,512,768,3) to (2,3,512,768)
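
A rough sketch of the padding/transpose logic (my own, with COARSEST_STRIDE hard-coded to 128 for illustration):

import numpy as np

def im_list_to_blob_sketch(ims, coarsest_stride=128):
    """Stack prepared images into an NCHW blob, padding each image so
    H and W are multiples of coarsest_stride (here 128 = 2^7, i.e. P7)."""
    max_shape = np.array([im.shape for im in ims]).max(axis=0)
    # Round the max height/width up to multiples of the coarsest stride
    max_shape[0] = int(np.ceil(max_shape[0] / coarsest_stride) * coarsest_stride)
    max_shape[1] = int(np.ceil(max_shape[1] / coarsest_stride) * coarsest_stride)

    blob = np.zeros((len(ims), max_shape[0], max_shape[1], 3), dtype=np.float32)
    for i, im in enumerate(ims):
        blob[i, :im.shape[0], :im.shape[1], :] = im
    # NHWC -> NCHW, e.g. (2, 512, 768, 3) -> (2, 3, 512, 768)
    return blob.transpose((0, 3, 1, 2))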
JoshVarty commented 5 years ago

Summary of generate_anchors

This method generates a single reference anchor for every combination of pyramid level, aspect ratio and scale. We have 5 pyramid levels (P3,P4,P5,P6,P7), 3 aspect ratios (0.5, 1, 2) and 3 scales (2**0, 2**(1/3), 2**(2/3)). This gives us 5*3*3 = 45 reference anchor boxes. Each reference anchor is positioned at the top-left of the image.

_generate_anchors(base_size, scales, aspect_ratios)

Source | Caller

def _generate_anchors(base_size, scales, aspect_ratios):
    """Generate anchor (reference) windows by enumerating aspect ratios X
    scales wrt a reference (0, 0, base_size - 1, base_size - 1) window.
    """
    anchor = np.array([1, 1, base_size, base_size], dtype=np.float) - 1
    anchors = _ratio_enum(anchor, aspect_ratios)
    anchors = np.vstack(
        [_scale_enum(anchors[i, :], scales) for i in range(anchors.shape[0])]
    )
    return anchors

base_size = 8 (I believe this represents P3)
scales = np.array([4]) (calculated by taking the anchor size 32 and dividing it by the stride of 8)
aspect_ratios = 1

  1. Create anchor of [0, 0, 7, 7]
  2. Apply aspect ratios to anchors. In this case we still have [0,0,7,7]
  3. Apply scales to the anchor. In this case we get [-12, -12, 19, 19]. This gives us an area of about 32x32

base_size = 8 (I believe this represents P3)
scales = np.array([5.039]) (the second octave: 32 * 2^(1/3) divided by the stride of 8)
aspect_ratios = 1

  1. Create anchor of [0, 0, 7, 7]
  2. Apply aspect ratios to the anchor. In this case we still have [0,0,7,7]
  3. Apply scales to the anchor. In this case we get approximately [-16.2, -16.2, 23.2, 23.2]. This gives us an area of about 40x40 (the second octave above 32x32)
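
To tie these two walkthroughs together, here is a short usage snippet of mine (assuming _generate_anchors from above is in scope) that produces the three P3 reference anchors for aspect ratio 1 across the octave scales:

import numpy as np

base_size = 8                                   # P3 stride
octave_scales = 2.0 ** (np.arange(3) / 3.0)     # [1.0, 1.26, 1.587]
scales = (32.0 / base_size) * octave_scales     # [4.0, 5.04, 6.35]

anchors = _generate_anchors(base_size, scales, np.array([1.0]))
print(anchors)
# Rows are x1, y1, x2, y2 centred on (3.5, 3.5)
areas = (anchors[:, 2] - anchors[:, 0] + 1) * (anchors[:, 3] - anchors[:, 1] + 1)
print(np.sqrt(areas))                           # roughly [32.0, 40.3, 50.8]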

_ratio_enum(anchor, ratios)

Source | Caller

def _ratio_enum(anchor, ratios):
    """Enumerate a set of anchors for each aspect ratio wrt an anchor."""
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    size = w * h
    size_ratios = size / ratios
    ws = np.round(np.sqrt(size_ratios))
    hs = np.round(ws * ratios)
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

anchor = array([0., 0., 7., 7.])
ratios = 1.0

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)
    • w=8
    • h=8
    • x_ctr=3.5
    • y_ctr=3.5
  2. Get area (64)
  3. Divide the area by the ratio (1 in this case)
  4. Get adjusted height/width (still 8 in this case)
  5. Call _mkanchors
  6. Return result

anchor = array([0., 0., 7., 7.])
ratios = 0.5

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)
    • w=8
    • h=8
    • x_ctr=3.5
    • y_ctr=3.5
  2. Get area (64)
  3. Divide area by ratio (0.5 in this case gives us 128)
  4. Get adjusted height/width
    • ws=np.round(np.sqrt(128)) gives 11
    • hs = np.round(ws*ratios) gives 6
  5. Call _mkanchors and get array([[-1.5, 1. , 8.5, 6. ]])
  6. Return result
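
For reference, the _whctrs helper used by both _ratio_enum and _scale_enum converts an x1y1x2y2 anchor back to width/height/center form. It looks roughly like this (note the +1: a [0, 0, 7, 7] box is 8 pixels wide):

def _whctrs(anchor):
    """Return width, height, x center, and y center for an anchor (window)."""
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    x_ctr = anchor[0] + 0.5 * (w - 1)
    y_ctr = anchor[1] + 0.5 * (h - 1)
    return w, h, x_ctr, y_ctr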

_scale_enum(anchor, scales)

Source | Caller

def _scale_enum(anchor, scales):
    """Enumerate a set of anchors for each scale wrt an anchor."""
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    ws = w * scales
    hs = h * scales
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

anchor = array([0., 0., 7., 7.])
scales = np.array([4])

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)

    • w=8
    • h=8
    • x_ctr=3.5
    • y_ctr=3.5
  2. Multiply w and h by scales (4) which gives us ws=hs=32

  3. Call _mkanchors and return [-12, -12, 19, 19]


anchor = array([-1.5, 1. , 8.5, 6. ])
scales = np.array([5.04])

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)

    • w=11
    • h=6
    • x_ctr=3.5
    • y_ctr=3.5
  2. Multiply w and h by scales (5.04) which gives us ws=55.4 and hs=30.2

  3. Call _mkanchors and return array([[-23.71, -11.11, 30.71, 18.11]])

_mkanchors(ws, hs, x_ctr, y_ctr)

Source | Caller

def _mkanchors(ws, hs, x_ctr, y_ctr):
    """Given a vector of widths (ws) and heights (hs) around a center
    (x_ctr, y_ctr), output a set of anchors (windows).
    """
    ws = ws[:, np.newaxis]
    hs = hs[:, np.newaxis]
    anchors = np.hstack(
        (
            x_ctr - 0.5 * (ws - 1),
            y_ctr - 0.5 * (hs - 1),
            x_ctr + 0.5 * (ws - 1),
            y_ctr + 0.5 * (hs - 1)
        )
    )
    return anchors

ws = np.array([8])
hs = np.array([8])
x_ctr = 3.5
y_ctr = 3.5

  1. Add an explicit new axis to ws and hs taking them from (1,) to (1,1)
  2. Convert back to x1y1x2y2 and return in an array: array([[0., 0., 7., 7.]])
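
A quick round-trip check (my own snippet) using the values from the walkthroughs above:

import numpy as np

# (w=8, h=8, centre 3.5, 3.5) -> the base P3 anchor
print(_mkanchors(np.array([8.]), np.array([8.]), 3.5, 3.5))
# [[0. 0. 7. 7.]]

# (w=11, h=6, centre 3.5, 3.5) -> the ratio=0.5 anchor from _ratio_enum above
print(_mkanchors(np.array([11.]), np.array([6.]), 3.5, 3.5))
# [[-1.5  1.   8.5  6. ]]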
JoshVarty commented 5 years ago

Summary of get_field_of_anchors

We are provided with a stride, an anchor_size and an anchor_aspect_ratio. We first generate a single reference anchor box for the top-left of the image. Then we use the maximum size of any possible image (896x896) to create a grid of x1y1x2y2 shifts representing every position at which an anchor could be placed. Finally we add the reference anchor to each shift, producing an anchor of the appropriate shape and size at every location.

get_field_of_anchors(stride, anchor_sizes, anchor_aspect_ratios, octave, aspect)

Caller | Source

stride = 8
anchor_sizes = 32
anchor_aspect_ratios = 1.0
octave = 0
aspect = 0

  1. Generate a reference anchor. See above.

    • [-12,-12,19,19]
  2. Get the max size (896)

  3. field_size=fpn_max_size/stride or 896/8 which is 112

  4. shifts=np.arange(0, field_size) * stride which gives us 112 entries

    • [0, 8, 16, ... 880, 888]
  5. shift_x, shift_y = np.meshgrid(shifts, shifts)

    • shift_x is (112,112) Each row is [0, 8, ... 880, 888]
    • shift_y is (112,112) Looks like [[0,0,...,0,0], [8,8...8,8]...[888,888...,888]]
  6. ravel() both into single arrays of size (12544,)

  7. shifts = np.vstack((shift_x, shift_y, shift_x, shift_y)).transpose()

    • Shape: (12544,4)
    • Each of the 12544 entries is of the shape x1,y1,x2,y2
    • Contains [[0,0,0,0],[8,0,8,0],[16,0,16,0]...[880,888,880,888][888,888,888,888]]
  8. Broadcast anchors over shifts to enumerate all anchors at all positions in the (H, W) grid:

    • add A cell anchors of shape (1, A, 4) to
    • K shifts of shape (K, 1, 4) to get
    • all shifted anchors of shape (K, A, 4)
    • reshape to (K*A, 4) shifted anchors
  9. A = num_cell_anchors gives 1 (since we're only working with a single anchor at a time)

  10. K = shifts.shape[0] gives 12544

  11. field_of_anchors = (cell_anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2)))

    • First we add our anchor [-12,-12,19,19] to each shift ([0,0,0,0], [8,0,8,0], ...). They are all in x1y1x2y2 format.
    • Then we transpose the dimensions to create a shape of (12544,1,4). Note that this code was written to handle multiple anchors, but we're only working with one at a time.
  12. field_of_anchors = field_of_anchors.reshape((K * A, 4))

    • Reshape into (12544,4) since A=1
  13. Pack into foa and return.
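
The heart of the computation condenses to a few lines of broadcasting. A sketch under the assumptions used in this walkthrough (a single cell anchor, so A=1, and fpn_max_size=896); the function and variable names are mine:

import numpy as np

def field_of_anchors_sketch(cell_anchor, stride, fpn_max_size=896):
    """Shift one reference anchor to every stride-spaced position in the
    largest possible image, returning a (K, 4) array of x1y1x2y2 boxes."""
    field_size = int(np.ceil(fpn_max_size / float(stride)))  # 896/8 = 112
    shifts = np.arange(0, field_size) * stride               # [0, 8, ..., 888]
    shift_x, shift_y = np.meshgrid(shifts, shifts)
    shifts = np.vstack(
        (shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel())
    ).transpose()                                            # (K, 4), K = 112*112

    A = 1                                # one cell anchor at a time here
    K = shifts.shape[0]
    field = (cell_anchor.reshape((1, A, 4)) +
             shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    return field.reshape((K * A, 4))

foa = field_of_anchors_sketch(np.array([-12., -12., 19., 19.]), stride=8)
print(foa.shape)   # (12544, 4)
print(foa[:2])     # [[-12. -12.  19.  19.], [ -4. -12.  27.  19.]]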

JoshVarty commented 5 years ago

add_retinanet_blobs(blobs, im_scales, roidb, image_width, image_height)

Caller | Source

blobs - the dictionary of network input blobs for RetinaNet that this function fills in
im_scales - how much each image has been scaled by
roidb - the full information on each of the input images (Region of Interest DB)
image_width = 768
image_height = 512

  1. Get the max/min levels k_max=7 and k_min=3
  2. scales_per_octave = 3
  3. num_aspect_ratios = 3
  4. aspect_ratios = [1.0,2.0,0.5]
  5. anchor_scale = 4
  6. Get field_of_anchors for each possible combination of pyramid level, aspect ratio and scale. In our case this gives us 45 fields of anchors.
    • P3 fields: 9 of (12544,4)
    • P4 fields: 9 of (3136,4)
    • P5 fields: 9 of (784,4)
    • P6 fields: 9 of (196,4)
    • P7 fields: 9 of (49,4)
  7. Combine these into all_anchors of shape (150381,4)
  8. For each entry in the roidb
    1. Get the scale by which the image has been resized
    2. Get the scaled image width and height (e.g. 500 and 749)
    3. Rescale the ground truth boxes by this scale
    4. Get the corresponding classes for these boxes
    5. Store height, width and scale in im_info. See: #3
    6. Call _get_retinanet_blobs()
      • retinanet_blobs contains 45 entries of:
      • retnet_cls_labels - the anchor boxes with classes
      • retnet_roi_bbox_targets - the anchor box targets for positive classes
      • retnet_roi_fg_bbox_locs - indices of positive classes/bboxes
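
Step 6 of the outer list (the 45 fields of anchors) comes from nesting pyramid levels, octaves and aspect ratios. Roughly, as my own reconstruction (the exact argument handling in Detectron may differ slightly, and the .field_of_anchors attribute name is assumed):

import numpy as np

k_min, k_max = 3, 7                    # P3 .. P7
scales_per_octave = 3
aspect_ratios = (1.0, 2.0, 0.5)
anchor_scale = 4

foas = []
for lvl in range(k_min, k_max + 1):
    stride = 2 ** lvl                  # 8, 16, 32, 64, 128
    for octave in range(scales_per_octave):
        octave_scale = 2 ** (octave / float(scales_per_octave))
        anchor_size = stride * anchor_scale * octave_scale   # 32, 40.3, 50.8, ...
        for idx, ratio in enumerate(aspect_ratios):
            foas.append(get_field_of_anchors(stride, anchor_size, ratio,
                                             octave=octave, aspect=idx))

# 5 levels * 3 octaves * 3 aspect ratios = 45 fields of anchors
all_anchors = np.concatenate([f.field_of_anchors for f in foas])  # attribute name assumed
print(len(foas), all_anchors.shape)    # 45 (150381, 4)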
JoshVarty commented 5 years ago

_get_retinanet_blobs(foas, all_anchors, gt_boxes, gt_classes, im_width, im_height)

Caller | Source

foas - 45 fields of anchors
all_anchors - raw numpy array with all fields of anchors stacked. Shape (150381,4)
gt_boxes - scaled ground truth boxes for this image
gt_classes - ground truth classes for this image
im_width - width: 768
im_height - height: 512

  1. inds_inside = [0,1,.. 150379, 150380] Shape: (150381,)

  2. num_inside = 150381

  3. Create labels of size (150381,) filled with -1. (1 or more is positive, 0 negative, -1 ignore)

  4. Compute overlap between all anchor boxes and ground truth boxes (uses Cython) (150381,2)

  5. Map each anchor to the class with highest overlap (150381,)

  6. Keep track of the actual overlap for the max class.

  7. For each ground truth box, find the anchor with the most overlap. Since we have two ground truth boxes, our results look like: array([146001, 138954])

    • anchor_by_gt_overlap[146001] = array([0.7544641 , 0.25800186], dtype=float32)
    • anchor_by_gt_overlap[138954] = array([0.20713282, 0.6786918 ], dtype=float32)
    • An interesting note is that our very best anchor boxes can only give us 0.75 and 0.67. That feels bad to me. An interesting experiment might be to choose anchor boxes (scales, ratios etc.) by the maximum values we can find across our dataset.
  8. Get the maximum amounts of overlap: array([0.7544641, 0.6786918])

  9. Get the indices of the anchors that have these max overlap values (because there may be ties)

  10. Set the labels (previously all -1) to the class value for these max overlap values

  11. Set the labels to the class value for any anchor with overlap more than 0.5

    • This seems strange to me since we just added the "max" values a second ago. Why add the max at all? I guess it's so that when no anchor overlaps a gt box by more than 0.5 we still get at least one positive anchor per gt box (see the sketch after this list).
  12. Set fg_inds to indices of the labels >=1 (non-background/non-ignored)

    • 56 elements
  13. Set bg_inds to indices where a box's greatest overlap is less than NEGATIVE_OVERLAP (0.4)

    • 150217 elements
  14. Set labels to 0 for all bg_inds

    • Labels now has
      • 150217 Background
      • 108 Don't care (Overlap between 0.4 and 0.5)
      • 56 positive class IDs
  15. Create bbox_targets of shape (150381,4)

  16. Get the bounding box regression targets from compute_targets()/bbox_transform_inv()

  17. Calls to unmap() which don't appear to do anything if total_anchors==len(inds_inside)

    • Investigate further in #4
  18. Create blobs_out=[] and start_idx=0

  19. For each foa in foas:

    1. Get the height and width of the foa (112)
    2. end_idx=start_idx + H*W
    3. Get a subset of the labels and bbox_targets from start_idx to end_idx
    4. start_idx = end_idx
    5. Reshape labels to (1, 1, height, width)
    6. Reshape bbox_targets to shape (1, 4*A, height, width) eg. (1,4,112,112)
    7. Get stride eg. 8
    8. Calculate the number of horizontal and vertical steps 96 and 64
    9. Get number of classes eg 80
    10. inds_4d is np.where(_labels > 0), which is just 4 empty arrays in this case
      • There's a check on M which seems incorrect/useless here
    11. Store the outputs in blobs
      • retnet_cls_labels: (1,1,64,96) all 0
      • retnet_roi_bbox_targets (0,4)
      • retnet_roi_fg_bbox_locs (0,4)
  20. Looking at the 15th foa

    1. Get the height and width of the foa (56)
    2. end_idx=start_idx + H*W 131712
    3. Get a subset of the labels and bbox_targets from start_idx to end_idx
    4. start_idx = end_idx
    5. Reshape labels to (1, 1, height, width) (1,1,56,56)
    6. Reshape bbox_targets to shape (1, 4*A, height, width) eg. (1,4,56,56)
    7. Get stride eg. 16
    8. Calculate the number of horizontal and vertical steps 48 and 32
    9. Get number of classes eg 80
    10. inds_4d is np.where(_labels > 0)
      • [0,0,0,0,0]
      • [0,0,0,0,0]
      • [17,17,17,17,17]
      • [17,18,19,20,21]
      • Read them vertically: First entry is _labels[0,0,17,17]
    11. Get img_inds (always 0), y and x from inds_4d
    12. Create _roi_bbox_targets (5,4)
    13. Create _roi_fg_bbox_locs (5,4)
    14. Get each label value and decrement it (classes are 1-indexed); store [img_ind, class, y, x] in _roi_fg_bbox_locs
    15. Get the bbox regression targets for these locations and store them in _roi_bbox_targets
    16. Store the outputs in blobs
      • retnet_cls_labels: (1,1,32,48) All labeled anchor boxes
      • retnet_roi_bbox_targets (5,4) Regression boxes for positive anchor boxes
      • retnet_roi_fg_bbox_locs (5,4) Indices of anchor boxes with classes
  21. out_num_fg = np.array([57])

    • the number of positive anchor boxes + 1
  22. out_num_bg = 12021943

    • The number of background boxes?
    • (num_bg + 1) * 80 + 57 * 79
  23. Return blobs, out_num_fg, out_num_bg
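
To make steps 5-14 concrete, here is a minimal sketch of mine of the label assignment, assuming the (num_anchors, num_gt) IoU matrix from bbox_overlaps is already computed; the 0.5/0.4 thresholds are the POSITIVE_OVERLAP/NEGATIVE_OVERLAP config values:

import numpy as np

def assign_labels_sketch(overlaps, gt_classes, pos_thresh=0.5, neg_thresh=0.4):
    """overlaps: (num_anchors, num_gt) IoU matrix. gt_classes: (num_gt,) 1-indexed.
    Returns per-anchor labels: >=1 foreground, 0 background, -1 ignore."""
    num_anchors = overlaps.shape[0]
    labels = np.full((num_anchors,), -1, dtype=np.int32)

    # Best gt box for every anchor, and the corresponding IoU (steps 5-6)
    anchor_to_gt_argmax = overlaps.argmax(axis=1)
    anchor_to_gt_max = overlaps[np.arange(num_anchors), anchor_to_gt_argmax]

    # Best anchor for every gt box (steps 7-10): each gt box gets at least
    # one positive anchor even if no IoU clears pos_thresh (ties included)
    gt_to_anchor_max = overlaps.max(axis=0)
    best = np.where(overlaps == gt_to_anchor_max)[0]
    labels[best] = gt_classes[anchor_to_gt_argmax[best]]

    # Any anchor above the positive threshold is also foreground (step 11)
    fg = anchor_to_gt_max >= pos_thresh
    labels[fg] = gt_classes[anchor_to_gt_argmax[fg]]

    # Anchors below the negative threshold become background; anchors with
    # IoU between 0.4 and 0.5 stay -1 ("don't care") (steps 13-14)
    labels[anchor_to_gt_max < neg_thresh] = 0
    return labels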

JoshVarty commented 5 years ago

Summary of bbox_transform_inv

This method finds the bounding box regression targets for the anchor boxes we've deemed "foreground". It finds how far we have to shift along the x and y axes, and how much to scale the width and height, for each anchor box to match its ground truth box.

For example, given a foreground anchor box and its matched ground truth box, we would get a target like [0.2869193 , 0.06916239, 0.5155515 , 0.09623668] in (dx, dy, dw, dh) format, where dw and dh are in log space.

So for width: e^(0.5155515) * (336-222) gives ~190, which is about the ground truth width (409-216). (I omitted decimals for readability.)

bbox_transform_inv(boxes, gt_boxes, weights)

Caller | Source

Description: Inverse transform that computes target bounding-box regression deltas given proposal boxes and ground-truth boxes. The weights argument should be a 4-tuple of multiplicative weights that are applied to the regression targets.

In older versions of this code (and in py-faster-rcnn) the weights were set such that the regression deltas would have unit standard deviation on the training dataset. Presently, rather than computing these statistics exactly, we use a fixed set of weights (10, 10, 5, 5) by default. These are approximately the weights one would get from COCO using the previous unit stddev heuristic.

boxes - the 56 positive anchor boxes
gt_boxes - the 56 corresponding ground truth boxes (the 2 are duplicated as appropriate)
weights - (1,1,1,1) I'm fairly sure they don't matter here

  1. Get the ex_widths, ex_heights, ex_ctr_x and ex_ctr_y for the boxes
  2. Get the gt_widths, gt_heights, gt_ctr_x and gt_ctr_y for the ground truth boxes
  3. Find the x offset, y offset, width offset and height offset.
    • targets_dx [ 0.2869193 , 0.1488844 , 0.01084953...]
    • targets_dy [ 0.06916239, 0.06916239, 0.06916239...]
    • targets_dw [0.5155515 , 0.5155515 , 0.5155515...]
    • targets_dh [ 0.09623668, 0.09623668, 0.09623668...]
  4. Vertically stack them into targets
    • targets [[ 0.2869193 , 0.06916239, 0.5155515 , 0.09623668],...]
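
For completeness, the transform in code; this is roughly the Detectron implementation (the +1.0 reflects the inclusive pixel-coordinate convention used for boxes throughout):

import numpy as np

def bbox_transform_inv_sketch(boxes, gt_boxes, weights=(1.0, 1.0, 1.0, 1.0)):
    """Compute (dx, dy, dw, dh) regression targets mapping each anchor box
    onto its matched ground-truth box. dw and dh are in log space."""
    ex_widths = boxes[:, 2] - boxes[:, 0] + 1.0
    ex_heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ex_ctr_x = boxes[:, 0] + 0.5 * ex_widths
    ex_ctr_y = boxes[:, 1] + 0.5 * ex_heights

    gt_widths = gt_boxes[:, 2] - gt_boxes[:, 0] + 1.0
    gt_heights = gt_boxes[:, 3] - gt_boxes[:, 1] + 1.0
    gt_ctr_x = gt_boxes[:, 0] + 0.5 * gt_widths
    gt_ctr_y = gt_boxes[:, 1] + 0.5 * gt_heights

    wx, wy, ww, wh = weights
    targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = ww * np.log(gt_widths / ex_widths)
    targets_dh = wh * np.log(gt_heights / ex_heights)

    # Stack into an (N, 4) array of [dx, dy, dw, dh] rows
    return np.vstack((targets_dx, targets_dy, targets_dw, targets_dh)).transpose()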