JoshVarty / pytorch-retinanet

Reproducing the Detectron implementation of RetinaNet

Minibatch loader codepath #2

Open JoshVarty opened 5 years ago

JoshVarty commented 5 years ago

The primary codepath starts a number of threads that load images from disk in minibatches.

The minibatch loader codepath is much smaller, but the individual functions are often more involved and not always immediately clear.

minibatch_loader_thread(self)
    get_next_minibatch()
        _get_next_minibatch_inds()
        _get_minibatch(roidb)
            get_mini_batch_blob_names()
                get_retinanet_blob_names(is_training=True)
            _get_image_blob(roidb)
                ❗️prep_im_for_blob(im, pixel_means, target_size, max_size)
                ❗️im_list_to_blob(ims)

         ❗️add_retinanet_blobs(blobs, im_scales, roidb, image_width, image_height)
           ❗️get_field_of_anchors(stride, anchor_sizes, anchor_aspect_ratios, octave=None, aspect=None)
                generate_anchors(stride, sizes, aspect_ratios)
                    ❗️_generate_anchors(base_size, scales, aspect_ratios)
                        _ratio_enum(anchor, ratios)
                        _scale_enum(anchor, scales)
                FieldOfAnchors()
           ❗️_get_retinanet_blobs(foas, all_anchors, gt_boxes, gt_classes, im_width, im_height)
                bbox_overlaps(anchors, gt_boxes)
                compute_targets(ex_rois, gt_rois, weights=(1.0, 1.0, 1.0, 1.0))
                   ❗️bbox_transform_inv(boxes, gt_boxes, weights=(1.0, 1.0, 1.0, 1.0))
                unmap(data, count, inds, fill=0)

    coordinated_put(coordinator, queue, element)
JoshVarty commented 5 years ago

prep_im_for_blob(im, pixel_means, target_size, max_size)

Source | Caller

Prepare an image for use as a network input blob. Specifically:

  • Subtract per-channel pixel mean
  • Convert to float32
  • Rescale to each of the specified target sizes (capped at max_size)

Returns a list of transformed images, one for each target size, along with the scale factors that were used to compute each returned image.

im = a single input image
pixel_means = np.array([[[102.9801, 115.9465, 122.7717]]]) (from config)
target_size = 500 (from config)
max_size = 833 (from config)
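
For intuition, here is a minimal sketch of what this preparation amounts to for a single target size (my own reconstruction, not the Detectron source; Detectron's version loops over a list of target sizes and returns lists):

import cv2
import numpy as np

def prep_im_for_blob_sketch(im, pixel_means, target_size, max_size):
    """Subtract the per-channel mean, then rescale so the shorter side
    is target_size, capping the longer side at max_size."""
    im = im.astype(np.float32, copy=False)
    im -= pixel_means
    im_size_min = np.min(im.shape[0:2])
    im_size_max = np.max(im.shape[0:2])
    im_scale = float(target_size) / float(im_size_min)
    # Prevent the longer side from exceeding max_size
    if np.round(im_scale * im_size_max) > max_size:
        im_scale = float(max_size) / float(im_size_max)
    im = cv2.resize(im, None, None, fx=im_scale, fy=im_scale,
                    interpolation=cv2.INTER_LINEAR)
    return im, im_scale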

JoshVarty commented 5 years ago

im_list_to_blob(ims)

Source | Caller

Convert a list of images into a network input. Assumes images were prepared using prep_im_for_blob or equivalent: i.e.

  • BGR channel order
  • pixel means subtracted
  • resized to the desired input size
  • float32 numpy ndarray format

Output is a 4D NCHW tensor of the images concatenated along axis 0.

  1. Get the largest width/height
  2. Pad the images so their dimensions are divisible by the stride defined in COARSEST_STRIDE (128 = 2^7, i.e. P7)
  3. Swap dimensions
    • e.g. (2,512,768,3) to (2,3,512,768)
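
A rough sketch of the padding/transpose logic (my own, with COARSEST_STRIDE hard-coded to 128 for illustration):

import numpy as np

def im_list_to_blob_sketch(ims, coarsest_stride=128):
    """Stack prepared images into an NCHW blob, padding each image so
    H and W are multiples of coarsest_stride (here 128 = 2^7, i.e. P7)."""
    max_shape = np.array([im.shape for im in ims]).max(axis=0)
    # Round the max height/width up to multiples of the coarsest stride
    max_shape[0] = int(np.ceil(max_shape[0] / coarsest_stride) * coarsest_stride)
    max_shape[1] = int(np.ceil(max_shape[1] / coarsest_stride) * coarsest_stride)

    blob = np.zeros((len(ims), max_shape[0], max_shape[1], 3), dtype=np.float32)
    for i, im in enumerate(ims):
        blob[i, :im.shape[0], :im.shape[1], :] = im
    # NHWC -> NCHW, e.g. (2, 512, 768, 3) -> (2, 3, 512, 768)
    return blob.transpose((0, 3, 1, 2))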
JoshVarty commented 5 years ago

Summary of generate_anchors

This method generates a single reference anchor for every combination of pyramid level, aspect ratio and scale. We have 5 pyramid levels (P3,P4,P5,P6,P7), 3 aspect ratios (0.5, 1, 2) and 3 scales (2**0, 2**(1/3), 2**(2/3)). This gives us 5*3*3 = 45 reference anchor boxes. Each reference anchor is positioned at the top-left of the image.

_generate_anchors(base_size, scales, aspect_ratios)

Source | Caller

def _generate_anchors(base_size, scales, aspect_ratios):
    """Generate anchor (reference) windows by enumerating aspect ratios X
    scales wrt a reference (0, 0, base_size - 1, base_size - 1) window.
    """
    anchor = np.array([1, 1, base_size, base_size], dtype=np.float) - 1
    anchors = _ratio_enum(anchor, aspect_ratios)
    anchors = np.vstack(
        [_scale_enum(anchors[i, :], scales) for i in range(anchors.shape[0])]
    )
    return anchors

base_size = 8 (I believe this represents P3)
scales = np.array([4]) (calculated by taking the anchor size 32 and dividing it by the stride of 8)
aspect_ratios = 1

  1. Create anchor of [0, 0, 7, 7]
  2. Apply aspect ratios to anchors. In this case we still have [0,0,7,7]
  3. Apply scales to the anchor. In this case we get [-12, -12, 19, 19]. This gives us an area of about 32x32

base_size = 8 (I believe this represents P3)
scales = np.array([5.039]) (the second octave: 32 * 2^(1/3) divided by the stride of 8)
aspect_ratios = 1

  1. Create anchor of [0, 0, 7, 7]
  2. Apply aspect ratios to the anchor. In this case we still have [0,0,7,7]
  3. Apply scales to the anchor. In this case we get approximately [-16.2, -16.2, 23.2, 23.2]. This gives us an area of about 40x40 (the second octave above 32x32)
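
To tie these two walkthroughs together, here is a short usage snippet of mine (assuming _generate_anchors from above is in scope) that produces the three P3 reference anchors for aspect ratio 1 across the octave scales:

import numpy as np

base_size = 8                                   # P3 stride
octave_scales = 2.0 ** (np.arange(3) / 3.0)     # [1.0, 1.26, 1.587]
scales = (32.0 / base_size) * octave_scales     # [4.0, 5.04, 6.35]

anchors = _generate_anchors(base_size, scales, np.array([1.0]))
print(anchors)
# Rows are x1, y1, x2, y2 centred on (3.5, 3.5)
areas = (anchors[:, 2] - anchors[:, 0] + 1) * (anchors[:, 3] - anchors[:, 1] + 1)
print(np.sqrt(areas))                           # roughly [32.0, 40.3, 50.8]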

_ratio_enum(anchor, ratios)

Source | Caller

def _ratio_enum(anchor, ratios):
    """Enumerate a set of anchors for each aspect ratio wrt an anchor."""
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    size = w * h
    size_ratios = size / ratios
    ws = np.round(np.sqrt(size_ratios))
    hs = np.round(ws * ratios)
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

anchor = array([0., 0., 7., 7.])
ratios = 1.0

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)
    • w=8
    • h=8
    • x_ctr=3.5
    • y_ctr=3.5
  2. Get area (64)
  3. Divide the area by the ratio (1 in this case)
  4. Get adjusted height/width (still 8 in this case)
  5. Call _mkanchors
  6. Return result

anchor = array([0., 0., 7., 7.])
ratios = 0.5

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)
    • w=8
    • h=8
    • x_ctr=3.5
    • y_ctr=3.5
  2. Get area (64)
  3. Divide area by ratio (0.5 in this case gives us 128)
  4. Get adjusted height/width
    • ws=np.round(np.sqrt(128)) gives 11
    • hs = np.round(ws*ratios) gives 6
  5. Call _mkanchors and get array([[-1.5, 1. , 8.5, 6. ]])
  6. Return result
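
For reference, the _whctrs helper used by both _ratio_enum and _scale_enum converts an x1y1x2y2 anchor back to width/height/center form. It looks roughly like this (note the +1: a [0, 0, 7, 7] box is 8 pixels wide):

def _whctrs(anchor):
    """Return width, height, x center, and y center for an anchor (window)."""
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    x_ctr = anchor[0] + 0.5 * (w - 1)
    y_ctr = anchor[1] + 0.5 * (h - 1)
    return w, h, x_ctr, y_ctr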

_scale_enum(anchor, scales)

Source | Caller

def _scale_enum(anchor, scales):
    """Enumerate a set of anchors for each scale wrt an anchor."""
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    ws = w * scales
    hs = h * scales
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

anchor = array([0., 0., 7., 7.])
scales = np.array([4])

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)

    • w=8
    • h=8
    • x_ctr=3.5
    • y_ctr=3.5
  2. Multiply w and h by scales (4) which gives us ws=hs=32

  3. Call _mkanchors and return [-12, -12, 19, 19]


anchor = array([-1.5, 1. , 8.5, 6. ])
scales = np.array([5.04])

  1. First, convert from x1y1x2y2 to (w, h, x_ctr, y_ctr)

    • w=11
    • h=6
    • x_ctr=3.5
    • y_ctr=3.5
  2. Multiply w and h by scales (5.04) which gives us ws=55.4 and hs=30.2

  3. Call _mkanchors and return array([[-23.71, -11.11, 30.71, 18.11]])

_mkanchors(ws, hs, x_ctr, y_ctr)

Source | Caller

def _mkanchors(ws, hs, x_ctr, y_ctr):
    """Given a vector of widths (ws) and heights (hs) around a center
    (x_ctr, y_ctr), output a set of anchors (windows).
    """
    ws = ws[:, np.newaxis]
    hs = hs[:, np.newaxis]
    anchors = np.hstack(
        (
            x_ctr - 0.5 * (ws - 1),
            y_ctr - 0.5 * (hs - 1),
            x_ctr + 0.5 * (ws - 1),
            y_ctr + 0.5 * (hs - 1)
        )
    )
    return anchors

ws = np.array([8])
hs = np.array([8])
x_ctr = 3.5
y_ctr = 3.5

  1. Add an explicit new axis to ws and hs taking them from (1,) to (1,1)
  2. Convert back to x1y1x2y2 and return in an array: array([[0., 0., 7., 7.]])
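
A quick round-trip check (my own snippet) using the values from the walkthroughs above:

import numpy as np

# (w=8, h=8, centre 3.5, 3.5) -> the base P3 anchor
print(_mkanchors(np.array([8.]), np.array([8.]), 3.5, 3.5))
# [[0. 0. 7. 7.]]

# (w=11, h=6, centre 3.5, 3.5) -> the ratio=0.5 anchor from _ratio_enum above
print(_mkanchors(np.array([11.]), np.array([6.]), 3.5, 3.5))
# [[-1.5  1.   8.5  6. ]]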
JoshVarty commented 5 years ago

Summary of get_field_of_anchors

We are provided with a stride, an anchor_size and an anchor_aspect_ratio. We first generate a single reference anchor box for the top-left of the image. Then we use the maximum size of any possible image (896x896) to create a grid of x1y1x2y2 shifts representing every position at which an anchor could be placed. Finally we add the reference anchor to each shift, producing an anchor of the appropriate shape and size at every location.

get_field_of_anchors(stride, anchor_sizes, anchor_aspect_ratios, octave, aspect)

Caller | Source

stride = 8
anchor_sizes = 32
anchor_aspect_ratios = 1.0
octave = 0
aspect = 0

  1. Generate a reference anchor. See above.

    • [-12,-12,19,19]
  2. Get the max size (896)

  3. field_size=fpn_max_size/stride or 896/8 which is 112

  4. shifts=np.arange(0, field_size) * stride which gives us 112 entries

    • [0, 8, 16, ... 880, 888]
  5. shift_x, shift_y = np.meshgrid(shifts, shifts)

    • shift_x is (112,112) Each row is [0, 8, ... 880, 888]
    • shift_y is (112,112) Looks like [[0,0,...,0,0], [8,8...8,8]...[888,888...,888]]
  6. ravel() both into single arrays of size (12544,)

  7. shifts = np.vstack((shift_x, shift_y, shift_x, shift_y)).transpose()

    • Shape: (12544,4)
    • Each of the 12544 entries is of the shape x1,y1,x2,y2
    • Contains [[0,0,0,0],[8,0,8,0],[16,0,16,0]...[880,888,880,888][888,888,888,888]]
  8. Broadcast anchors over shifts to enumerate all anchors at all positions in the (H, W) grid:

    • add A cell anchors of shape (1, A, 4) to
    • K shifts of shape (K, 1, 4) to get
    • all shifted anchors of shape (K, A, 4)
    • reshape to (K*A, 4) shifted anchors
  9. A = num_cell_anchors gives 1 (since we're only working with a single anchor at a time)

  10. K = shifts.shape[0] gives 12544

  11. field_of_anchors = (cell_anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2)))

    • First we add our anchor [-12,-12,19,19] to each shift ([0,0,0,0], [8,0,8,0], ...). They are all in x1y1x2y2 format.
    • Then we transpose the dimensions to create a shape of (12544,1,4). Note that this code was written to handle multiple anchors, but we're only working with one at a time.
  12. field_of_anchors = field_of_anchors.reshape((K * A, 4))

    • Reshape into (12544,4) since A=1
  13. Pack into foa and return.
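
The heart of the computation condenses to a few lines of broadcasting. A sketch under the assumptions used in this walkthrough (a single cell anchor, so A=1, and fpn_max_size=896); the function and variable names are mine:

import numpy as np

def field_of_anchors_sketch(cell_anchor, stride, fpn_max_size=896):
    """Shift one reference anchor to every stride-spaced position in the
    largest possible image, returning a (K, 4) array of x1y1x2y2 boxes."""
    field_size = int(np.ceil(fpn_max_size / float(stride)))  # 896/8 = 112
    shifts = np.arange(0, field_size) * stride               # [0, 8, ..., 888]
    shift_x, shift_y = np.meshgrid(shifts, shifts)
    shifts = np.vstack(
        (shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel())
    ).transpose()                                            # (K, 4), K = 112*112

    A = 1                                # one cell anchor at a time here
    K = shifts.shape[0]
    field = (cell_anchor.reshape((1, A, 4)) +
             shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    return field.reshape((K * A, 4))

foa = field_of_anchors_sketch(np.array([-12., -12., 19., 19.]), stride=8)
print(foa.shape)   # (12544, 4)
print(foa[:2])     # [[-12. -12.  19.  19.], [ -4. -12.  27.  19.]]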

JoshVarty commented 5 years ago

add_retinanet_blobs(blobs, im_scales, roidb, image_width, image_height)

Caller | Source

blobs - the dictionary of network input blobs for RetinaNet that this function fills in
im_scales - how much each image has been scaled by
roidb - the full information on each of the input images (Region of Interest DB)
image_width = 768
image_height = 512

  1. Get the max/min levels k_max=7 and k_min=3
  2. scales_per_octave = 3
  3. num_aspect_ratios = 3
  4. aspect_ratios = [1.0,2.0,0.5]
  5. anchor_scale = 4
  6. Get field_of_anchors for each possible combination of pyramid level, aspect ratio and scale. In our case this gives us 45 fields of anchors.
    • P3 fields: 9 of (12544,4)
    • P4 fields: 9 of (3136,4)
    • P5 fields: 9 of (784,4)
    • P6 fields: 9 of (196,4)
    • P7 fields: 9 of (49,4)
  7. Combine these into all_anchors of shape (150381,4)
  8. For each entry in the roidb
    1. Get the scale by which the image has been resized
    2. Get the scaled image width and height (e.g. 500 and 749)
    3. Rescale the ground truth boxes by this scale
    4. Get the corresponding classes for these boxes
    5. Store height, width and scale in im_info. See: #3
    6. Call _get_retinanet_blobs()
      • retinanet_blobs contains 45 entries of:
      • retnet_cls_labels - the anchor boxes with classes
      • retnet_roi_bbox_targets - the anchor box targets for positive classes
      • retnet_roi_fg_bbox_locs - indices of positive classes/bboxes
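
Step 6 of the outer list (the 45 fields of anchors) comes from nesting pyramid levels, octaves and aspect ratios. Roughly, as my own reconstruction (the exact argument handling in Detectron may differ slightly, and the .field_of_anchors attribute name is assumed):

import numpy as np

k_min, k_max = 3, 7                    # P3 .. P7
scales_per_octave = 3
aspect_ratios = (1.0, 2.0, 0.5)
anchor_scale = 4

foas = []
for lvl in range(k_min, k_max + 1):
    stride = 2 ** lvl                  # 8, 16, 32, 64, 128
    for octave in range(scales_per_octave):
        octave_scale = 2 ** (octave / float(scales_per_octave))
        anchor_size = stride * anchor_scale * octave_scale   # 32, 40.3, 50.8, ...
        for idx, ratio in enumerate(aspect_ratios):
            foas.append(get_field_of_anchors(stride, anchor_size, ratio,
                                             octave=octave, aspect=idx))

# 5 levels * 3 octaves * 3 aspect ratios = 45 fields of anchors
all_anchors = np.concatenate([f.field_of_anchors for f in foas])  # attribute name assumed
print(len(foas), all_anchors.shape)    # 45 (150381, 4)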
JoshVarty commented 5 years ago

_get_retinanet_blobs(foas, all_anchors, gt_boxes, gt_classes, im_width, im_height)

Caller | Source

foas - 45 fields of anchors
all_anchors - raw numpy array with all fields of anchors stacked. Shape (150381,4)
gt_boxes - scaled ground truth boxes for this image
gt_classes - ground truth classes for this image
im_width - width: 768
im_height - height: 512

  1. inds_inside = [0,1,.. 150379, 150380] Shape: (150381,)

  2. num_inside = 150381

  3. Create labels of size (150381,) filled with -1. (1 or more is positive, 0 negative, -1 ignore)

  4. Compute overlap between all anchor boxes and ground truth boxes (uses Cython) (150381,2)

  5. Map each anchor to the class with highest overlap (150381,)

  6. Keep track of the actual overlap for the max class.

  7. For each ground truth box, find the anchor with the most overlap. Since we have two ground truth boxes, our results look like: array([146001, 138954])

    • anchor_by_gt_overlap[146001] = array([0.7544641 , 0.25800186], dtype=float32)
    • anchor_by_gt_overlap[138954] = array([0.20713282, 0.6786918 ], dtype=float32)
    • An interesting note is that our very best anchor boxes can only give us 0.75 and 0.67. That feels bad to me. An interesting experiment might be to choose anchor boxes (scales, ratios etc.) by the maximum values we can find across our dataset.
  8. Get the maximum amounts of overlap: array([0.7544641, 0.6786918])

  9. Get the indices of the anchors that have these max overlap values (because there may be ties)

  10. Set the labels (previously all -1) to the class value for these max overlap values

  11. Set the labels to the class value for any anchor with overlap more than 0.5

    • This seems strange to me since we just added the "max" values a second ago. Why add the max at all? I guess it's so that when no anchor overlaps a gt box by more than 0.5 we still get at least one positive anchor per gt box (see the sketch after this list).
  12. Set fg_inds to indices of the labels >=1 (non-background/non-ignored)

    • 56 elements
  13. Set bg_inds to indices where a box's greatest overlap is less than NEGATIVE_OVERLAP (0.4)

    • 150217 elements
  14. Set labels to 0 for all bg_inds

    • Labels now has
      • 150217 Background
      • 108 Don't care (Overlap between 0.4 and 0.5)
      • 56 positive class IDs
  15. Create bbox_targets of shape (150381,4)

  16. Get the bounding box regression targets from compute_targets()/bbox_transform_inv()

  17. Calls to unmap() which don't appear to do anything if total_anchors==len(inds_inside)

    • Investigate further in #4
  18. Create blobs_out=[] and start_idx=0

  19. For each foa in foas:

    1. Get the height and width of the foa (112)
    2. end_idx=start_idx + H*W
    3. Get a subset of the labels and bbox_targets from start_idx to end_idx
    4. start_idx = end_idx
    5. Reshape labels to (1, 1, height, width)
    6. Reshape bbox_targets to shape (1, 4*A, height, width) eg. (1,4,112,112)
    7. Get stride eg. 8
    8. Calculate the number of horizontal and vertical steps 96 and 64
    9. Get number of classes eg 80
    10. inds_4d is np.where(_labels > 0), which is just 4 empty arrays in this case
      • There's a check on M which seems incorrect/useless here
    11. Store the outputs in blobs
      • retnet_cls_labels: (1,1,64,96) all 0
      • retnet_roi_bbox_targets (0,4)
      • retnet_roi_fg_bbox_locs (0,4)
  20. Looking at the 15th foa

    1. Get the height and width of the foa (56)
    2. end_idx=start_idx + H*W 131712
    3. Get a subset of the labels and bbox_targets from start_idx to end_idx
    4. start_idx = end_idx
    5. Reshape labels to (1, 1, height, width) (1,1,56,56)
    6. Reshape bbox_targets to shape (1, 4*A, height, width) eg. (1,4,56,56)
    7. Get stride eg. 16
    8. Calculate the number of horizontal and vertical steps 48 and 32
    9. Get number of classes eg 80
    10. inds_4d is np.where(_labels > 0)
      • [0,0,0,0,0]
      • [0,0,0,0,0]
      • [17,17,17,17,17]
      • [17,18,19,20,21]
      • Read them vertically: First entry is _labels[0,0,17,17]
    11. Get img_inds (always 0), y and x from inds_4d
    12. Create _roi_bbox_targets (5,4)
    13. Create _roi_fg_bbox_locs (5,4)
    14. Get each label value and decrement it (classes are 1-indexed); store [img_ind, class, y, x] in _roi_fg_bbox_locs
    15. Get the bbox regression targets for these locations and store them in _roi_bbox_targets
    16. Store the outputs in blobs
      • retnet_cls_labels: (1,1,32,48) All labeled anchor boxes
      • retnet_roi_bbox_targets (5,4) Regression boxes for positive anchor boxes
      • retnet_roi_fg_bbox_locs (5,4) Indices of anchor boxes with classes
  21. out_num_fg = np.array([57])

    • the number of positive anchor boxes + 1
  22. out_num_bg = 12021943

    • The number of background boxes?
    • (num_bg + 1) * 80 + 57 * 79
  23. Return blobs, out_num_fg, out_num_bg
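
To make steps 5-14 concrete, here is a minimal sketch of mine of the label assignment, assuming the (num_anchors, num_gt) IoU matrix from bbox_overlaps is already computed; the 0.5/0.4 thresholds are the POSITIVE_OVERLAP/NEGATIVE_OVERLAP config values:

import numpy as np

def assign_labels_sketch(overlaps, gt_classes, pos_thresh=0.5, neg_thresh=0.4):
    """overlaps: (num_anchors, num_gt) IoU matrix. gt_classes: (num_gt,) 1-indexed.
    Returns per-anchor labels: >=1 foreground, 0 background, -1 ignore."""
    num_anchors = overlaps.shape[0]
    labels = np.full((num_anchors,), -1, dtype=np.int32)

    # Best gt box for every anchor, and the corresponding IoU (steps 5-6)
    anchor_to_gt_argmax = overlaps.argmax(axis=1)
    anchor_to_gt_max = overlaps[np.arange(num_anchors), anchor_to_gt_argmax]

    # Best anchor for every gt box (steps 7-10): each gt box gets at least
    # one positive anchor even if no IoU clears pos_thresh (ties included)
    gt_to_anchor_max = overlaps.max(axis=0)
    best = np.where(overlaps == gt_to_anchor_max)[0]
    labels[best] = gt_classes[anchor_to_gt_argmax[best]]

    # Any anchor above the positive threshold is also foreground (step 11)
    fg = anchor_to_gt_max >= pos_thresh
    labels[fg] = gt_classes[anchor_to_gt_argmax[fg]]

    # Anchors below the negative threshold become background; anchors with
    # IoU between 0.4 and 0.5 stay -1 ("don't care") (steps 13-14)
    labels[anchor_to_gt_max < neg_thresh] = 0
    return labels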

JoshVarty commented 5 years ago

Summary of bbox_transform_inv

This method finds the bounding box regression targets for the anchor boxes we've deemed "foreground". It finds how far we have to shift along the x and y axes, and how much to scale the width and height, for each anchor box to match its ground truth box.

For example, given a foreground anchor box and its matched ground truth box, we would get a target like [0.2869193 , 0.06916239, 0.5155515 , 0.09623668] in (dx, dy, dw, dh) format, where dw and dh are in log space.

So for width: e^(0.5155515) * (336-222) gives ~190, which is about the ground truth width (409-216). (I omitted decimals for readability.)

bbox_transform_inv(boxes, gt_boxes, weights)

Caller | Source

Description: Inverse transform that computes target bounding-box regression deltas given proposal boxes and ground-truth boxes. The weights argument should be a 4-tuple of multiplicative weights that are applied to the regression targets.

In older versions of this code (and in py-faster-rcnn) the weights were set such that the regression deltas would have unit standard deviation on the training dataset. Presently, rather than computing these statistics exactly, we use a fixed set of weights (10, 10, 5, 5) by default. These are approximately the weights one would get from COCO using the previous unit stddev heuristic.

boxes - the 56 positive anchor boxes
gt_boxes - the 56 corresponding ground truth boxes (the 2 are duplicated as appropriate)
weights - (1,1,1,1) I'm fairly sure they don't matter here

  1. Get the ex_widths, ex_heights, ex_ctr_x and ex_ctr_y for the boxes
  2. Get the gt_widths, gt_heights, gt_ctr_x and gt_ctr_y for the ground truth boxes
  3. Find the x offset, y offset, width offset and height offset.
    • targets_dx [ 0.2869193 , 0.1488844 , 0.01084953...]
    • targets_dy [ 0.06916239, 0.06916239, 0.06916239...]
    • targets_dw [0.5155515 , 0.5155515 , 0.5155515...]
    • targets_dh [ 0.09623668, 0.09623668, 0.09623668...]
  4. Vertically stack them into targets
    • targets [[ 0.2869193 , 0.06916239, 0.5155515 , 0.09623668],...]
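
For completeness, the transform in code; this is roughly the Detectron implementation (the +1.0 reflects the inclusive pixel-coordinate convention used for boxes throughout):

import numpy as np

def bbox_transform_inv_sketch(boxes, gt_boxes, weights=(1.0, 1.0, 1.0, 1.0)):
    """Compute (dx, dy, dw, dh) regression targets mapping each anchor box
    onto its matched ground-truth box. dw and dh are in log space."""
    ex_widths = boxes[:, 2] - boxes[:, 0] + 1.0
    ex_heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ex_ctr_x = boxes[:, 0] + 0.5 * ex_widths
    ex_ctr_y = boxes[:, 1] + 0.5 * ex_heights

    gt_widths = gt_boxes[:, 2] - gt_boxes[:, 0] + 1.0
    gt_heights = gt_boxes[:, 3] - gt_boxes[:, 1] + 1.0
    gt_ctr_x = gt_boxes[:, 0] + 0.5 * gt_widths
    gt_ctr_y = gt_boxes[:, 1] + 0.5 * gt_heights

    wx, wy, ww, wh = weights
    targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = ww * np.log(gt_widths / ex_widths)
    targets_dh = wh * np.log(gt_heights / ex_heights)

    # Stack into an (N, 4) array of [dx, dy, dw, dh] rows
    return np.vstack((targets_dx, targets_dy, targets_dw, targets_dh)).transpose()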