prep_im_for_blob(im, pixel_means, target_size, max_size)
Prepare an image for use as a network input blob. Specifically:
- Subtract per-channel pixel mean
- Convert to float32
- Rescale to each of the specified target sizes (capped at max_size)

Returns a list of transformed images, one for each target size, along with the scale factors that were used to compute each returned image.
- `im` = Single input image
- `pixel_means` = np.array([[[102.9801, 115.9465, 122.7717]]]) (from config)
- `target_size` = 500 (from config)
- `max_size` = 833 (from config)
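A minimal sketch of what this does (assuming Detectron-style cv2 resizing; the function name and exact details here are illustrative):

```python
import cv2
import numpy as np

def prep_im_for_blob_sketch(im, pixel_means, target_sizes, max_size):
    """Sketch: mean-subtract, convert to float32, and rescale per target size."""
    im = im.astype(np.float32, copy=False)
    im -= pixel_means                              # subtract per-channel (BGR) mean
    im_size_min = np.min(im.shape[0:2])
    im_size_max = np.max(im.shape[0:2])
    ims, scales = [], []
    for target_size in target_sizes:
        # scale the short side to target_size, but cap the long side at max_size
        im_scale = float(target_size) / float(im_size_min)
        if np.round(im_scale * im_size_max) > max_size:
            im_scale = float(max_size) / float(im_size_max)
        ims.append(cv2.resize(im, None, None, fx=im_scale, fy=im_scale,
                              interpolation=cv2.INTER_LINEAR))
        scales.append(im_scale)
    return ims, scales
```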
im_list_to_blob(ims)
Convert a list of images into a network input. Assumes images were prepared using prep_im_for_blob or equivalent: i.e.
- BGR channel order
- pixel means subtracted
- resized to the desired input size
- float32 numpy ndarray format

Output is a 4D NCHW tensor of the images concatenated along axis 0. The spatial dimensions are padded up to a multiple of COARSEST_STRIDE (128 = 2^7, i.e. P7), and the blob is transposed from (2, 512, 768, 3) to (2, 3, 512, 768).
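A rough sketch of that packing step (padding to COARSEST_STRIDE and the NHWC to NCHW transpose are the important parts; exact details are from memory):

```python
import numpy as np

COARSEST_STRIDE = 128  # 2**7, i.e. P7

def im_list_to_blob_sketch(ims):
    """Sketch: pack prepared BGR float32 images into a single NCHW blob."""
    max_shape = np.array([im.shape for im in ims]).max(axis=0)
    # pad spatial dims up to a multiple of the coarsest FPN stride
    max_shape[0] = int(np.ceil(max_shape[0] / COARSEST_STRIDE) * COARSEST_STRIDE)
    max_shape[1] = int(np.ceil(max_shape[1] / COARSEST_STRIDE) * COARSEST_STRIDE)
    blob = np.zeros((len(ims), max_shape[0], max_shape[1], 3), dtype=np.float32)
    for i, im in enumerate(ims):
        blob[i, :im.shape[0], :im.shape[1], :] = im
    # NHWC (2, 512, 768, 3) -> NCHW (2, 3, 512, 768)
    return blob.transpose((0, 3, 1, 2))
```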
generate_anchors
This method generates a single reference anchor for every combination of pyramid level, aspect ratio and scale. We have 5 pyramid levels (P3, P4, P5, P6, P7), 3 aspect ratios (0.5, 1, 2) and 3 scales per octave (2**0, 2**(1/3), 2**(2/3)). This gives us 5*3*3 = 45 reference anchor boxes. Each reference anchor box is placed at the top-left of the image.
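For a concrete feel for the numbers, here is a small illustration (my own, not the library code) of how the per-level anchor sizes and the `scales` values used below fall out of the config values in this writeup (anchor_scale = 4, strides 8..128, 3 scales per octave):

```python
anchor_scale = 4                  # RETINANET.ANCHOR_SCALE (from config)
scales_per_octave = 3
strides = [8, 16, 32, 64, 128]    # P3..P7

for stride in strides:
    for octave in range(scales_per_octave):
        octave_scale = 2 ** (octave / float(scales_per_octave))
        anchor_size = stride * anchor_scale * octave_scale
        # for stride 8 this prints sizes ~32, ~40.3, ~50.8 and the scale
        # passed to _generate_anchors (size / stride): 4, 5.04, 6.35
        print(stride, round(anchor_size, 1), round(anchor_size / stride, 3))
```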
_generate_anchors(base_size, scales, aspect_ratios)
def _generate_anchors(base_size, scales, aspect_ratios):
    """Generate anchor (reference) windows by enumerating aspect ratios X
    scales wrt a reference (0, 0, base_size - 1, base_size - 1) window.
    """
    anchor = np.array([1, 1, base_size, base_size], dtype=np.float) - 1
    anchors = _ratio_enum(anchor, aspect_ratios)
    anchors = np.vstack(
        [_scale_enum(anchors[i, :], scales) for i in range(anchors.shape[0])]
    )
    return anchors
- `base_size` = 8 (believe this represents P3)
- `scales` = np.array([4]) (calculated by taking the size, 32, and dividing it by the stride, 8)
- `aspect_ratios` = 1

We start with an anchor of [0, 0, 7, 7]. `_ratio_enum` with ratio 1 leaves it at [0, 0, 7, 7], and `_scale_enum` turns it into [-12, -12, 19, 19]. This gives us an area of about 32x32.
- `base_size` = 8 (believe this represents P3)
- `scales` = np.array([5.039]) (calculated by taking the size, ~40.3, and dividing it by the stride, 8)
- `aspect_ratios` = 1

Again we start with an anchor of [0, 0, 7, 7]. `_ratio_enum` with ratio 1 leaves it at [0, 0, 7, 7], and `_scale_enum` turns it into roughly [-16.2, -16.2, 23.2, 23.2]. This gives us an area of about 40x40 (i.e. 32 * 2^(1/3)).
_ratio_enum(anchor, ratios)
def _ratio_enum(anchor, ratios):
    """Enumerate a set of anchors for each aspect ratio wrt an anchor."""
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    size = w * h
    size_ratios = size / ratios
    ws = np.round(np.sqrt(size_ratios))
    hs = np.round(ws * ratios)
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors
- `anchor` = array([0., 0., 7., 7.])
- `ratios` = 1.0

First, convert from xxyy to whctrs: w=8, h=8, x_ctr=3.5, y_ctr=3.5. `size = w * h` gives 64, and `size_ratios = size / ratios` is also 64 (the ratio is 1 in this case), so `ws = hs = 8` in this case. Call `_mkanchors` and get back array([[0., 0., 7., 7.]]).
- `anchor` = array([0., 0., 7., 7.])
- `ratios` = 0.5

First, convert from xxyy to whctrs: w=8, h=8, x_ctr=3.5, y_ctr=3.5. `size = w * h` gives 64, and `size_ratios = size / ratios` (64 / 0.5 in this case) gives us 128. `ws = np.round(np.sqrt(128))` gives 11 and `hs = np.round(ws * ratios)` gives 6. Call `_mkanchors` and get array([[-1.5, 1., 8.5, 6.]]).
_scale_enum(anchor, ratios)
def _scale_enum(anchor, scales):
    """Enumerate a set of anchors for each scale wrt an anchor."""
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    ws = w * scales
    hs = h * scales
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors
- `anchor` = array([0., 0., 7., 7.])
- `scales` = np.array([4])

First, convert from xxyy to whctrs: w=8, h=8, x_ctr=3.5, y_ctr=3.5. Multiply w and h by scales (4), which gives us ws=hs=32. Call `_mkanchors` and return [-12, -12, 19, 19].
- `anchor` = array([-1.5, 1., 8.5, 6.])
- `scales` = np.array([5.04])

First, convert from xxyy to whctrs: w=11, h=6, x_ctr=3.5, y_ctr=3.5. Multiply w and h by scales (5.04), which gives us ws=55.4 and hs=30.2. Call `_mkanchors` and return array([[-23.71, -11.11, 30.71, 18.11]]).
_mkanchors
def _mkanchors(ws, hs, x_ctr, y_ctr):
    """Given a vector of widths (ws) and heights (hs) around a center
    (x_ctr, y_ctr), output a set of anchors (windows).
    """
    ws = ws[:, np.newaxis]
    hs = hs[:, np.newaxis]
    anchors = np.hstack(
        (
            x_ctr - 0.5 * (ws - 1),
            y_ctr - 0.5 * (hs - 1),
            x_ctr + 0.5 * (ws - 1),
            y_ctr + 0.5 * (hs - 1)
        )
    )
    return anchors
- `ws` = np.array([8])
- `hs` = np.array([8])
- `x_ctr` = 3.5
- `y_ctr` = 3.5

Add a new axis to `ws` and `hs`, taking them from (1,) to (1,1). Then compute the xxyy coordinates and return them in an array: array([[0., 0., 7., 7.]]).
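`_whctrs` is used throughout but isn't shown above; it is essentially the inverse of `_mkanchors`. Roughly (paraphrasing the py-faster-rcnn/Detectron helper):

```python
def _whctrs(anchor):
    """Return width, height, x center, and y center for an anchor (window)."""
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    x_ctr = anchor[0] + 0.5 * (w - 1)
    y_ctr = anchor[1] + 0.5 * (h - 1)
    return w, h, x_ctr, y_ctr
```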
get_field_of_anchors
We are provided with a `stride`, an `anchor_size` and an `anchor_aspect_ratio`. We first generate a single reference anchor box for the top-left of the image. Then we use the maximum size of any possible image (896x896) to create a grid of x1y1x2y2 coordinates representing each possible position an anchor could be placed. Finally we use our reference anchor box to give the anchor at each position the appropriate shape and size.
get_field_of_anchors(stride, anchor_sizes, anchor_aspect_ratios, octave, aspect)
- `stride` = 8
- `anchor_sizes` = 32
- `anchor_aspect_ratios` = 1.0
- `octave` = 0
- `aspect` = 0
Generate a reference anchor (see above): [-12, -12, 19, 19]. Get the max size (896). `field_size = fpn_max_size / stride`, or 896/8, which is 112. `shifts = np.arange(0, field_size) * stride` gives us 112 entries: [0, 8, 16, ... 880, 888].

`shift_x, shift_y = np.meshgrid(shifts, shifts)`: `shift_x` is (112,112) and each row is [0, 8, ... 880, 888]; `shift_y` is (112,112) and looks like [[0,0,...,0,0], [8,8,...,8,8], ..., [888,888,...,888]]. `ravel()` both into single arrays of size (12544,). `shifts = np.vstack((shift_x, shift_y, shift_x, shift_y)).transpose()` is then (12544,4); each of the 12544 entries has the shape x1,y1,x2,y2: [[0,0,0,0], [8,0,8,0], [16,0,16,0], ..., [880,888,880,888], [888,888,888,888]].
Broadcast anchors over shifts to enumerate all anchors at all positions in the (H, W) grid:
- add A cell anchors of shape (1, A, 4) to
- K shifts of shape (K, 1, 4) to get
- all shifted anchors of shape (K, A, 4)
- reshape to (K*A, 4) shifted anchors

`A = num_cell_anchors` gives 1 (since we're only working with a single anchor at a time). `K = shifts.shape[0]` gives 12544. `field_of_anchors = (cell_anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2)))` adds the reference anchor [-12, -12, 19, 19] to each shift [0,0,0,0], [8,0,8,0], etc. They are all of the shape x1y1x2y2, giving a (12544,1,4) array. Note that this code was written to handle multiple anchors, but we're only working with one at a time. `field_of_anchors = field_of_anchors.reshape((K * A, 4))` is (12544,4) since A=1. Pack into a foa and return; a condensed sketch of the whole routine follows below.
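A condensed sketch of the field-of-anchors construction just described (names follow the walkthrough; fpn_max_size = 896, a single reference anchor):

```python
import numpy as np

def field_of_anchors_sketch(cell_anchor, stride=8, fpn_max_size=896):
    """Sketch: tile one reference anchor over every stride-spaced grid position."""
    field_size = int(np.ceil(fpn_max_size / float(stride)))    # 112
    shifts = np.arange(0, field_size) * stride                  # [0, 8, ..., 888]
    shift_x, shift_y = np.meshgrid(shifts, shifts)
    shift_x, shift_y = shift_x.ravel(), shift_y.ravel()
    shifts = np.vstack((shift_x, shift_y, shift_x, shift_y)).transpose()  # (12544, 4)

    A = 1                                # one cell anchor at a time in this walkthrough
    K = shifts.shape[0]                  # 12544
    cell_anchors = np.array(cell_anchor, dtype=np.float32).reshape((1, A, 4))
    field_of_anchors = cell_anchors + shifts.reshape((1, K, 4)).transpose((1, 0, 2))
    return field_of_anchors.reshape((K * A, 4))                 # (12544, 4)

# field_of_anchors_sketch([-12, -12, 19, 19]) starts with
# [-12, -12, 19, 19], [-4, -12, 27, 19], [4, -12, 35, 19], ...
```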
add_retinanet_blobs(blobs, im_scales, roidb, image_width, image_height)
`blobs` contains various inputs to our RetinaNet, including:

- `data`: A batch of input images. (2,3,512,768)
- `im_info`: The size of each image and the scale by which it was resized
- `retnet_fg_num`, `retnet_bg_num`: Counts of foreground/background anchors (used to normalize the losses)
- `retnet_cls_labels_fpn` (3-7): (N,A,H,W)
- `retnet_roi_bbox_targets_fpn` (3-7): (M,4). Out of all the anchors generated, depending on the positive/negative IoU overlap thresholds, we will have M positive anchors. These are the anchors that the bounding box branch will regress on.
- `retnet_roi_fg_bbox_locs_fpn` (3-7): For the bbox regression: since we only regress on fg bboxes, which are far fewer than the network's output predictions of shape N x (A * 4) x H x W (in the case of non class-specific bbox), we store the locations of the positive fg boxes in `retnet_roi_fg_bbox_locs`, of shape M x 4, where each row looks like: [img_id, anchor_id, x_loc, y_loc]
- `im_scales`: How much each image has been scaled by.
- `roidb`: The full information on each of the input images (Region of Interest DB)
- `image_width` = 768
- `image_height` = 512
- `k_max` = 7 and `k_min` = 3
- `scales_per_octave` = 3
- `num_aspect_ratios` = 3
- `aspect_ratios` = [1.0, 2.0, 0.5]
- `anchor_scale` = 4
Calling `get_field_of_anchors` for every combination gives 9 field-of-anchors of shape (12544,4) for P3, 9 of (3136,4) for P4, 9 of (784,4) for P5, 9 of (196,4) for P6 and 9 of (49,4) for P7. Stacked together they form `all_anchors` of shape (150381,4).
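A quick check of that total:

```python
field_sizes = [112, 56, 28, 14, 7]            # 896 / stride for strides 8..128
print(9 * sum(s * s for s in field_sizes))    # 150381
```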
For each entry in `roidb`, we look up the scale by which the image has been resized and scale its ground truth boxes accordingly (the relevant sizes here are 500 and 749), build `im_info` (see: #3), and call `_get_retinanet_blobs()`.

`retinanet_blobs` contains 45 entries of:

- `retnet_cls_labels`: The anchor boxes with classes
- `retnet_roi_bbox_targets`: The anchor box targets for positive classes
- `retnet_roi_fg_bbox_locs`: Indices of positive classes/bboxes

_get_retinanet_blobs(foas, all_anchors, gt_boxes, gt_classes, im_width, im_height)
- `foas`: 45 field of anchors
- `all_anchors`: Raw numpy array with all field of anchors stacked. Shape (150381,4)
- `gt_boxes`: Scaled ground truth boxes for this image
- `gt_classes`: Ground truth classes for this image
- `im_width`: 768
- `im_height`: 512
`inds_inside` = [0, 1, ..., 150379, 150380], shape (150381,). `num_inside` = 150381. Create `labels` of size (150381,) filled with -1. (1 or more is positive, 0 is negative, -1 is ignore.)
Compute the overlap between all anchor boxes and ground truth boxes (uses Cython): (150381,2). Map each anchor to the class with the highest overlap: (150381,). Keep track of the actual overlap for the max class.

For each ground truth box, find the anchor with the most overlap. Since we have two ground truth boxes, our results look like: array([146001, 138954]).

- `anchor_by_gt_overlap[146001]` = array([0.7544641, 0.25800186], dtype=float32)
- `anchor_by_gt_overlap[138954]` = array([0.20713282, 0.6786918], dtype=float32)

So the best overlaps we could find are only 0.75 and 0.67. That feels bad to me. An interesting experiment might be to choose anchor boxes (scales, ratios etc.) by the maximum values we can find across our dataset.

Get the maximum amounts of overlap: array([0.7544641, 0.6786918]). Get the indices of the anchors that have these max overlap values (because there may be ties), and set their labels (previously all -1) to the class of the matching gt box. Then set the labels to the class value for any anchor with overlap more than 0.5. (I assume the argmax step exists so that even if no anchor reaches 0.5 we will still have at least one anchor box per gt box?)
Set `fg_inds` to the indices of the labels >= 1 (non-background/non-ignored): 56 elements. Set `bg_inds` to the indices where a box's greatest overlap is less than NEGATIVE_OVERLAP (0.4): 150217 elements. Set `labels` to 0 for all `bg_inds`. That leaves 150217 background, 108 don't care (overlap between 0.4 and 0.5) and 56 positive class IDs.
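A compact sketch of the labeling rule just described (per-gt argmax, foreground above 0.5, background below 0.4; everything else ignored):

```python
import numpy as np

def label_anchors_sketch(anchor_by_gt_overlap, gt_classes, fg_thresh=0.5, bg_thresh=0.4):
    """Sketch: anchor_by_gt_overlap is a (num_anchors, num_gt) IoU matrix,
    gt_classes is (num_gt,). Returns labels: -1 = ignore, 0 = background, >= 1 = class id."""
    gt_classes = np.asarray(gt_classes)
    num_anchors = anchor_by_gt_overlap.shape[0]
    labels = np.full((num_anchors,), -1, dtype=np.int32)

    # per-anchor best gt box and its overlap
    argmax = anchor_by_gt_overlap.argmax(axis=1)
    max_overlaps = anchor_by_gt_overlap[np.arange(num_anchors), argmax]

    # 1) every gt keeps its best-overlapping anchor(s), even below the fg threshold
    gt_max = anchor_by_gt_overlap.max(axis=0)
    best_for_gt = np.where(anchor_by_gt_overlap == gt_max)[0]
    labels[best_for_gt] = gt_classes[argmax[best_for_gt]]

    # 2) anchors above the fg threshold take the class of their best gt
    fg = max_overlaps >= fg_thresh
    labels[fg] = gt_classes[argmax[fg]]

    # 3) anchors below the bg threshold become background; the rest stay "ignore"
    labels[max_overlaps < bg_thresh] = 0
    return labels
```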
Create `bbox_targets` of shape (150381,4). Get the bounding box regression targets from `compute_targets()`/`bbox_transform_inv()`. There are calls to `unmap()` which don't appear to do anything when `total_anchors == len(inds_inside)`. Create `blobs_out = []` and `start_idx = 0`.

For each `foa` in `foas`:
Looking at the first foa (field_size is 112):

- `end_idx = start_idx + H*W`
- Slice `labels` and `bbox_targets` from `start_idx` to `end_idx`
- `start_idx = end_idx`
- Reshape `labels` to (1, 1, height, width)
- Reshape `bbox_targets` to shape (1, 4*A, height, width), e.g. (1,4,112,112)
- The stride is 8, so w and h (im_width/stride and im_height/stride) are 96 and 64
- `num_classes` is 80
- `inds_4d` is `np.where(_labels > 0)`, which is just 4 empty arrays in this case?
- `M = len(inds_4d)`, which seems incorrect/useless here
- `retnet_cls_labels`: (1,1,64,96), all 0
- `retnet_roi_bbox_targets`: (0,4)
- `retnet_roi_fg_bbox_locs`: (0,4)
Looking at the 15th foa (field_size is 56):

- `end_idx = start_idx + H*W` = 131712
- Slice `labels` and `bbox_targets` from `start_idx` to `end_idx`
- `start_idx = end_idx`
- Reshape `labels` to (1, 1, height, width): (1,1,56,56)
- Reshape `bbox_targets` to shape (1, 4*A, height, width), e.g. (1,4,56,56)
- The stride is 16, so w and h are 48 and 32
- `num_classes` is 80
- `inds_4d` is `np.where(_labels > 0)`, which gives [0,0,0,0,0], [0,0,0,0,0], [17,17,17,17,17], [17,18,19,20,21], i.e. the positive labels sit at `_labels[0,0,17,17]` through `_labels[0,0,17,21]`
- Take `img_inds` (always 0), `y` and `x` from `inds_4d`
- Create and fill `_roi_bbox_targets` (5,4) and `_roi_fg_bbox_locs` (5,4)
- `retnet_cls_labels`: (1,1,32,48). All labeled anchor boxes
- `retnet_roi_bbox_targets`: (5,4). Regression targets for positive anchor boxes
- `retnet_roi_fg_bbox_locs`: (5,4). Indices of anchor boxes with classes
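A condensed sketch of what that per-foa loop body does for one level (based on the walkthrough above; config handling and the class-specific-bbox case are omitted):

```python
import numpy as np

def pack_foa_sketch(_labels, _bbox_targets, field_size, stride, im_width, im_height):
    """Sketch: reshape one level's labels/targets and collect the fg anchor locations."""
    H = W = field_size
    _labels = _labels.reshape((1, 1, H, W))
    _bbox_targets = _bbox_targets.reshape((1, H, W, 4)).transpose((0, 3, 1, 2))
    w, h = int(im_width / stride), int(im_height / stride)

    inds_4d = np.where(_labels > 0)
    im_inds, y, x = inds_4d[0], inds_4d[2], inds_4d[3]
    roi_bbox_targets = np.zeros((len(im_inds), 4), dtype=np.float32)
    roi_fg_bbox_locs = np.zeros((len(im_inds), 4), dtype=np.float32)
    for i in range(len(im_inds)):
        roi_bbox_targets[i, :] = _bbox_targets[im_inds[i], :, y[i], x[i]]
        # location of this fg anchor: [img_id, anchor/class id, spatial location]
        roi_fg_bbox_locs[i, :] = [im_inds[i], 0, y[i], x[i]]

    return dict(
        retnet_cls_labels=_labels[:, :, 0:h, 0:w].astype(np.int32),
        retnet_roi_bbox_targets=roi_bbox_targets,
        retnet_roi_fg_bbox_locs=roi_fg_bbox_locs,
    )
```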
`out_num_fg` = np.array([57]) (the 56 fg anchors + 1). `out_num_bg` = 12021943, which is (num_bg + 1) * 80 + 57 * 79. Return `blobs`, `out_num_fg`, `out_num_bg`.
This method (`compute_targets`) finds the bounding box regression targets for the anchor boxes we've deemed "foreground". It finds how far we have to shift along the x and y axes and how much to scale the height/width for the boxes to match.

For example, if we have an anchor box [222, 249, 336, 309] in x1y1x2y2 format and a ground truth box [216, 250, 409, 316] in x1y1x2y2 format, we get a target of [0.2869193, 0.06916239, 0.5155515, 0.09623668] in x, y, e^w, e^h format.

So for width: e^(0.5155515) * (336-222) gives ~190, which is about (409-216), though I omitted decimals for readability.
bbox_transform_inv(boxes, gt_boxes, weights)
Description: Inverse transform that computes target bounding-box regression deltas given proposal boxes and ground-truth boxes. The weights argument should be a 4-tuple of multiplicative weights that are applied to the regression target.

In older versions of this code (and in py-faster-rcnn) the weights were set such that the regression deltas would have unit standard deviation on the training dataset. Presently, rather than computing these statistics exactly, we use a fixed set of weights (10, 10, 5, 5) by default. These are approximately the weights one would get from COCO using the previous unit stddev heuristic.

- `boxes`: The 56 positive anchor boxes.
- `gt_boxes`: The 56 corresponding ground truth boxes. (The 2 gt boxes are duplicated as appropriate.)
- `weights`: (1, 1, 1, 1). I'm fairly sure they don't matter here.
Compute `ex_widths`, `ex_heights`, `ex_ctr_x` and `ex_ctr_y` for the boxes, and `gt_widths`, `gt_heights`, `gt_ctr_x` and `gt_ctr_y` for the ground truth boxes. Then:

- `targets_dx` = [0.2869193, 0.1488844, 0.01084953, ...]
- `targets_dy` = [0.06916239, 0.06916239, 0.06916239, ...]
- `targets_dw` = [0.5155515, 0.5155515, 0.5155515, ...]
- `targets_dh` = [0.09623668, 0.09623668, 0.09623668, ...]

Stack these into `targets`: [[0.2869193, 0.06916239, 0.5155515, 0.09623668], ...]
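The delta computation itself is short; a sketch (with unit weights), applied to the rounded example boxes from earlier:

```python
import numpy as np

def bbox_deltas_sketch(boxes, gt_boxes, weights=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of the (dx, dy, dw, dh) target computation described above."""
    wx, wy, ww, wh = weights
    ex_widths = boxes[:, 2] - boxes[:, 0] + 1.0
    ex_heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ex_ctr_x = boxes[:, 0] + 0.5 * ex_widths
    ex_ctr_y = boxes[:, 1] + 0.5 * ex_heights

    gt_widths = gt_boxes[:, 2] - gt_boxes[:, 0] + 1.0
    gt_heights = gt_boxes[:, 3] - gt_boxes[:, 1] + 1.0
    gt_ctr_x = gt_boxes[:, 0] + 0.5 * gt_widths
    gt_ctr_y = gt_boxes[:, 1] + 0.5 * gt_heights

    targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = ww * np.log(gt_widths / ex_widths)
    targets_dh = wh * np.log(gt_heights / ex_heights)
    return np.vstack((targets_dx, targets_dy, targets_dw, targets_dh)).transpose()

boxes = np.array([[222., 249., 336., 309.]])      # anchor box from the example
gt = np.array([[216., 250., 409., 316.]])         # matching ground truth box
print(bbox_deltas_sketch(boxes, gt))
# -> roughly [[0.291, 0.066, 0.523, 0.094]]; close to the
#    [0.2869, 0.0692, 0.5156, 0.0962] above (the small differences come from
#    the decimals that were dropped from the box coordinates for readability)
```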
The primary codepath starts a number of threads that load images from disk in minibatches.
The minibatch loader codepath is much smaller, but the individual functions are often more involved and not always immediately clear.