fizyr / keras-retinanet

Keras implementation of RetinaNet object detection.

Computing the offset of anchors #1073

Open jonilaserson opened 5 years ago

jonilaserson commented 5 years ago

I'm having trouble understanding the computation done at the beginning of utils.anchors.shift():

def shift(shape, stride, anchors):
    # shape  : Shape to shift the anchors over.
    # stride : Stride to shift the anchors with over the shape.
    shift_x = (np.arange(0, shape[1]) + 0.5) * stride
    shift_y = (np.arange(0, shape[0]) + 0.5) * stride

The goal of this method is to tile the prototype-anchors over a grid of the given shape, and offset the x1,y1,x2,y2 coordinates of each anchor so they refer to the coordinate system of the original image.

According to this computation, the center of the top-left anchors is (stride/2, stride/2) regardless of the given shape or the shape of the original image (and from there the anchors are shifted stride pixels apart in every direction). This seems wrong to me.
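For concreteness, here is a minimal numpy sketch of what the current code computes for an example feature level (say a 32x32 P3 with stride 8; the values are only illustrative):

import numpy as np

shape, stride = (32, 32), 8  # example P3 feature shape and stride

# same computation as in shift() above
shift_x = (np.arange(0, shape[1]) + 0.5) * stride
shift_y = (np.arange(0, shape[0]) + 0.5) * stride

# anchor centers start at (stride/2, stride/2) and step by `stride`,
# independently of the original image size
print(shift_x[:3])  # [ 4. 12. 20.]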

As an example, assume our backbone is just a single 7x7 Conv2D layer, applied to an image of size 200x200 with stride 1 and no padding (i.e. 'valid'). In this case the output of the backbone is 194x194 (with stride 1), and the center of the top-left anchors should be (3,3) (because this is the center of a 7x7 window stacked to the top-left of an image). This is not close to (stride/2, stride/2)=(0.5, 0.5).

If the padding adds 3 zeros in each direction and uses a stride of 2 (this is the common first conv layer in ResNet50), then the output is going to be 100x100, and the center of the top-left anchors should be (0,0) (and not (1,1)). Even if the stride were larger than 2, the location of the top-left anchors would still be (0,0), and even further away from (stride/2, stride/2).

I would argue that in order to know the correct offset of the top-left anchors, you need to know the shape of the original image, not just the layer. Then, if H, W = image.shape and h, w = layer.shape, the offset should be:

top_left_offset =  (H - (h - 1) * stride - 1) // 2.0, (W - (w - 1) * stride - 1) // 2.0
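For illustration, here is this suggestion wrapped in a small helper (the wrapper itself is just a sketch of mine), checked against the two examples above:

def top_left_offset(image_shape, layer_shape, stride):
    H, W = image_shape
    h, w = layer_shape
    return (H - (h - 1) * stride - 1) // 2.0, (W - (w - 1) * stride - 1) // 2.0

# single 7x7 'valid' conv with stride 1: 200x200 image -> 194x194 feature map
print(top_left_offset((200, 200), (194, 194), 1))  # (3.0, 3.0)

# 7x7 conv with stride 2 and padding 3 (ResNet50-style): 200x200 -> 100x100
print(top_left_offset((200, 200), (100, 100), 2))  # (0.0, 0.0)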

What do you think?

hgaiser commented 5 years ago

Hmm I like these issues and I think you're on to something :p

Let me explain what I remember about anchors and feature maps: each feature "pixel" gets assigned the set of base anchors. Each of these feature pixels has a certain view of the original image. The base anchor (no scaling / ratio modification) should be identical to this view, so the center of the feature pixel's view on the input image and the center of the anchor should coincide. This is derived from my understanding of how anchors should work, not from any definition or documentation.

The thought behind the current implementation is that for P3, each feature pixel has a view of the original image of size 32x32 pixels, which strides 8 pixels for each feature pixel. I think there is no argument here right? So if you go from feature pixel (0, 0) to (0, 1) (where this is (x, y)), then the current code assumes the base anchor moves from (-4, -4, 28, 28) to (4, 4, 36, 36).

Let's look at a real world example where the input image is 256x256. P3 will be sized 32x32. To simplify the problem, let's look at one row. The entire width of the view of this single row is 32 + 31 * 8 = 280, since we move 8 pixels for each feature pixel. That means we have a "border" around the original image of 280 - 256 = 24 pixels. In other words, we should shift the anchors by 24 / 2 = 12 pixels.

As in your example:

top_left_offset =  (H - (h - 1) * stride - 1) // 2.0, (W - (w - 1) * stride - 1) // 2.0

Filling in these values we would get:

top_left_offset =  (256 - (32 - 1) * 8 - 1) // 2.0, (256 - (32 - 1) * 8 - 1) // 2.0 = (3, 3)

That is incorrect though, according to the above logic. I would argue that it should be:

top_left_offset =  (H - (size + (h - 1) * stride)) / 2.0, (W - (size + (w - 1) * stride)) / 2.0

Where size is the base size of an anchor (32 in the example above).

Small side note: I removed the double slashes because I don't want to compute them as integers yet, since it just ever so slightly changes the results when you start working with different scales and ratios. I prefer to cast to int at the end.

jonilaserson commented 5 years ago

The thought behind the current implementation is that for P3, each feature pixel has a view of the original image of size 32x32 pixels, which strides 8 pixels for each feature pixel. I think there is no argument here right? So if you go from feature pixel (0, 0) to (0, 1) (where this is (x, y)), then the current code assumes the base anchor moves from (-4, -4, 28, 28) to (4, 4, 36, 36).

The current code assumes that the center of the top left anchor is (4,4). If the receptive field of P3 is indeed 32x32 (I haven't checked) then it means that the base anchor will move from (-12, -12, 20, 20) to (-4, -4, 28, 28). [nitpicking: we're moving from feature pixel (0,0) to (1,1)]

Let's look at a real world example where the input image is 256x256. P3 will be sized 32x32. To simplify the problem, let's look at one row. The entire width of the view of this single row is 32 + 31 * 8 = 280, since we move 8 pixels for each feature pixel. That means we have a "border" around the original image of 280 - 256 = 24 pixels.

I agree so far. This logic implies that we padded the original image with 12 zeros from each side.

In other words, we should shift the anchors by 24 / 2 = 12 pixels.

I disagree here. The center of the first receptive field is 16 pixels from the edge of the padded image, so 4 pixels from the beginning of the original image. Hence the top_left anchor center should be positioned at (4,4), which is indeed (stride/2, stride/2) in this case. So the current code is correct for this specific case.

The small difference between my computation (which results in 3.5 when using / instead of //) and this is due to rounding issues related to the exact location of a point (x,y): is it in the "center" of the [x,y] pixel or at its top-left? Your computation assumes that it is at the top-left of the pixel, and I agree with you that it makes more sense. So adapting my computation to this, it becomes:

top_left_offset =  (H - (h - 1) * stride) / 2.0, (W - (w - 1) * stride) / 2.0

And then it is consistent with the above use case. It is also consistent with the 7x7 window example above (the center of a 7x7 window stacked to the top-left is now (3.5, 3.5)).
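A quick check of this adapted formula against the cases discussed so far (the helper name is mine):

def top_left_center(image_shape, layer_shape, stride):
    H, W = image_shape
    h, w = layer_shape
    return (H - (h - 1) * stride) / 2.0, (W - (w - 1) * stride) / 2.0

# 256x256 image, P3 of shape 32x32 with stride 8: equals (stride/2, stride/2)
print(top_left_center((256, 256), (32, 32), 8))    # (4.0, 4.0)

# 200x200 image, single 7x7 'valid' conv with stride 1 -> 194x194
print(top_left_center((200, 200), (194, 194), 1))  # (3.5, 3.5)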

Your computation, using the size of the receptive field, can also work, if you fix it a bit to:

padding_on_top = ((size + (h - 1) * stride) - H) / 2.0
top_offset =   size / 2.0 - padding_on_top

And once you plug the first row into the second, you'll get my result.
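A quick symbolic check that the two formulations agree (a sketch assuming sympy is available; not part of the repo):

import sympy as sp

H, h, stride, size = sp.symbols('H h stride size')

padding_on_top = ((size + (h - 1) * stride) - H) / 2
top_offset = size / 2 - padding_on_top

# reduces to (H - (h - 1) * stride) / 2, the same formula as above
print(sp.simplify(top_offset - (H - (h - 1) * stride) / 2))  # 0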

hgaiser commented 5 years ago

then it means that the base anchor will move from (-12, -12, 20, 20) to (-4, -4, 28, 28)

Yes, you're absolutely right, I made a mistake there.

[nitpicking: we're moving from feature pixel (0,0) to (1,1)]

Haha yes, you're correct (and being nitpicky is important in these situations ;))

I disagree here. The center of the first receptive field is 16 pixels from the edge of the padded image, so 4 pixels from the beginning of the original image. Hence the top_left anchor center should be positioned at (4,4), which is indeed (stride/2, stride/2) in this case. So the current code is correct for this specific case.

Hmm I think we're talking about the same thing, but I was talking about shifting anchors, you're talking about where the center should be. If the anchor without any modifications is (0, 0, 32, 32) then we should shift by -12 pixels, which moves the center indeed to (4, 4). In the end you get the same anchors, it's just a different approach.

I didn't check before how we generated the base anchors. We center them around (0, 0), so it makes more sense to compute where the center should be, you're right there.

So if I would summarize this: our approach works in cases where the image size is a power of 2, but when it is not, it computes the wrong offset. I'll see if I can work up a PR today, could you review it when it is there?

Thank you so much for this contribution, this kind of feedback is our main reason to have these algorithms open source.

Also, thank you, I forgot the term receptive field ;p

jonilaserson commented 5 years ago

So if I would summarize this: our approach works in cases where the image size is a power of 2, but when it is not, it computes the wrong offset. I'll see if I can work up a PR today, could you review it when it is there?

I think that you could have the wrong offset even when the image is a power of 2, because the offset of the anchor depends on other parameters, mainly the amount of padding added by every layer. The cool part is that if you are using the computation suggested above you don't need to know anything about the padding or the receptive field. Just the stride, the layer shape, and the shape of the input image.
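To illustrate, a minimal sketch of a shift()-like function that takes the input image shape into account (this is only my reading of the idea, not the actual patch in the PR; the function name is made up):

import numpy as np

def shift_with_image_shape(image_shape, feature_shape, stride, anchors):
    # image_shape   : (H, W) of the network input image.
    # feature_shape : (h, w) of the feature level.
    # stride        : stride of the feature level w.r.t. the input image.
    # anchors       : (A, 4) base anchors centered at (0, 0).
    H, W = image_shape[:2]
    h, w = feature_shape[:2]

    # center of the top-left feature pixel, in image coordinates
    offset_y = (H - (h - 1) * stride) / 2.0
    offset_x = (W - (w - 1) * stride) / 2.0

    shift_x = offset_x + np.arange(w) * stride
    shift_y = offset_y + np.arange(h) * stride

    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)

    # add every shift to every base anchor: result is (h * w * A, 4)
    all_anchors = anchors[np.newaxis, :, :] + shifts[:, np.newaxis, :]
    return all_anchors.reshape((-1, 4))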

Hmm I think we're talking about the same thing, but I was talking about shifting anchors, you're talking about where the center should be. If the anchor without any modifications is (0, 0, 32, 32) then we should shift by -12 pixels, which moves the center indeed to (4, 4). In the end you get the same anchors, it's just a different approach.

I see your point. It depends on how the anchors come out of generate_anchors(). From looking at the code, they come out centered at (0,0):

    # initialize output anchors
    anchors = np.zeros((num_anchors, 4))
    ...
    # transform from (x_ctr, y_ctr, w, h) -> (x1, y1, x2, y2)
    anchors[:, 0::2] -= np.tile(anchors[:, 2] * 0.5, (2, 1)).T
    anchors[:, 1::2] -= np.tile(anchors[:, 3] * 0.5, (2, 1)).T

So they need to be offset by 4 in each direction.
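Concretely, the base 32x32 anchor (ratio=1, scale=1) comes out of generate_anchors() as (-16, -16, 16, 16), and adding the (4, 4) offset gives the anchor discussed above (a small sketch, not library code):

import numpy as np

# base 32x32 anchor centered at (0, 0), as produced by the code above
base_anchor = np.array([-16.0, -16.0, 16.0, 16.0])

# offset the (x1, y1, x2, y2) coordinates by the top-left center (4, 4)
print(base_anchor + np.array([4.0, 4.0, 4.0, 4.0]))  # [-12. -12.  20.  20.]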

Thank you so much for this contribution, this kind of feedback is our main reason to have these algorithms open source.

Also, thank you, I forgot the term receptive field ;p

My pleasure! I will look at the MR too.

jonilaserson commented 5 years ago

Also, it may well be that the current code is correct for ResNet50 (which makes sense, since many people have used it successfully), but this computation could potentially explain cases where it fails with other backbones.

nikostsagk commented 5 years ago

According to this computation, the center of the top-left anchors is (stride/2, stride/2) regardless of the given shape or the shape of the original image (and from there the anchors are shifted stride pixels apart in every direction). This seems wrong to me.

This stride is a result of the number of times the original input has been subsampled. In other words, it is how many pixels of the original input a pixel from the feature map covers.

Let's take an image with size 800x800, P3 output is going to be 100x100 according to code. This is 8 times smaller than the original input. P3's output is a spatial map where each pixel covers an area of 8x8 pixels. That is, the top-left pixel of this feature map (location (0,0)) covers an area in the original input from (0,0) to (7,7).

Now, if we had to represent this pixel on the feature map with an anchor of ratio=1, scale=1 and size=32, I believe you would all agree that this anchor would have its top-left corner at (4,4) - (16,16) = (-12,-12) and its bottom-right corner at (4,4) + (16,16) = (20,20), as it was before the new modifications.

So, the proposed anchors with the aforementioned ratio, scale, size and stride go along the x-axis as: [-12. -12. 20. 20.], [-4. -12. 28. 20.], [4. -12. 36. 20.], [12. -12. 44. 20.], where these numbers refer to the original input image.

As an example, assume our backbone is just a single 7x7 Conv2D layer, applied to an image of size 200x200 with stride 1 and no padding (i.e. 'valid'). In this case the output of the backbone is 194x194 (with stride 1), and the center of the top-left anchors should be (3,3) (because this is the center of a 7x7 window stacked to the top-left of an image).

I am not sure why the top-left anchor should be (3,3). The convolutions that the pyramid levels undergo (in function default_regression_model) have kernel=3, stride=1 and 'same' padding; this means that when the first convolution happens, the center of the kernel coincides with the top-left pixel of the pyramid output.

I believe that the way anchors were calculated before was correct. What I question, however, is how the output shape of the pyramid levels is calculated here:

image_shape = np.array(image_shape[:2])
image_shapes = [(image_shape + 2 ** x - 1) // (2 ** x) for x in pyramid_levels]
return image_shapes

Feeding the example image in the repo with size 800x1067, gives pyramid outputs of: [(100x134), (50x67), (25x34), (13x17), (7x9)]

and the model itself with this code:

import numpy as np
import keras
from keras_retinanet import models
from keras_retinanet.utils.image import read_image_bgr, preprocess_image, resize_image

model = models.load_model('path/to/a/model', backbone_name='vgg16')
img = read_image_bgr('000000008021.jpg')
img, scale = resize_image(preprocess_image(img))

# add a batch dimension (read_image_bgr already returns a numpy array)
img_tensor = np.expand_dims(img, axis=0)

for p in ['P3', 'P4', 'P5', 'P6', 'P7']:
    sub_model = keras.models.Model(inputs=model.input, outputs=model.get_layer(p).output)
    features = sub_model.predict(img_tensor)
    print(features.shape)

gives: [(100x133), (50x66), (25x33), (13x17), (7x9)]

I would love to hear some feedback from you and correct me if I am wrong.

jonilaserson commented 5 years ago

I want to emphasize that there is no difference between the new computation (in this MR) and the previous computation for ResNet50 (and maybe also the other supported backbones). The center of the top-left anchor will come down to (stride/2, stride/2) in both the new computation and the old computation for these cases. However, the previous computation does not generalize to some other backbones, while the new one does.

Let's take an image with size 800x800, P3 output is going to be 100x100 according to code.

According to what code? This happens to be true for the current implementation of ResNet50 but this is not true in general. For example, if you do 5x5 conv -> 2x2 maxpool -> 5x5 conv -> 2x2 maxpool -> 5x5 conv -> 2x2 maxpool and your convolutions are "valid" (without padding), then your output will be 96x96.
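For reference, the shape arithmetic for that hypothetical stack (each 'valid' 5x5 conv removes 4 pixels, each 2x2 max-pool halves and floors):

size = 800
for _ in range(3):
    size = size - 4    # 5x5 'valid' convolution: n -> n - 4
    size = size // 2   # 2x2 max-pooling: n -> n // 2
print(size)            # 96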

I am not sure why the top-left anchor should be (3,3).

So it is actually in (3.5, 3.5), (I corrected this later in the post). If you stack a 7x7 matrix to the top left of the image (without padding) then the center of that 7x7 matrix will be in (3.5, 3.5). Do you agree? So if you started with a 200x200 image, you will have after one layer a 194x194 feature map, and the anchors in the top-left pixel of this feature map should be centered in (3.5, 3.5).

hgaiser commented 5 years ago

They're not the same though, the old and new computations, even for resnet50. Whether they're the same depends on the input image shape. Run the computations with an image of shape 200x200 and you'll see.

jonilaserson commented 5 years ago

You're right. It's the same if the image size is divisible by 32, I think. So yes, this computation might actually improve the results for images with other shapes, like 200x200.

nikostsagk commented 5 years ago

So it is actually in (3.5, 3.5), (I corrected this later in the post). If you stack a 7x7 matrix to the top left of the image (without padding) then the center of that 7x7 matrix will be in (3.5, 3.5). Do you agree? So if you started with a 200x200 image, you will have after one layer a 194x194 feature map, and the anchors in the top-left pixel of this feature map should be centered in (3.5, 3.5).

Hmm. Yes, you seem right. I was thinking of the whole concept in the context of VGG16.

I like the idea that it now generalises better and I think that I get your point, but some results are not clear to me yet. In the VGG example with input 800x800 and strides in the pyramids [8, 16, 32, 64, 128], the offsets are (4.0, 4.0) in P3, (8.0, 8.0) in P4 and (16.0, 16.0) in P5, as expected, but then P6 and P7 also have an offset of (16.0, 16.0). Could the reason be that P6 and P7 come from kernels with a stride of 2?

jonilaserson commented 5 years ago

So in your example (input 800x800):
P3 has shape 100x100 and stride 8x8. Hence the offset should be (800 - 8*99)/2 = 4.
P4 has shape 50x50 and stride 16x16. Hence the offset should be (800 - 16*49)/2 = 8.
P5 has shape 25x25 and stride 32x32. Hence the offset should be (800 - 32*24)/2 = 16.
P6 has shape 13x13 and stride 64x64. Hence the offset should be (800 - 64*12)/2 = 16.
P7 has shape 7x7 and stride 128x128. Hence the offset should be (800 - 128*6)/2 = 16.
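For reference, a quick loop reproducing these offsets from just the level shapes and strides (a sketch using the formula above):

H = 800
levels = {'P3': (100, 8), 'P4': (50, 16), 'P5': (25, 32), 'P6': (13, 64), 'P7': (7, 128)}

for name, (h, stride) in levels.items():
    print(name, (H - (h - 1) * stride) / 2.0)
# P3 4.0, P4 8.0, P5 16.0, P6 16.0, P7 16.0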

So as you can see, the reason is that in P6 and P7 we did an imperfect "max-pooling": each one of the 13x13 pixels should cover 2x2 pixels from the previous 25x25 layer, which means we implicitly added a padding of 1 somewhere (because 13x13 covers a 26x26 area). The question is, where? This computation assumes that we added the padding symmetrically (i.e. 0.5 on each side). But I don't think max-pooling can do that.

So I'm guessing the max-pooling pads the 25x25 layer with another column and row of zeros somewhere, probably on the bottom and probably on the right. This breaks the symmetry. If feature (0,0) in P5 was centered at (16, 16), and feature (1,1) was centered at (48, 48) (because in P5 stride=32), then after we max-pool features (0,0), (0,1), (1,0), (1,1), the center should move to (32,32).

So you're right - this computation is wrong, assuming max-pooling adds an asymmetric padding.


nikostsagk commented 5 years ago

Thanks for breaking it down for me. Seems to work! 😀


stale[bot] commented 4 years ago

This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.