dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License

[Question] Mask Coefficients #171

Closed timdah closed 4 years ago

timdah commented 4 years ago

Hi, I have a question about the mask coefficients. Since the k mask coefficients correspond to the k prototypes, I wonder how these values are predicted or supervised. If I understand correctly, all tasks run in parallel, so the prototypes and the mask coefficients are generated independently of each other. The paper says there's no loss on the prototypes, so I guess backpropagation only takes place for the mask coefficients, but how can the final mask loss be traced back to k different coefficients? And since the prototypes are not supervised at all, how can you say which prototypes are good or bad for a prediction? In summary, it is unclear to me how the mask coefficient branch is trained.

I would really appreciate some deeper explanation on the mask coefficient branch :blush:

dbolya commented 4 years ago

So neither the k mask coefficients nor the k prototypes have losses directly on them. But they get supervision from the final mask loss. For completeness's sake (and so I can reference this in the future), I'll go over the entire process to compute the mask loss in detail.

Let's say for one instance you have vector c (1 x k) and prototype matrix P (h x w x k). For the ground truth, we also have gt_box (1 x 4) and a binary gt_mask (h x w x 1). Then to compute the loss, we first compute the assembled mask:

mask = sigmoid(P @ c.t())

(where @ is matrix multiplication and .t() is transpose).

Then the loss is

mask_loss_tensor = binary_cross_entropy(mask, gt_mask, reduction='none')
# equivalent to:
# mask_loss_tensor = - gt_mask * log(mask) - (1 - gt_mask) * log(1 - mask)

Now as per the extra step in the paper, we crop the mask loss with the gt box for stability (hence why we don't sum the loss yet).

mask_loss_crop = crop(mask_loss_tensor, gt_box)

Finally, we sum the loss and normalize by the gt bounding box area:

mask_loss = mask_loss_crop.sum() / (gt_box.w * gt_box.h)
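
Putting those steps together, here's a rough PyTorch sketch for a single instance (the slicing-based crop and the box format are simplifications for illustration, not the repo's actual crop implementation):

import torch
import torch.nn.functional as F

def mask_loss_sketch(c, P, gt_mask, gt_box):
    # c: (1, k) mask coefficients, P: (h, w, k) prototypes,
    # gt_mask: (h, w) binary float mask, gt_box: (x1, y1, x2, y2) in pixels (hypothetical format)
    mask = torch.sigmoid(P @ c.t()).squeeze(-1)                     # assemble the mask (eq. 1)
    loss = F.binary_cross_entropy(mask, gt_mask, reduction='none')  # per-pixel BCE, not summed yet
    x1, y1, x2, y2 = gt_box
    loss_crop = loss[y1:y2, x1:x2]                                  # crop the loss to the gt box
    return loss_crop.sum() / ((x2 - x1) * (y2 - y1))                # normalize by gt box area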

So both the k prototypes and k mask coefficients get supervision through the

mask = sigmoid(P @ c.t())

step (I think it's eq 1 in the paper). As for how the network decides which instances require which of the k prototypes, in general that comes down to random initialization and the path gradient descent takes through the weights.

For instance, let's take a look at how backpropagation would update the weights for protonet (note that the derivative of sigmoid(x) is sigmoid(x) (1 - sigmoid(x))):

mask = sigmoid(P_1 c_1 + ... + P_k c_k)
∇ mask(P) = mask (1 - mask) * c.t()

So the partial derivative w.r.t. P_1 for instance, would be

∂ mask(P) / ∂ P_1 = mask (1 - mask) * c_1

I.e., the "loss signal" that prototype 1 gets is essentially just weighted by c_1, and the pixels that get loss are weighted by mask (1 - mask) (and the derivative of mask_loss, but can't be arsed to work that out too).

In the simplest terms, if c_i was high and there was a high error, then backprop will try to reduce the activations of P_i (and vice versa for negative coefficients).

Then, if you work out the math for the other branch (i.e., ∇ mask(c)), it will be similar: if the pixels of P_i are high where the final mask loss is high, then backprop will try to make c_i less positive (and less negative if it's the other way around).
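
If you want to see those two weightings concretely, here's a toy autograd check (not code from the repo; mask.sum() stands in for the real mask loss):

import torch

k, h, w = 4, 3, 3
P = torch.rand(h, w, k, requires_grad=True)    # prototypes
c = torch.rand(1, k, requires_grad=True)       # mask coefficients

mask = torch.sigmoid(P @ c.t()).squeeze(-1)    # (h, w)
mask.sum().backward()                          # stand-in for the real mask loss

# Gradient w.r.t. prototype 1 is weighted by c_1 ...
print(torch.allclose(P.grad[..., 0], mask * (1 - mask) * c[0, 0]))          # True
# ... and gradient w.r.t. c_1 is weighted by the pixels of P_1.
print(torch.allclose(c.grad[0, 0], (mask * (1 - mask) * P[..., 0]).sum()))  # True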

Sorry for the long post and random mix of pseudocode and pytorch notation. Hopefully, this should clear up how training works, but let me know if anything needs to be clarified.

timdah commented 4 years ago

Thank you very much for the quick and detailed reply! This cleared up the questions for me.

timdah commented 4 years ago

Hi @dbolya , I have three more small questions but don't want to create a new issue for them.

1.

I'm not sure if I have understood correctly how the sizes of the anchors come about.

> and place 3 anchors with aspect ratios [1; 1/2; 2] on each. The anchors of P3 have areas of 24 pixels squared, and every subsequent layer has double the scale of the previous (resulting in the scales [24; 48; 96; 192; 384]).

So starting from P3 with w=h=69, we are at ~1/8 of the input image size, so I suppose the anchors have the following sizes [3x3; 2x4; 4x2], because 3x8 is the size of the anchor in relation to the input resolution. Is this assumption correct? In that case the scales [24; 48; ...] refer only to the anchor with aspect ratio 1, right?


2.

How many anchors are there in total? Are the three anchors put on each feature on each P level, so that we have 3*(69² + 35² + 18² + 9² + 5²) = 19248?


3.

Is the predicted bounding box relative to the anchor? Like a vector from the anchor box center to the bounding box center, with width and height relative to the anchor size?

dbolya commented 4 years ago

1.

By "sizes" you mean relative to P3? If so, then yes, the ar=1 anchor is 3x3. However, the others are ~4.24 x 2.12 and ~2.12 x 4.24. If you were rounding, then you're correct, but I don't round. As in that excerpt, "the anchors of P3 have areas of 24 pixels squared" (or in this case 3 P3-pixels squared). So each anchor has an area of 9.

The formula for computing the width and height of each anchor is:

width  = scale * sqrt(ar)
height = scale / sqrt(ar)

As you can see, width / height = ar and width * height = scale ** 2
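
For example, plugging in the P3 scale of 3 (in P3-pixels):

from math import sqrt

scale = 3                              # P3 scale in P3-pixels (24 / 8)
for ar in [1, 1/2, 2]:
    width  = scale * sqrt(ar)
    height = scale / sqrt(ar)
    print(f"ar={ar}: {width:.2f} x {height:.2f}")
# ar=1: 3.00 x 3.00
# ar=0.5: 2.12 x 4.24
# ar=2: 4.24 x 2.12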


2.

Yup 19248.
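
(That's just 3 anchors per cell over the five feature maps:)

conv_sizes = [69, 35, 18, 9, 5]        # P3..P7 feature map sizes for a 550x550 input
anchors_per_cell = 3                   # aspect ratios [1, 1/2, 2]
print(anchors_per_cell * sum(s * s for s in conv_sizes))   # 19248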


3.

We use SSD's regressors (I think the R-CNNs do the same tho).

Check the paper for details, but the gist is that we predict [dx, dy, sw, sh]. Then if e.g. ax is the anchor's x, we do

x = ax + dx
y = ay + dy
w = aw * exp(sw)
h = ah * exp(sh)

In reality there's some division and multiplication of the variances of each term, but you can check the paper for the gritty details.
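
For reference, here's a sketch of that SSD-style decode with the usual variance convention (the 0.1 / 0.2 values are the common SSD defaults, not necessarily exactly what this repo uses; note that in this convention the center shift is also scaled by the anchor size):

import torch

def decode_sketch(loc, priors, variances=(0.1, 0.2)):
    # loc:    (n, 4) predicted [dx, dy, sw, sh]
    # priors: (n, 4) anchors as [ax, ay, aw, ah] in center form
    xy = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]   # shift the center
    wh = priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])        # scale the width/height
    return torch.cat((xy, wh), dim=1)                                # boxes, still in center form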

timdah commented 4 years ago

Wow, really a quick answer, and that on a Sunday :smiley: Thanks 👍


1.

Yes, I mean relative to P3. But the anchors have these fixed sizes on every P layer, right? For example, ar=1 is fixed at 3x3 on P3-P7, and the bigger covered area comes from the lower resolution of deeper layers? P3: anchor1 = (3x8)x(3x8), P4: anchor1 = (3x16)x(3x16) ...


3.

Okay, but (ax, ay) is the anchor's center coordinate and (dx, dy) the distance from it to the center of the bounding box?

dbolya commented 4 years ago

Ok this one took longer because I went to sleep.


1.

No, the anchor size relative to the conv size is not fixed. Each layer's anchor scales and sizes are computed independently, without rounding. In fact, I never compute this Pi-pixel representation at all -- it's all just relative to the image size (so / 550).

The relative size is computed exactly as I described, divided by 550. That's it. I used to do it the way you were describing, but that introduced aliasing, especially with FPN, which only has one prediction head used for all Pi. When you use the anchors, you're computing things relative to their pixel width and height, not their Pi width and height, so all you're doing by rounding is introducing aliasing.
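
A minimal sketch of that "just divide by 550" idea (anchor sizes only; the anchor centers, which tile each Pi grid, are left out):

from math import sqrt

img_size = 550
scales = [24, 48, 96, 192, 384]        # one scale per Pi (P3..P7)
aspect_ratios = [1, 1/2, 2]

# Anchor sizes as fractions of the image -- no Pi-pixel rounding anywhere.
anchor_sizes = [(s * sqrt(ar) / img_size, s / sqrt(ar) / img_size)
                for s in scales for ar in aspect_ratios]
print(anchor_sizes[0])                 # P3, ar=1: (0.0436..., 0.0436...)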


3.

Think about it: it doesn't matter. A shift from the top left corner is the same as a shift from the center. All that matters is where the anchor starts (as long as you're consistent in what "x" means for instance), but yeah the anchor starts centered.

timdah commented 4 years ago

Haha, for me it's time now.


1.

I have to admit that I'm confused now. How is it possible to have only one prediction head with FPN? In the last few hours the prediction head count has become unclear to me anyway. I struggled with it because I don't know how to arrive at the 19248 predictions. Referring to figure 4 in your paper, the prediction head outputs WxH predictions. But since we have 3 anchors, there must be WxHx3 predictions per layer to get to the 19248 in total. So my first thought was that there must be 3 prediction heads attached per layer, but that doesn't sound right somehow. And now you say that with FPN there is only one prediction head 😄


3.

Okay that makes sense.

dbolya commented 4 years ago

It can be a little confusing, but the figure is correct. Let's look at the box branch for example (since that's what we're talking about anyway). We output W x H x 4a box regression coefficients per layer. a is the number of anchors here, so that's W x H x 3 x 4 coefficients (or W x H x 3 boxes), which falls in line with the calculation you made earlier.

Then yeah, the same prediction head is attached to each Pi (note: same weights, but different anchors for each layer). Since the anchors are different, they need to be correlated for the prediction head to learn anything (otherwise, since it can't tell what layer the features came from and what anchors would be associated with them, it can't learn anything). The way FPN correlates them (and the way that intuitively makes sense) is to have each be 2x bigger than the last, so the anchors depend on the stride of the current Pi.

Now the network doesn't know or care about the anchor box; all it's outputting is some numbers to stretch the box. Intuitively, you want the sw and sh it outputs to be exactly the same if the object is 2x smaller (and the network has to use a smaller anchor), but if we rounded at the Pi-pixel scale, then this wouldn't be the case (hence the aliasing I mentioned earlier).

timdah commented 4 years ago

Okay, I missed the a, so the output per Pi is clear now.

I'm struggling with the shared prediction head over the Pi's. How can the head be modified to match the WxH of each Pi? At P3 the head starts with a 69x69x256 layer and at P4 with 35x35x256; how can this be changed dynamically? And you note they share the weights, but isn't the number of weights in P4 smaller because of the smaller dimensions? Sorry if I annoy you and the questions are stupid, but I just recently started diving into machine learning and thus instance segmentation. Things overwhelm me sometimes.

dbolya commented 4 years ago

The point of conv layers is that they don't depend on the spatial dimensions. You can pass in any size image to the same convnet and you'll get a valid output with size depending on the input size.

So the same set of convnet weights can be used for any image size (as long as the number of features, in this case 256, is consistent). This is because of how convnets are defined: they slide a kernel over an image, and the weights just tell you how to map the input pixels in that kernel to a new pixel in the output. As long as the number of features and the kernel size stay the same, the rest can change.

Thus the number of weights is the same for each Pi layer.
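
To make that concrete, here's a toy stand-in for the box branch of the shared head (the real YOLACT head has more layers; this is just to show the shape and parameter-count behavior):

import torch
import torch.nn as nn

num_anchors = 3
box_head = nn.Conv2d(256, num_anchors * 4, kernel_size=3, padding=1)   # 4 box coefficients per anchor

p3 = torch.rand(1, 256, 69, 69)   # P3 features
p4 = torch.rand(1, 256, 35, 35)   # P4 features

print(box_head(p3).shape)   # torch.Size([1, 12, 69, 69])
print(box_head(p4).shape)   # torch.Size([1, 12, 35, 35])
# Same weights, same number of parameters; only the spatial output size changes.
print(sum(p.numel() for p in box_head.parameters()))   # 27660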