facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Merging FPN heads vs keeping FPN heads separate #174

Closed hgaiser closed 6 years ago

hgaiser commented 6 years ago

Hi there, I'm one of the maintainers of the keras-retinanet implementation of the RetinaNet paper. First off, thanks for releasing your implementation; it clarified a few things that were difficult to derive from the paper.

While going through your implementation I noticed you keep the FPN heads separate. This way, each level of P3, P4, P5, P6, P7 has its own loss value attached to its regression and classification submodels. In our implementation, we merge the regressions and classifications of each pyramid level and attach one loss function to the concatenated output. I.e., we have one focal loss function, which takes as input the classification values of P3, P4, P5, P6, P7 concatenated and computes the loss for all levels as if they were one large output.

Intuitively I would say this is indeed different. P7 only has a few anchors w.r.t. P3, so any error occurring in P7 is "drowned" by the errors of P3/P4/P5/P6. Keeping pyramid levels separate would treat the errors in each pyramid level equally.

I'm currently running an experiment where I've kept the levels separate to compare the result with that of merging the levels. I will update this issue when I have some results to share.

I was wondering if the authors of RetinaNet have tested merging vs separation of FPN heads, or what they think about the difference between the two?

rbgirshick commented 6 years ago

@hgaiser I'm a little unclear on your description. I'm not sure what it means mathematically to "concatenate and compute the loss". Having separate losses (which are implicitly summed in the total training loss) vs. concatenating and computing losses can only be understood wrt lower-level implementation details of the loss--most importantly how it's normalized. Depending on the normalization, something that looks like "summing" at a high level might implement averaging, e.g. if you sum N losses and each one is individually normalized by dividing it by N. In our implementation, the losses are "separate" (which means they are summed), but each loss is divided by the same total number of foreground examples in the minibatch--so it's actually more like an average loss per foreground example. I think one cannot reason about how this compares with your implementation without diving into the details of the loss function and what normalization is used.
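A minimal numeric sketch of the point rbgirshick makes here, using hypothetical per-anchor loss values and an illustrative foreground count (none of these numbers come from Detectron): if each level's loss is divided by the same total foreground count and the results are summed, this is identical to computing one loss over the concatenation and dividing once.

```python
import numpy as np

# Hypothetical per-anchor focal-loss values for three pyramid levels
# (illustrative numbers, not taken from Detectron).
level_losses = [np.array([0.5, 0.2]),       # e.g. P3
                np.array([0.1]),            # e.g. P4
                np.array([0.4, 0.3, 0.1])]  # e.g. P5

n_fg = 4  # total number of foreground examples in the minibatch

# "Separate" losses: each level's loss is divided by the SAME total
# foreground count, then the per-level losses are summed.
separate = sum(l.sum() / n_fg for l in level_losses)

# "Concatenated" loss: merge all levels first, compute one total loss,
# divide by the same count.
concatenated = np.concatenate(level_losses).sum() / n_fg

assert np.isclose(separate, concatenated)
```

The equivalence holds only because the divisor is shared across levels; per-level divisors would break it, which is exactly the implementation detail rbgirshick points at.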

hgaiser commented 6 years ago

Thank you for your reply @rbgirshick

@hgaiser I'm a little unclear on your description. I'm not sure what it means mathematically to "concatenate and compute the loss".

I'll explain through an example: let's say the regression values of P3 are shaped (70000, 4) and the regression values of P4 are shaped (30000, 4). You could concatenate them on the first axis, leading to a blob of regression values shaped (100000, 4). Repeat this for the remaining pyramid levels and the result is one big blob representing all regression values.
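In numpy terms, the merging described above is just a concatenation along the anchor axis (the shapes below mirror the example; the zero values are placeholders):

```python
import numpy as np

# Placeholder regression blobs with the shapes from the example above.
p3_regression = np.zeros((70000, 4))
p4_regression = np.zeros((30000, 4))

# Concatenate along the anchor axis into one big blob of regression values.
merged = np.concatenate([p3_regression, p4_regression], axis=0)
assert merged.shape == (100000, 4)
```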

In my opinion this makes the implementation easier, since you're not dealing with 5 * 2 blobs and losses, but only with 2 (regression / classification). I was wondering whether the two approaches are equivalent, which after your explanation I think they are.

Having separate losses (which are implicitly summed in the total training loss) vs. concatenating and computing losses can only be understood wrt lower-level implementation details of the loss--most importantly how it's normalized. Depending on the normalization, something that looks like "summing" at a high level might implement averaging, e.g. if you sum N losses and each one is individually normalized by dividing it by N. In our implementation, the losses are "separate" (which means they are summed), but each loss is divided by the same total number of foreground examples in the minibatch--so it's actually more like an average loss per foreground example.

I see, I misinterpreted the code then. I had assumed each loss is normalized by the number of foreground examples from that pyramid level only. Thank you for clearing that up! In that case I think it indeed shouldn't matter whether you compute the loss over all pyramid levels individually, divide by N and then sum - versus concatenating all pyramid levels, computing the loss and then dividing by N.
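The distinction between the two normalizations can be made concrete with a small sketch (hypothetical loss values and foreground counts, chosen only to illustrate): normalizing each level by its own foreground count, which is what was initially assumed, generally gives a different total than dividing everything by the shared total count.

```python
import numpy as np

# Hypothetical per-anchor losses and foreground counts for two levels
# (illustrative numbers only).
losses = [np.array([0.5, 0.2]), np.array([0.4, 0.3, 0.1])]
fg_per_level = [2, 3]
n_fg_total = sum(fg_per_level)

# Misread version: normalize each level by ITS OWN foreground count.
per_level = sum(l.sum() / n for l, n in zip(losses, fg_per_level))

# Detectron's version (per rbgirshick): divide every level's loss by
# the TOTAL foreground count in the minibatch.
global_norm = sum(l.sum() / n_fg_total for l in losses)

# The two schemes disagree whenever the levels' counts differ.
assert not np.isclose(per_level, global_norm)
```

Only the globally-normalized version matches the concatenate-then-divide formulation.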

I think one cannot reason about how this compares with your implementation without diving into the details of the loss function and what normalization is used.

Agreed. Though I used a different method of computing the loss, it should presumably result in the same value.

In case you're interested, I ran an experiment where I implemented what I thought was happening in Detectron (dividing by the number of foreground examples for a specific pyramid level) and the mAP didn't reach values higher than 0.250, whereas we got an mAP of 0.338 with our "standard" implementation. I don't really have a proper explanation for this result though.

What I find an interesting question is: does the current method of normalization make sense? Considering the large difference in the number of anchors between the pyramid levels, the network has much more to gain by improving on P3 as opposed to improving P7. Though I suppose that since all pyramid levels share the backbone, optimizing one automatically means optimizing the others. However, I wonder: if we were to completely separate the pyramid levels from each other (i.e., they share no layers), would you see a significant difference in contribution among the different pyramid levels?

xmyqsh commented 6 years ago

@hgaiser I have thought about this before, thoroughly. As @rbgirshick said, it depends on the implementation details around the loss function. It can be the same whether you keep the losses separate or concatenate them. You have already agreed on this point.

Second,

What I find an interesting question is: does the current method of normalization make sense? Considering the large difference in the amount of anchors between the pyramid levels, the network has much more to gain by improving on P3 as opposed to improving P7. Though I suppose since all pyramid levels share the backbone, optimizing one automatically means optimizing the other. However I wonder if we would completely separate pyramid levels from eachother (ie. they share no layers), would you see a significant difference in contribution among the different pyramid levels?

I don't think your claim that "considering the large difference in the amount of anchors between the pyramid levels, the network has much more to gain by improving on P3 as opposed to improving P7" is correct. Although P3 has smaller-scaled and more numerous anchors than P7, it is also harder for small objects to collect positive samples than for large objects, given the IoU-based definition of a positive sample. So it is appropriate for small objects, which are detected on P3, to have smaller and more numerous anchors than the large objects detected on P7.

In short, the samples of small and large objects are roughly in balance, so it is more appropriate to normalize the loss globally than per pyramid level.

Third, should the pyramid levels share layers or not? The paper reports no big performance difference between the shared and non-shared versions. But I think there is another advantage to the shared version. Consider a dataset in which medium-scale objects, which would be detected by P5, are rare. Then there is a lot of parameter redundancy in the P5 branch. This is an extreme example, certainly. In other words, the shared version is more parameter-efficient when the number of objects at different scales is distributed non-uniformly.

hgaiser commented 6 years ago

Thank you @xmyqsh for your response. I agree with you that objects from P3 are harder to regress / classify than objects from P7. Whether this makes the difficulty of both levels equal is hard for me to say, but I get your point. As my original question was answered, I'll close this issue. Thanks for the interesting discussion!

hgaiser commented 6 years ago

Something I hadn't considered when making this issue is that for RetinaNet it doesn't matter if you go about it one way or the other. However, Mask R-CNN and Faster R-CNN execute their "head" on each pyramid level, which is a lot easier if these levels are kept separate.