dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.

I can not understand protonet. #86

Closed: mshmoon closed this issue 5 years ago

mshmoon commented 5 years ago

I'm currently working based on your work, but I can't understand the purpose of protonet, or why protonet has 32 output channels. Thank you for your help.

dbolya commented 5 years ago

I recommend you read our paper listed in the readme. It's all in there.

mshmoon commented 5 years ago

> I recommend you read our paper listed in the readme. It's all in there.

I have read your paper....

dbolya commented 5 years ago

Protonet generates the prototype masks that we combine into final output masks. Reread section 3 in the paper for a detailed rundown of our whole method.

There are 32 channels in protonet because we're using 32 prototype masks, which we found to be the best mix of speed and performance.
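For intuition, here is a minimal PyTorch sketch of a protonet-style head; the layer counts, channel widths, and upsampling are illustrative assumptions rather than this repo's exact configuration:

```python
import torch
import torch.nn as nn

class ProtoNetSketch(nn.Module):
    """Illustrative protonet-style head: an FCN that maps an FPN feature
    map to k = 32 prototype masks, one per output channel. Layer sizes
    are assumptions, not the exact layers used in this repo."""

    def __init__(self, in_channels=256, num_prototypes=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(256, num_prototypes, 1),  # 32 channels = 32 prototypes
        )

    def forward(self, x):
        return self.layers(x)  # (batch, 32, 2H, 2W)

protos = ProtoNetSketch()(torch.randn(1, 256, 69, 69))
print(protos.shape)  # torch.Size([1, 32, 138, 138])
```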

Croooooow commented 5 years ago

I think it's the prototypes that confuse you, right? The protonet outputs a set of prototype masks. Concretely, the prototypes are 32 candidate masks, denoted [A1, A2, ..., A32], and the prediction head outputs 32 coefficients, denoted [X1, X2, ..., X32], for each box. Then, for box B, its mask is obtained by a linear combination of these prototypes: B = A1·X1 + A2·X2 + ... + A32·X32. Similarly, for mask C: C = A1·Y1 + A2·Y2 + ... + A32·Y32, where [Y1, Y2, ..., Y32] is another set of coefficients, predicted for box C.
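In code, that combination is just a weighted sum over the prototype dimension. A quick sketch with random tensors standing in for real prototypes and predicted coefficients (YOLACT additionally applies a sigmoid and crops the result with the predicted box):

```python
import torch

prototypes = torch.randn(32, 138, 138)  # [A1, ..., A32], one mask per channel
coeffs_B = torch.randn(32)              # [X1, ..., X32] predicted for box B
coeffs_C = torch.randn(32)              # [Y1, ..., Y32] predicted for box C

# B = A1*X1 + A2*X2 + ... + A32*X32, as a sum over the prototype axis
mask_B = torch.einsum('khw,k->hw', prototypes, coeffs_B)
mask_C = torch.einsum('khw,k->hw', prototypes, coeffs_C)
print(mask_B.shape, mask_C.shape)  # torch.Size([138, 138]) x2
```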

dbolya commented 5 years ago

Since this has been open so long, I'm going to close it. Feel free to reopen if you have any more questions.

chuong98 commented 4 years ago

Hi @dbolya I would like to ask a bit deeper into Prototype Masks, specifically its interpretation.

In my understanding, the Prototype Masks play the role of 'Principal Components of Masks', and Yolact uses 'coefficient parameters' to linearly combine the components into one final mask.

Hence, in the following two extreme cases, we wouldn't need the 'Coefficient Branch': (A) number of Prototypes = 1, meaning the Prototype Net predicts Foreground vs. Background (regardless of class). Then we just crop the Prototype Mask using the predicted boxes. This approach of course fails when two (possibly overlapping) objects fall in the same box.

(B) number of Prototypes = C (the number of classes), i.e. semantic segmentation. Then we don't need the 'Coefficient Branch' either, since simply cropping the Prototypes gives us the masks. In other words, adding a conv1x1 layer (k in-channels, C out-channels) after the prototype masks, to directly predict C classes, is equivalent to combining 'coefficients' with 'prototypes'. However, the network is much simpler, without adding an extra branch to the head.
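A small sketch of the equivalence described in (B): a conv1x1 layer with k in-channels and C out-channels computes, per pixel, a fixed linear combination of the k prototype channels, i.e. one built-in coefficient vector per class instead of one predicted per box (shapes below are illustrative):

```python
import torch
import torch.nn as nn

k, C = 32, 81                        # prototypes, classes (illustrative)
prototypes = torch.randn(1, k, 138, 138)

# conv1x1 over the prototype channels: one coefficient vector per class
conv1x1 = nn.Conv2d(k, C, kernel_size=1, bias=False)
sem_masks = conv1x1(prototypes)      # (1, C, 138, 138)

# The same computation written as an explicit linear combination
W = conv1x1.weight.view(C, k)        # C fixed coefficient vectors
manual = torch.einsum('bkhw,ck->bchw', prototypes, W)
print(torch.allclose(sem_masks, manual, atol=1e-5))  # True
```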

My questions are then:

  1. I believe the simple (maybe naive) solution (B) should have come first; did you try it?

  2. Is there any reason you separate the semantic head into two independent parts, then combine them again later?

  3. I am thinking that having 3 branches may add different (possibly conflicting) objectives to the head, hence reducing the performance of the Classification + BBox branches. Otherwise, the box mAP should be at least equal to RetinaNet's.

Thank you, and sorry for the long question. I hope to discuss the intuition further with you.

dbolya commented 4 years ago

@chuong98 Good questions! Your analysis in (A) and (B) is correct.

For your questions:

  1. We haven't been able to try the k=1 ablation, but in the most recent version of the paper we go down to 8 prototypes, and performance is, as expected, worse (see Tab. 2b). However, it's not significantly worse, so a 1-prototype version may be worth trying. I will note, however, that the coefficients are generated basically for free: it's just a few extra outputs (32) on top of the 85 already produced by the object detector per anchor (81 for class + 4 for box), and as you can see in that ablation it barely affects speed (there's a rough sketch of that breakdown after this list).

  2. Protonet isn't a semantic head. The information it captures is not semantic but spatial. See Fig. 5 in the paper for a view of what it captures; in general, if the information were semantic, it wouldn't be able to distinguish between instances of the same class. This is why we split mask prediction into two parts. For more in-depth reasoning, check out the start of Sec. 3 in the paper.

  3. We had the same worry as you, so we tested this in detail: not training the box branch made mask prediction worse, and not training the mask branch made box prediction worse. So it actually looks like they improve each other. I think the issue with our box mAP is more an implementation / configuration issue. Our backbone is not RetinaNet's, so the box mAP is expected to be lower, but even when we recreate RetinaNet's architecture in this codebase with this training configuration, we cannot reproduce RetinaNet's results. So I think the issue lies more in how we compute / normalize our loss functions, and maybe in some other implementation differences as well.
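To make the arithmetic in point 1 concrete, here is a rough sketch of how an anchor-based prediction head's outputs break down; the layer shapes and anchor count are illustrative assumptions, not this repo's exact configuration:

```python
import torch
import torch.nn as nn

num_classes, num_box, num_protos, num_anchors = 81, 4, 32, 3
in_ch = 256

# Three sibling predictors over a shared feature map, roughly how
# anchor-based heads split their per-anchor outputs.
class_pred = nn.Conv2d(in_ch, num_anchors * num_classes, 3, padding=1)
box_pred   = nn.Conv2d(in_ch, num_anchors * num_box,     3, padding=1)
coef_pred  = nn.Conv2d(in_ch, num_anchors * num_protos,  3, padding=1)

x = torch.randn(1, in_ch, 69, 69)
print(class_pred(x).shape, box_pred(x).shape, coef_pred(x).shape)
# 81 + 4 + 32 = 117 outputs per anchor: the 32 mask coefficients are a
# small addition to the 85 the detector already predicts.
```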