CharlesShang / FastMaskRCNN

Mask RCNN in TensorFlow
Apache License 2.0

clarification #3

Open KaimingHe opened 7 years ago

KaimingHe commented 7 years ago

Hi Charles,

Thank you for your interest and for implementing Mask R-CNN!

I would like to clarify some descriptions in your Readme (which may suggest misunderstanding of our work):

"The original work involves two stages, a pyramid Faster-RCNN for object detection and another network (with the same structure) for instance level segmentation."

This is not true. In our original work, object detection and instance segmentation are in one stage. They are in parallel, and they are two tasks of a multi-task learning network.

I hope this will ease your effort of a correct reproduction.

parhartanvir commented 7 years ago

@CharlesShang I cannot find any open issues in particular on the repo. If you need help in a particular direction, let me know.

CharlesShang commented 7 years ago

I must have misunderstood some details in your paper.

"For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, ..." (Section 3.1, Implementation Details)

CharlesShang commented 7 years ago

@parhartanvir @KaimingHe Great!!!! I have some questions.

Sorry for the delayed reply, I'm just back from a vacation.

xqms commented 7 years ago

@CharlesShang: Thanks for your effort to implement this very nice work!

In the original paper, Figure 3 (page 4) shows only 80 channels in the mask, but I think it should be 81 because there's another background class.

As far as I understand, the mask branch predicts a binary segmentation mask for each object class, so there is no need for a background mask.

parhartanvir commented 7 years ago

@CharlesShang, I believe there should not be a background class for the mask. That is because there are K binary masks, one for each of the K classes. Having a background class for the Faster R-CNN / region proposal part makes sense, but since the mask loss is not computed across classes, a background mask is not needed.
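A minimal NumPy sketch of the idea above (the function name, shapes, and helper are my own, not from the repo; the real implementation would use TensorFlow ops): only the ground-truth class channel contributes to the loss, via a per-pixel sigmoid cross-entropy, so no background channel is ever involved.

```python
import numpy as np

def mask_loss(mask_logits, gt_masks, gt_classes):
    """Hypothetical per-RoI mask loss for K class-specific binary masks.

    mask_logits: (N, H, W, K)  one mask per class, no background channel
    gt_masks:    (N, H, W)     binary ground-truth masks for positive RoIs
    gt_classes:  (N,)          class index in [0, K) for each positive RoI
    """
    n = mask_logits.shape[0]
    # Pick only the ground-truth class channel; the other K-1 channels
    # receive no gradient, so there is no competition between classes.
    picked = mask_logits[np.arange(n), :, :, gt_classes]  # (N, H, W)
    # Numerically stable per-pixel sigmoid cross-entropy.
    loss = (np.maximum(picked, 0) - picked * gt_masks
            + np.log1p(np.exp(-np.abs(picked))))
    return loss.mean()
```

Because each channel is an independent sigmoid rather than one softmax over classes, "background" is simply a per-pixel probability below 0.5 in the selected channel.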

As far as training goes, I think what you are saying is right, i.e. forward-pass the images, accumulate/average the gradients, then do the backward pass.

I apologize, I haven't gone through the FPN paper yet. I'll go through it and see if I can help.

CharlesShang commented 7 years ago

@parhartanvir @xqms Thank you for your explanation. I think there's little difference in practice. Consider an example: segmenting an RoI of a horse. In the refinement stage, we already know it's a horse, so we only check the horse channel of the masks; pixels with probability greater than 0.5 are considered horse, otherwise background. In this process, the background channel is never used at either training or testing time, since only positive RoIs are extracted for training and testing.

For consistency, I'll adopt K+1 classes, so we don't need to subtract 1 when extracting masks.
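As a sketch of that extraction step (the channel layout is an assumption for the K+1 scheme: channel 0 is background, so the predicted class index maps directly onto a mask channel with no `- 1` offset):

```python
import numpy as np

def extract_binary_mask(mask_probs, class_id):
    """mask_probs: (H, W, K+1) per-pixel sigmoid probabilities, with
    channel 0 reserved for background (assumed layout).
    class_id: the predicted class from the box head, used as-is."""
    channel = mask_probs[:, :, class_id]       # no "- 1" needed with K+1 channels
    return (channel > 0.5).astype(np.uint8)    # 1 = object, 0 = background
```

The background channel exists only for index alignment; as noted above, it carries no loss and is never read at inference.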

xmyqsh commented 7 years ago

@CharlesShang

There are several RPNs in the pyramid. When building the losses, should I merge all the RoIs before sampling, or sample RoIs for each RPN and then compute the losses?

I have gone over the FPN paper. I think just one RPN is OK: an anchor_target_layer with inputs P2 through P5 generates the anchors, merges them together, and randomly samples from the merged pool; the normal proposal_layer and proposal_target_layer follow.
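A small sketch of the merge-then-sample idea above (function name, label encoding, and batch size are my own assumptions; 1 = positive, 0 = negative, -1 = ignored, as in common Faster R-CNN implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_merged_anchor_targets(labels_per_level, batch_size=256):
    """Merge anchor labels from all pyramid levels (e.g. P2..P5) into one
    pool, then randomly sample a single training batch across levels.

    labels_per_level: list of (A_i,) arrays with 1=pos, 0=neg, -1=ignore.
    Returns one flat label array where unsampled anchors are set to -1.
    """
    labels = np.concatenate(labels_per_level)
    keep = np.flatnonzero(labels >= 0)            # candidates from every level
    if len(keep) > batch_size:
        keep = rng.choice(keep, size=batch_size, replace=False)
    sampled = np.full_like(labels, -1)            # -1 = excluded from the loss
    sampled[keep] = labels[keep]
    return sampled
```

Sampling after merging keeps the positive/negative balance global, so a level with few anchors is not forced to contribute a fixed share of the batch.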

For the RoI-to-level assignment, assign each RoI of width w and height h (on the input image to the network) to the level Pk of the feature pyramid by Eqn. (1).

I think having four RPNs followed by four heads is neither elegant nor time-efficient, and it is hard to balance the four parts.
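For reference, Eqn. (1) of the FPN paper assigns an RoI to level k = floor(k0 + log2(sqrt(w*h)/224)), clamped to the available levels; a minimal sketch (the defaults k0 = 4, canonical size 224, and range P2..P5 follow the paper, the function name is mine):

```python
import math

def roi_to_fpn_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Map an RoI of size (w, h) on the input image to pyramid level Pk
    via FPN Eqn. (1): k = floor(k0 + log2(sqrt(w*h) / 224))."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, k_min), k_max)   # clamp to the levels that exist, P2..P5
```

So a 224x224 RoI lands on P4, a 112x112 RoI on P3, and very large or very small RoIs are clamped to P5 and P2; this routing replaces the four separate RPN/head stacks.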