Question of the final masks

zkjisj commented 5 months ago

As I understand it, num_multimask_outputs is the number of final masks, which defaults to 3. I'm confused about the meaning of self.num_mask_tokens, which is num_multimask_outputs plus 1. When the masks are finally produced, the shape of the output masks seems to be b*self.num_mask_tokens, and postprocess_masks doesn't change that shape afterwards. However, I have seen some implementations that output 3 masks in the end, matching the default value of num_multimask_outputs. They use the predictor, which follows the same process as sam.

heyoeyo commented 5 months ago

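The extra 4th mask (the 'zeroth' mask in the output) is meant for the case where multiple prompts are provided. It is the mask that gets returned when multimask_output is False, while the other 3 are returned when it is True. You can see how the selection is done in the forward function of the mask decoder.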

The paper explains this in a bit more detail in the appendix, in the second paragraph of the section "Making the model ambiguity-aware" (page 17).
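To make the shapes concrete, here is a minimal sketch of the selection step as I read it in the mask decoder's forward function. It is not the repo's code: it uses NumPy arrays filled with zeros as stand-ins for the real tensors, and select_masks, the batch size and the 8x8 mask size are names and values assumed for illustration only.

```python
import numpy as np

NUM_MULTIMASK_OUTPUTS = 3                    # default discussed above
NUM_MASK_TOKENS = NUM_MULTIMASK_OUTPUTS + 1  # one extra "zeroth" mask


def select_masks(masks, iou_pred, multimask_output):
    """Hypothetical helper mimicking the slicing done after mask prediction."""
    if multimask_output:
        mask_slice = slice(1, None)  # drop the zeroth mask, keep the other 3
    else:
        mask_slice = slice(0, 1)     # keep only the zeroth mask
    return masks[:, mask_slice, :, :], iou_pred[:, mask_slice]


# Stand-in decoder outputs: b x num_mask_tokens x H x W (sizes are assumptions).
batch, height, width = 2, 8, 8
masks = np.zeros((batch, NUM_MASK_TOKENS, height, width), dtype=np.float32)
iou_pred = np.zeros((batch, NUM_MASK_TOKENS), dtype=np.float32)

multi, multi_iou = select_masks(masks, iou_pred, multimask_output=True)
single, single_iou = select_masks(masks, iou_pred, multimask_output=False)

print(multi.shape)   # (2, 3, 8, 8)
print(single.shape)  # (2, 1, 8, 8)
```

So the decoder always predicts num_mask_tokens masks internally, but only 3 of them (or only the zeroth one) come out of forward, which would explain why implementations built on the predictor end up with 3 masks.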

zkjisj commented 5 months ago

@heyoeyo Thanks for your reply, it helps a lot!

facebookresearch / segment-anything

Question of the final masks #765