Closed zkjisj closed 5 months ago
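The extra 4th mask (the 'zeroth' mask in the output) is the single-mask output. It's what gets returned when multimask_output is False, which is the setting meant for cases where multiple prompts are provided. You can see how it's used at the end of the forward function of the mask decoder.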
The paper explains this in a bit more detail in the appendix under the second paragraph of the section: Making the model ambiguity-aware (page 17).
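As a rough illustration of that selection step, here's a minimal pure-Python sketch. It is not the actual SAM code: the function name `select_mask_tokens` and the string labels are made up for the example, and it only assumes that the decoder produces num_multimask_outputs + 1 masks per prompt and then slices them based on multimask_output.

```python
# Hypothetical stand-in for the slicing done at the end of the mask decoder's
# forward pass. Names and data here are illustrative assumptions, not SAM's API.

def select_mask_tokens(per_prompt_masks, multimask_output):
    """per_prompt_masks: list (batch) of lists, one entry per mask token."""
    if multimask_output:
        mask_slice = slice(1, None)  # drop token 0, keep the multimask outputs
    else:
        mask_slice = slice(0, 1)     # keep only token 0, the single-mask output
    return [masks[mask_slice] for masks in per_prompt_masks]


num_multimask_outputs = 3
num_mask_tokens = num_multimask_outputs + 1  # 4 masks predicted internally

# One prompt in the batch, with one placeholder label per mask token.
decoder_output = [[f"mask_{i}" for i in range(num_mask_tokens)]]

multi = select_mask_tokens(decoder_output, multimask_output=True)
single = select_mask_tokens(decoder_output, multimask_output=False)

print(len(multi[0]), multi[0])    # 3 ['mask_1', 'mask_2', 'mask_3']
print(len(single[0]), single[0])  # 1 ['mask_0']
```

So under these assumptions the decoder always predicts 4 masks internally, but the caller only ever sees 3 of them or 1 of them, never all 4 at once.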
@heyoeyo Thanks for your reply, it helps a lot!
In my understanding, num_multimask_outputs means the number of final masks, which defaults to 3. I'm confused about the meaning of self.num_mask_tokens, since it is num_multimask_outputs plus 1. When the masks are finally produced, the shape of the output masks seems to be b*self.num_mask_tokens. After that, postprocess_masks doesn't change the shape. However, I have seen some implementations that output 3 masks in the end, matching the default value of num_multimask_outputs. They make use of the predictor, which follows the same process as sam.