25benjaminli opened this issue 9 months ago
Have you solved it? I also want to know the answer to this question.
@MyFirst905 I have not "solved it," but I have a rough idea of why this is the case. According to the paper:
"With one output, the model will average multiple valid masks if given an ambiguous prompt. To address this, we modify the model to predict multiple output masks for a single prompt (see Fig. 3). We found 3 mask outputs is sufficient to address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, we backprop only the minimum loss over masks. To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask"
If I am interpreting this correctly, the extra multimask outputs are meant to capture different levels of mask granularity (whole, part, and subpart).
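To make the "backprop only the minimum loss over masks" part concrete, here is a minimal sketch of how that training trick could look. This is my own illustration, not the repo's actual training code (which isn't released); `mask_loss_fn` and the tensor shapes are assumptions:

```python
import torch

def min_over_masks_loss(pred_masks, gt_mask, mask_loss_fn):
    """Hypothetical sketch: pred_masks is (B, 3, H, W) logits for the three
    candidate masks, gt_mask is (B, H, W), and mask_loss_fn returns a
    per-example loss of shape (B,). Only the best-matching candidate per
    example receives gradient, so each output can specialize
    (whole / part / subpart) instead of averaging over valid masks."""
    per_mask = torch.stack(
        [mask_loss_fn(pred_masks[:, i], gt_mask) for i in range(pred_masks.shape[1])],
        dim=1,
    )  # (B, 3): one loss per candidate mask
    min_loss, _ = per_mask.min(dim=1)  # keep only the minimum loss per example
    return min_loss.mean()
```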
Why is `num_mask_tokens = num_multimask_outputs + 1`? And why, when `multimask_output` is used, does the code slice from `(1, None)`?
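For anyone landing here: the slicing in question is in `segment_anything/modeling/mask_decoder.py`. As I understand it, the `+1` exists because token 0 is a dedicated single-mask output (the paper's appendix describes an extra output token used when the prompt is unambiguous, e.g. multiple points), while tokens 1..3 are the whole/part/subpart candidates. Below is a simplified paraphrase of that bookkeeping, not a drop-in copy, so double-check against the source:

```python
import torch
from torch import nn

class MaskDecoderSketch(nn.Module):
    """Simplified paraphrase of the mask-token bookkeeping in SAM's MaskDecoder."""

    def __init__(self, transformer_dim: int = 256, num_multimask_outputs: int = 3):
        super().__init__()
        # +1: token 0 is the dedicated single-mask token; tokens 1..3 are the
        # three ambiguity-resolving candidates (whole / part / subpart).
        self.num_mask_tokens = num_multimask_outputs + 1
        self.mask_tokens = nn.Embedding(self.num_mask_tokens, transformer_dim)

    def select_outputs(self, masks: torch.Tensor, iou_pred: torch.Tensor,
                       multimask_output: bool):
        # masks: (B, num_mask_tokens, H, W); iou_pred: (B, num_mask_tokens)
        if multimask_output:
            mask_slice = slice(1, None)  # keep the 3 candidate masks (tokens 1..3)
        else:
            mask_slice = slice(0, 1)     # keep only the single-mask token's output
        return masks[:, mask_slice, :, :], iou_pred[:, mask_slice]
```

So the `+1` and the `(1, None)` slice are two sides of the same design: one extra token whose prediction is returned when `multimask_output=False`, and skipped otherwise.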