Open jeezrick opened 1 week ago
The design seems to carry over from the v1 model. The paper for that model describes the reasoning in more detail in the appendix (page 17, section "Making the model ambiguity-aware"), where they say:
"Ambiguity is much rarer with multiple prompts and the three output masks will usually become similar. To minimize computation of degenerate losses at training and ensure the single unambiguous mask receives a regular gradient signal, we only predict a single mask when more than one prompt is given"
The 'single mask' they refer to is the one that gets used when `multimask_output=False` (meant for cases where more than 1 prompt is given) and is discarded otherwise.
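To make the selection concrete, here is a minimal sketch of how I understand it. It assumes the decoder produces four masks per prompt, with index 0 being the dedicated single-mask output and indices 1 to 3 the ambiguity-aware outputs. The function name `select_output_masks` and the toy data are made up for illustration and are not the repository's actual code.

```python
from typing import List, Sequence

Mask = List[List[float]]  # one H x W mask of logits (toy stand-in for a tensor)


def select_output_masks(all_masks: Sequence[Mask], multimask_output: bool) -> List[Mask]:
    """Choose which decoder outputs to return.

    Assumption: all_masks[0] is the single-mask output and
    all_masks[1:] are the multimask outputs.
    """
    if multimask_output:
        # Ambiguous prompt (e.g. one click): drop index 0, keep the rest.
        return list(all_masks[1:])
    # Several prompts: keep only the dedicated single-mask output.
    return list(all_masks[0:1])


# Toy data: four 1x1 "masks" whose value equals their index.
all_masks = [[[float(i)]] for i in range(4)]

multi = select_output_masks(all_masks, multimask_output=True)
single = select_output_masks(all_masks, multimask_output=False)
print(len(multi), len(single))
```

So under this reading, the first mask is never returned alongside the other three: it is either the only output or it is dropped.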
Also, in this V2 there is `multimask_output_for_tracking`.
link to code
I wonder why the first mask in masks (the multi-mask output) is simply discarded. Is it because the first mask is only used for single-mask output during training, so it doesn't apply to multimask output at inference? I don't think it's in the paper. Maybe I missed it; does anyone have an answer? Thanks.