shipengai opened this issue 1 year ago
I have asked myself the exact same question while working on the https://github.com/finegrain-ai/refiners library.
You have two different types of inputs: boxes and points.
If you consider the output of `PromptEncoder` and call its output shape (batch, sequence, 256), then sequence behaves like this:

| Input | Sequence length |
|---|---|
| 1 box | 2 (2-point box) |
| N points | N (points) + 1 (not_a_point_embed) |
| 1 box and N points | N (points) + 2 (2-point box) |
The `not_a_point_embed` therefore only affects the N-points case (see the shape check below).
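To make the table above concrete, here is a minimal sketch of those shapes using the upstream segment-anything `PromptEncoder`. The checkpoint path and the coordinate values are assumptions; only the resulting sequence lengths matter.

```python
import torch
from segment_anything import sam_model_registry

# Assumed: a ViT-B SAM checkpoint available locally at this path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
pe = sam.prompt_encoder

# N = 3 points; labels: 1 = foreground, 0 = background (arbitrary example values)
coords = torch.tensor([[[100.0, 200.0], [150.0, 220.0], [300.0, 400.0]]])
labels = torch.tensor([[1, 1, 0]])
box = torch.tensor([[50.0, 60.0, 400.0, 500.0]])  # one XYXY box

sparse, _ = pe(points=None, boxes=box, masks=None)
print(sparse.shape)  # (1, 2, 256): the box becomes 2 corner tokens

sparse, _ = pe(points=(coords, labels), boxes=None, masks=None)
print(sparse.shape)  # (1, 4, 256): 3 points + 1 not_a_point_embed padding token

sparse, _ = pe(points=(coords, labels), boxes=box, masks=None)
print(sparse.shape)  # (1, 5, 256): 3 points + 2 corner tokens, no padding
```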
Why is it necessary to pad the N-points input with a learned token?
Maybe the interpretation of the points is not the same when you have boxes + points vs. points alone. This `not_a_point_embed` gives the information "is there a box in the prompt?" more directly to every point.
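If I read the upstream prompt_encoder.py correctly, the padding point is only appended when no box is given, its positional encoding is zeroed out, and the learned `not_a_point_embed` is added in its place. A quick check (checkpoint path assumed) that the last sparse token of a points-only prompt is exactly that learned vector:

```python
import torch
from segment_anything import sam_model_registry

# Assumed: a ViT-B SAM checkpoint available locally at this path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
pe = sam.prompt_encoder

coords = torch.tensor([[[100.0, 200.0]]])  # a single foreground point
labels = torch.tensor([[1]])

# Points-only prompt: the encoder appends one padding token (label -1).
sparse, _ = pe(points=(coords, labels), boxes=None, masks=None)

# The padding token carries no positional information, only the learned embedding.
print(torch.allclose(sparse[0, -1], pe.not_a_point_embed.weight[0]))  # expected: True
```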
It can also be viewed as an end-of-sequence token, as you have in NLP. This is not needed for a single box since it's not possible to have multiple boxes, and when there are points + a box, the box tokens sit at the end of the sequence and play this role.
This is highly speculative, but it could also act as a kind of register token in the prompt encoder, only used in the N-points-only input situation (see https://arxiv.org/pdf/2309.16588.pdf).
A good starting point would be to compare SAM's mask quality for a box alone vs. N points alone vs. box + N points.
It would also be interesting to check how the results compare, both without and with training, using a non-learned `not_a_point_embed` vs. the learned `not_a_point_embed` (see the sketch below).
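As a concrete starting point for that ablation (a rough sketch, not a rigorous benchmark): zero out `not_a_point_embed` at inference time and compare the masks and predicted IoU scores for a points-only prompt. The checkpoint path and the placeholder image/point values are assumptions.

```python
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

# Assumed: a ViT-B SAM checkpoint available locally at this path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Placeholder inputs; substitute a real RGB image and real point prompts.
image = np.zeros((512, 512, 3), dtype=np.uint8)
coords = np.array([[256.0, 256.0], [300.0, 220.0]])
labels = np.array([1, 1])

predictor.set_image(image)

masks_learned, iou_learned, _ = predictor.predict(
    point_coords=coords, point_labels=labels, multimask_output=False
)

# Crude "non-learned" baseline: replace the learned embedding with zeros.
with torch.no_grad():
    sam.prompt_encoder.not_a_point_embed.weight.zero_()

masks_zeroed, iou_zeroed, _ = predictor.predict(
    point_coords=coords, point_labels=labels, multimask_output=False
)

print(iou_learned, iou_zeroed)  # compare the model's own mask-quality predictions
```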
Please share your thoughts!
Related to #620 #381 #646
If text prompt encoding is used, would the prompt tokens be text tokens + not_a_point_embed?