shipengai opened this issue 1 year ago
I have asked myself the exact same question while working on the https://github.com/finegrain-ai/refiners library.
You have two different types of inputs: boxes and points.
If you consider the output of `PromptEncoder` and call its output shape (batch, sequence, 256), then sequence behaves like this:

| Input | Sequence length |
|---|---|
| 1 box | 2 (2-point box) |
| N points | N (points) + 1 (not_a_point_embed) |
| 1 box and N points | N (points) + 2 (2-point box) |
The `not_a_point_embed` therefore only affects the N-points case (see the shape check below).
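To make the table above concrete, here is a minimal sketch of those shapes using the upstream segment-anything `PromptEncoder`. The checkpoint path and the coordinate values are assumptions; only the resulting sequence lengths matter.

```python
import torch
from segment_anything import sam_model_registry

# Assumed: a ViT-B SAM checkpoint available locally at this path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
pe = sam.prompt_encoder

# N = 3 points; labels: 1 = foreground, 0 = background (arbitrary example values)
coords = torch.tensor([[[100.0, 200.0], [150.0, 220.0], [300.0, 400.0]]])
labels = torch.tensor([[1, 1, 0]])
box = torch.tensor([[50.0, 60.0, 400.0, 500.0]])  # one XYXY box

sparse, _ = pe(points=None, boxes=box, masks=None)
print(sparse.shape)  # (1, 2, 256): the box becomes 2 corner tokens

sparse, _ = pe(points=(coords, labels), boxes=None, masks=None)
print(sparse.shape)  # (1, 4, 256): 3 points + 1 not_a_point_embed padding token

sparse, _ = pe(points=(coords, labels), boxes=box, masks=None)
print(sparse.shape)  # (1, 5, 256): 3 points + 2 corner tokens, no padding
```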
Why is it necessary to pad the N-points input with a learned token?
Maybe the interpretation of the points is not the same when you have boxes + points vs. points alone. This `not_a_point_embed` gives the information "is there a box in the prompt?" more directly to every point.
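If I read the upstream prompt_encoder.py correctly, the padding point is only appended when no box is given, its positional encoding is zeroed out, and the learned `not_a_point_embed` is added in its place. A quick check (checkpoint path assumed) that the last sparse token of a points-only prompt is exactly that learned vector:

```python
import torch
from segment_anything import sam_model_registry

# Assumed: a ViT-B SAM checkpoint available locally at this path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
pe = sam.prompt_encoder

coords = torch.tensor([[[100.0, 200.0]]])  # a single foreground point
labels = torch.tensor([[1]])

# Points-only prompt: the encoder appends one padding token (label -1).
sparse, _ = pe(points=(coords, labels), boxes=None, masks=None)

# The padding token carries no positional information, only the learned embedding.
print(torch.allclose(sparse[0, -1], pe.not_a_point_embed.weight[0]))  # expected: True
```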
It can also be viewed as an end-of-sequence token, as you have in NLP. This is not needed for a single box since it's not possible to have multiple boxes, and when there are points + a box, the box tokens sit at the end of the sequence and play this role.
This is highly speculative, but it could also act as a kind of register token in the prompt encoder, only used in the N-points-only input situation (see https://arxiv.org/pdf/2309.16588.pdf).
A good starting point would be to compare SAM's mask quality for a box alone vs. N points alone vs. box + N points.
It would also be interesting to check how the results compare, both without and with training, using a non-learned `not_a_point_embed` vs. the learned `not_a_point_embed` (see the sketch below).
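As a concrete starting point for that ablation (a rough sketch, not a rigorous benchmark): zero out `not_a_point_embed` at inference time and compare the masks and predicted IoU scores for a points-only prompt. The checkpoint path and the placeholder image/point values are assumptions.

```python
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

# Assumed: a ViT-B SAM checkpoint available locally at this path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Placeholder inputs; substitute a real RGB image and real point prompts.
image = np.zeros((512, 512, 3), dtype=np.uint8)
coords = np.array([[256.0, 256.0], [300.0, 220.0]])
labels = np.array([1, 1])

predictor.set_image(image)

masks_learned, iou_learned, _ = predictor.predict(
    point_coords=coords, point_labels=labels, multimask_output=False
)

# Crude "non-learned" baseline: replace the learned embedding with zeros.
with torch.no_grad():
    sam.prompt_encoder.not_a_point_embed.weight.zero_()

masks_zeroed, iou_zeroed, _ = predictor.predict(
    point_coords=coords, point_labels=labels, multimask_output=False
)

print(iou_learned, iou_zeroed)  # compare the model's own mask-quality predictions
```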
Please share your thoughts!
Related to #620 #381 #646
If text prompt encoding is used, would the prompt tokens be text tokens + not_a_point_embed?