facebookresearch / segment-anything

The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

In class PromptEncoder, what's the usage of "not_a_point_embed"? #509

Open shipengai opened 1 year ago

shipengai commented 1 year ago

If we use text prompt encoding, is the prompt token = text token + not_a_point_embed?

piercus commented 7 months ago

I have asked myself the exact same question while working on the https://github.com/finegrain-ai/refiners library.

My understanding

You have 2 different types of inputs: boxes and points.

Consider the output of PromptEncoder, and call its shape (batch, sequence, 256).

sequence behaves like this:

| Input | Sequence length |
| --- | --- |
| 1 box | 2 (2-point box) |
| N points | N (points) + 1 (not_a_point_embed) |
| 1 box and N points | N (points) + 2 (2-point box) |

not_a_point_embed only affects the N-points (no box) case.
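The rule above can be sketched as a small helper; `prompt_sequence_length` is a hypothetical function written for illustration, not part of the SAM codebase:

```python
def prompt_sequence_length(n_points: int, has_box: bool) -> int:
    """Hypothetical helper mirroring the table above: the sparse-prompt
    sequence length produced by SAM's PromptEncoder."""
    length = n_points
    if has_box:
        length += 2  # a box contributes its two corner embeddings
    elif n_points > 0:
        length += 1  # points alone get padded with not_a_point_embed
    return length
```

So `prompt_sequence_length(3, False)` gives 4, while `prompt_sequence_length(3, True)` gives 5.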

My interpretation

Why does the N-points input need to be padded with a learned token?

Signal more directly that there is no box

Maybe the interpretation of the points is not the same when you have boxes + points vs points alone. not_a_point_embed conveys the information "is there a box in the prompt?" more directly to every point.
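The mechanism behind this can be sketched as follows — a simplified, numpy-only rendering of the padding step inside SAM's `PromptEncoder._embed_points` (the real code works on torch tensors and then adds the positional and type embeddings):

```python
import numpy as np

def pad_points_when_no_box(points: np.ndarray, labels: np.ndarray):
    """Simplified sketch: when the prompt has points but no box, a dummy
    point with label -1 is appended; that slot later receives
    not_a_point_embed instead of a positional encoding."""
    batch = points.shape[0]
    padding_point = np.zeros((batch, 1, 2))  # coordinates are ignored downstream
    padding_label = -np.ones((batch, 1))     # label -1 marks "not a point"
    points = np.concatenate([points, padding_point], axis=1)
    labels = np.concatenate([labels, padding_label], axis=1)
    return points, labels
```

With 3 input points this yields a length-4 sequence, matching the N + 1 row of the table.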

End of string token

This can also be viewed as an end-of-string token, as in NLP. It is not needed for the single-box case, since multiple boxes are not possible, and when there are points + a box, the box embeddings sit at the end of the sequence and play this role.

Register

This is highly speculative, but it can be viewed as a kind of register token in the prompt encoder, used only in the N-points-only input situation (see https://arxiv.org/pdf/2309.16588.pdf).

Experiments

Maybe just start by comparing SAM's segmentation quality for box alone vs N points alone vs box + N points.

It would be interesting to check how the results compare, both without and with training, for a non-learned not_a_point_embed vs a learned not_a_point_embed.
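For the "non-learned" arm of that ablation, one could zero the token and stop its gradients. A minimal sketch, using a tiny stand-in module rather than the real PromptEncoder (in SAM, not_a_point_embed is an `nn.Embedding(1, embed_dim)`):

```python
import torch
import torch.nn as nn

class TinyPromptEncoder(nn.Module):
    """Hypothetical stand-in for SAM's PromptEncoder, keeping only the
    learned prompt tokens relevant to the experiment."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.not_a_point_embed = nn.Embedding(1, embed_dim)
        self.point_embeddings = nn.ModuleList(
            nn.Embedding(1, embed_dim) for _ in range(4)
        )

def freeze_not_a_point(encoder: nn.Module) -> None:
    """'Non-learned' variant: zero the token and exclude it from training."""
    emb = encoder.not_a_point_embed
    with torch.no_grad():
        emb.weight.zero_()
    emb.weight.requires_grad = False
```

The same `requires_grad = False` trick would apply to the real `sam.prompt_encoder.not_a_point_embed` when fine-tuning.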

Please share your thoughts!

piercus commented 7 months ago

Related to #620 #381 #646