facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Compute mask embeddings and similarities #405

Open tcourat opened 1 month ago

tcourat commented 1 month ago

Hi,

I wonder what the proper way is to compute the embedding of a manually prompted mask (e.g. with a few points), in order to do similarity matching against automatically labelled masks in the rest of the image, and even in other images.

In the SAMv1 paper, section "D.6. Probing the Latent Space of SAM", it is said: "we compute mask embeddings by extracting an image embedding from SAM from an image crop around a mask and its horizontally flipped version, multiplying the image embedding by the binary mask, and averaging over spatial locations." This is also discussed in https://github.com/facebookresearch/segment-anything/issues/283.

Should we keep the same protocol for SAM v2?
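
For reference, here's a rough sketch of how I read the v1 protocol, in plain PyTorch (the function and variable names are made up, and pulling the (C, H, W) embedding out of the model isn't shown):

```python
import torch
import torch.nn.functional as F

def mask_pooled_embedding(image_embed: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average an image embedding over the spatial locations covered by a binary mask.

    image_embed: (C, H, W) features for the (cropped) image, e.g. from the image encoder
    mask:        (H_img, W_img) binary mask for the object, at image resolution
    returns:     (C,) pooled mask embedding
    """
    # Downsample the mask to the feature-map resolution
    mask_small = F.interpolate(
        mask[None, None].float(), size=image_embed.shape[-2:], mode="nearest"
    )[0, 0]
    # Multiply the embedding by the mask, then average over the masked locations
    masked = image_embed * mask_small
    return masked.sum(dim=(1, 2)) / mask_small.sum().clamp(min=1e-6)

# Following the paper's wording, the same pooling would also be run on the
# horizontally flipped crop and the two embeddings averaged, e.g.:
# emb = 0.5 * (mask_pooled_embedding(embed, mask)
#            + mask_pooled_embedding(embed_flipped, torch.flip(mask, dims=[-1])))
```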

heyoeyo commented 1 week ago

From playing around with this a bit, I'd say the approach from SAMv1 is mostly usable with v2. However, there are a few things I'd add (a rough sketch of the similarity step follows the list):

  1. The cropping step is not necessary and may even be detrimental
  2. Horizontal flipping isn't strictly needed and is not always helpful
  3. In the original discussion, it's mentioned that the features are taken directly from the backbone. This works, but the regular features (e.g. the output of the image encoder) can also work
  4. SAMv2 generates multiple sets of features (at different resolutions), whereas v1 only has one. For the backbone-only features, the lowest + second-lowest resolution features tend to work best, while from the image encoder only the lowest-resolution features tend to work
  5. The smaller-sized models often produce better results, with fewer artifacts
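
As a rough sketch of the similarity computation (assuming `query_embed` is the mask-pooled embedding of the prompted object and `target_feats` is a (C, H, W) feature map taken from the backbone or image encoder; the names are placeholders and the feature-extraction step itself isn't shown):

```python
import torch
import torch.nn.functional as F

def similarity_map(query_embed: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a mask-pooled embedding and every location of a feature map.

    query_embed:  (C,) pooled embedding of the prompted mask
    target_feats: (C, H, W) features of the image being searched
    returns:      (H, W) similarity map, values in [-1, 1]
    """
    q = F.normalize(query_embed, dim=0)    # unit-length query vector
    t = F.normalize(target_feats, dim=0)   # unit-length vector at each (h, w) location
    return torch.einsum("c,chw->hw", q, t)

# If two feature resolutions are used (e.g. lowest + second-lowest backbone outputs),
# one option is to upsample each similarity map to a common size and average:
# sim_lo_up = F.interpolate(sim_lo[None, None], size=image_hw, mode="bilinear")[0, 0]
# sim_hi_up = F.interpolate(sim_hi[None, None], size=image_hw, mode="bilinear")[0, 0]
# sim = 0.5 * (sim_lo_up + sim_hi_up)
```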

Here's an example of the result on a circuit board picture (left) showing the similarity map (right), using the lowest-resolution 'backbone-only' features with the v2.1 tiny model, no h-flip or cropping; red areas indicate high similarity:

[Image: similarity_example]