facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Compute mask embeddings and similarities #405

Open tcourat opened 1 month ago

tcourat commented 1 month ago

Hi,

I wonder what the proper way is to compute the embedding of a manually prompted mask (e.g. with a few points), in order to do similarity matching against automatically labelled masks in the rest of the image, and even in other images.

In the SAMv1 paper, section "D.6. Probing the Latent Space of SAM", it is said: "we compute mask embeddings by extracting an image embedding from SAM from an image crop around a mask and its horizontally flipped version, multiplying the image embedding by the binary mask, and averaging over spatial locations." This is also discussed in https://github.com/facebookresearch/segment-anything/issues/283.

Should we keep the same protocol for SAM v2?
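
For reference, here's a rough sketch of how I read the v1 protocol, in plain PyTorch (the function and variable names are made up, and pulling the (C, H, W) embedding out of the model isn't shown):

```python
import torch
import torch.nn.functional as F

def mask_pooled_embedding(image_embed: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average an image embedding over the spatial locations covered by a binary mask.

    image_embed: (C, H, W) features for the (cropped) image, e.g. from the image encoder
    mask:        (H_img, W_img) binary mask for the object, at image resolution
    returns:     (C,) pooled mask embedding
    """
    # Downsample the mask to the feature-map resolution
    mask_small = F.interpolate(
        mask[None, None].float(), size=image_embed.shape[-2:], mode="nearest"
    )[0, 0]
    # Multiply the embedding by the mask, then average over the masked locations
    masked = image_embed * mask_small
    return masked.sum(dim=(1, 2)) / mask_small.sum().clamp(min=1e-6)

# Following the paper's wording, the same pooling would also be run on the
# horizontally flipped crop and the two embeddings averaged, e.g.:
# emb = 0.5 * (mask_pooled_embedding(embed, mask)
#            + mask_pooled_embedding(embed_flipped, torch.flip(mask, dims=[-1])))
```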

heyoeyo commented 1 week ago

From playing around with this a bit, I'd say the approach from SAMv1 is mostly usable with v2. However, there are a few things I'd add (a rough sketch of the similarity step follows the list):

  1. The cropping step is not necessary and may even be detrimental
  2. Horizontal flipping isn't strictly needed and is not always helpful
  3. In the original discussion, it's mentioned that the features are taken directly from the backbone. This works, but the regular features (e.g. the output of the image encoder) can also work
  4. SAMv2 generates multiple sets of features (at different resolutions), whereas v1 only has one. For the backbone-only features, the lowest + second-lowest resolution features tend to work best, while from the image encoder only the lowest-resolution features tend to work
  5. The smaller-sized models often produce better results, with fewer artifacts
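
As a rough sketch of the similarity computation (assuming `query_embed` is the mask-pooled embedding of the prompted object and `target_feats` is a (C, H, W) feature map taken from the backbone or image encoder; the names are placeholders and the feature-extraction step itself isn't shown):

```python
import torch
import torch.nn.functional as F

def similarity_map(query_embed: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a mask-pooled embedding and every location of a feature map.

    query_embed:  (C,) pooled embedding of the prompted mask
    target_feats: (C, H, W) features of the image being searched
    returns:      (H, W) similarity map, values in [-1, 1]
    """
    q = F.normalize(query_embed, dim=0)    # unit-length query vector
    t = F.normalize(target_feats, dim=0)   # unit-length vector at each (h, w) location
    return torch.einsum("c,chw->hw", q, t)

# If two feature resolutions are used (e.g. lowest + second-lowest backbone outputs),
# one option is to upsample each similarity map to a common size and average:
# sim_lo_up = F.interpolate(sim_lo[None, None], size=image_hw, mode="bilinear")[0, 0]
# sim_hi_up = F.interpolate(sim_hi[None, None], size=image_hw, mode="bilinear")[0, 0]
# sim = 0.5 * (sim_lo_up + sim_hi_up)
```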

Here's an example of the result on a circuit board picture (left) showing the similarity map (right), using the lowest-resolution 'backbone-only' features with the v2.1 tiny model, no h-flip or cropping; red areas indicate high similarity:

[Image: similarity_example]