Open tcourat opened 1 month ago
Hi,

I wonder what the proper way is to compute the embedding of a manually prompted mask (e.g. with a few points), in order to do similarity matching against automatically labelled masks in the rest of the image, and even in other images.

In the SAM v1 paper, section "D.6. Probing the Latent Space of SAM", it says: "we compute mask embeddings by extracting an image embedding from SAM from an image crop around a mask and its horizontally flipped version, multiplying the image embedding by the binary mask, and averaging over spatial locations." This is also discussed in https://github.com/facebookresearch/segment-anything/issues/283

Should we keep the same protocol for SAM v2?
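For concreteness, my reading of that v1 protocol in code would be roughly the sketch below (assuming a loaded `SamPredictor` from the segment-anything repo; the `mask_embedding` helper and the padding bookkeeping are my own illustration, not code from the authors):

```python
import numpy as np
import torch
import torch.nn.functional as F
from segment_anything import SamPredictor

def mask_embedding(predictor: SamPredictor, image: np.ndarray, mask: np.ndarray) -> torch.Tensor:
    """Embed a binary mask: crop around it, masked-average the SAM image embedding,
    and average with the horizontally flipped version (SAM v1, appendix D.6)."""
    ys, xs = np.where(mask > 0)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop, crop_mask = image[y0:y1, x0:x1], mask[y0:y1, x0:x1]

    embs = []
    for flip in (False, True):
        img = crop[:, ::-1] if flip else crop
        m = crop_mask[:, ::-1] if flip else crop_mask
        predictor.set_image(np.ascontiguousarray(img))
        emb = predictor.get_image_embedding()               # (1, 256, 64, 64)
        # SAM v1 resizes the longest side to 1024 and pads bottom/right, so only
        # the top-left part of the 64x64 grid corresponds to the crop.
        in_h, in_w = predictor.input_size
        enc_size = predictor.model.image_encoder.img_size    # 1024
        gh, gw = emb.shape[-2:]
        fh = max(1, round(gh * in_h / enc_size))
        fw = max(1, round(gw * in_w / enc_size))
        m_t = torch.from_numpy(np.ascontiguousarray(m)).float()[None, None]
        grid = torch.zeros(1, 1, gh, gw, device=emb.device)
        grid[..., :fh, :fw] = F.interpolate(m_t, size=(fh, fw), mode="nearest").to(emb.device)
        # Multiply the embedding by the binary mask and average over spatial locations.
        embs.append((emb * grid).sum(dim=(-2, -1)) / grid.sum().clamp(min=1))
    return torch.stack(embs).mean(dim=0)                     # (1, 256)
```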
From playing around with this a bit, I'd say the approach from SAM v1 is mostly usable with v2. However, there are a few things I'd add:

Here's an example of the result on a circuit-board picture (left) and the similarity map (right), computed from the lowest-resolution 'backbone-only' features of the v2.1 tiny model, with no h-flip or cropping; red areas are high similarity.
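The pipeline is roughly the sketch below (not the exact script: the cosine-similarity measure, the masked-average query vector, and the config/checkpoint paths are assumptions on my side, and I take the lowest-resolution image embedding exposed by `SAM2ImagePredictor` as the "backbone-only" features):

```python
import numpy as np
import torch
import torch.nn.functional as F
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Paths may differ depending on how the sam2 repo and checkpoints are installed.
predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_t.yaml", "checkpoints/sam2.1_hiera_tiny.pt")
)

def lowres_features(image: np.ndarray) -> torch.Tensor:
    """Lowest-resolution image embedding for one RGB image, shape (C, H', W')."""
    predictor.set_image(image)
    return predictor.get_image_embedding()[0]            # e.g. (256, 64, 64) at 1024 input

def similarity_map(query_image: np.ndarray, query_mask: np.ndarray,
                   target_image: np.ndarray) -> torch.Tensor:
    """Cosine similarity between the masked-average query embedding and every
    spatial location of the target image's embedding, upsampled to image size."""
    feats_q = lowres_features(query_image)
    m = torch.from_numpy(query_mask.astype(np.float32))[None, None]
    m = F.interpolate(m, size=feats_q.shape[-2:], mode="nearest")[0, 0].to(feats_q.device)
    query = (feats_q * m).sum(dim=(-2, -1)) / m.sum().clamp(min=1)       # (C,)

    feats_t = lowres_features(target_image)
    sim = F.cosine_similarity(feats_t, query[:, None, None], dim=0)       # (H', W')
    return F.interpolate(sim[None, None], size=target_image.shape[:2],
                         mode="bilinear", align_corners=False)[0, 0]
```

As far as I can tell, SAM 2's image predictor resizes the input to a square 1024x1024 without padding (unlike v1), so downsampling the prompted mask straight to the embedding grid lines up without the extra bookkeeping needed for v1.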