aim-uofa / Matcher

[ICLR'24] Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
https://arxiv.org/abs/2305.13310

Patch Level #6

Closed zzzyzh closed 1 year ago

zzzyzh commented 1 year ago

Thank you for your outstanding work!

Can you please describe how patch-level features are generated and how they are sized? Also, I'd like to ask what the center prompt means and how the model generates it.

Your excellent work will be a great help to my research!

yangliu96 commented 1 year ago

Thank you for your interest!

For example, given an image of size (518, 518), the image encoder (we adopt DINOv2, a pre-trained ViT model, by default) with a patch size of 14 $\times$ 14 encodes the image into patch-level features of size (518/14, 518/14) = (37, 37). You can find the details of DINOv2 here.
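
To make the shape arithmetic concrete, here is a minimal sketch using the public DINOv2 `torch.hub` entry point. The model name `dinov2_vitl14` and the dummy input are illustrative assumptions, not necessarily Matcher's exact configuration.

```python
import torch

# Load a DINOv2 backbone from torch.hub (ViT-L/14 chosen for illustration;
# Matcher's actual config may use a different variant).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
model.eval()

img = torch.randn(1, 3, 518, 518)            # input side must be a multiple of 14
with torch.no_grad():
    feats = model.forward_features(img)      # dict of token outputs
patch_tokens = feats['x_norm_patchtokens']   # (1, 37*37, C) patch tokens

h = w = 518 // 14                            # 37 patches per side
patch_map = patch_tokens.reshape(1, h, w, -1)
print(patch_map.shape)                       # torch.Size([1, 37, 37, 1024])
```

The (37, 37) spatial grid is what the patch-level matching operates on: each of the 37 × 37 positions carries one feature vector for its 14 × 14 image patch.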

The center prompts are point prompts that encourage SAM to segment the whole object rather than only a part of it. First, we obtain the matched points by Patch-level Matching. Then, we cluster the matched points by their locations into K clusters with k-means++ (other clustering algorithms can also be used), which gives K cluster centers. The center prompts are sampled from these cluster centers.
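
Below is a minimal sketch of that clustering step using scikit-learn's k-means++ initialization. The matched points, the value of K, and the snapping of each center to its nearest matched point are illustrative assumptions, not Matcher's exact sampling procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (x, y) coordinates of matched points from patch-level matching.
matched_points = np.random.randint(0, 518, size=(200, 2)).astype(float)

K = 8  # number of clusters (illustrative value, not Matcher's default)
kmeans = KMeans(n_clusters=K, init='k-means++', n_init=10).fit(matched_points)
centers = kmeans.cluster_centers_            # (K, 2) cluster centers

# One plausible sampling strategy: snap each cluster center to its nearest
# matched point so the resulting point prompt lies on the object.
dists = np.linalg.norm(matched_points[None, :, :] - centers[:, None, :], axis=-1)
center_prompts = matched_points[dists.argmin(axis=1)]  # (K, 2) point prompts for SAM
```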

zzzyzh commented 1 year ago

Thank you for your patience and kindness!