zzzyzh closed this issue 1 year ago
Thank you for your interest!
For example, given an image of size (518, 518), the image encoder (we adopt DINOv2, a pre-trained ViT model, by default) with a 14 $\times$ 14 patch size encodes the image into a patch-level feature map of size (518/14, 518/14) = (37, 37). You can find the details of DINOv2 here.
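As a quick sanity check on the sizes, here is a sketch of the patch-grid arithmetic for a 518 $\times$ 518 input with a patch size of 14. This is generic ViT behavior, not code from this repo:

```python
# Sketch: how a ViT-style encoder with patch_size = 14 tiles a 518 x 518 image.
image_size = 518
patch_size = 14

# Each side is divided into non-overlapping 14 x 14 patches.
grid = image_size // patch_size   # 518 / 14 = 37
num_patches = grid * grid         # 37 * 37 = 1369 patch tokens

print((grid, grid))   # (37, 37) patch-level feature map
print(num_patches)    # 1369
```

Note that 518 was chosen so that it divides evenly by 14; with other resolutions the image is typically resized or padded first.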
The center prompts are point prompts that encourage SAM to segment the object covering the whole image. First, we obtain matched points via Patch-level Matching. Then, we cluster the matched points by location into K clusters with k-means++ (other clustering algorithms also work), which yields K cluster centers. The center prompts are sampled from these cluster centers.
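A minimal sketch of the clustering step, assuming `matched_points` is an (N, 2) array of pixel coordinates produced by patch-level matching and using scikit-learn's k-means++ initialization; the variable names and the value of `K` are illustrative, not taken from the repo:

```python
import numpy as np
from sklearn.cluster import KMeans

# Dummy matched points; in practice these come from patch-level matching.
rng = np.random.default_rng(0)
matched_points = rng.uniform(0, 518, size=(200, 2))  # (x, y) pixel coordinates

K = 3  # illustrative number of clusters
kmeans = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0)
kmeans.fit(matched_points)

# The K cluster centers; center prompts are sampled from these locations
# and fed to SAM as point prompts.
centers = kmeans.cluster_centers_
print(centers.shape)  # (3, 2)
```

Each center is then a single (x, y) point prompt, so the K prompts together spread over the matched region rather than concentrating on one spot.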
Thank you for your patience and kindness!
Thank you for your outstanding work!
Can you please describe how patch-level features are generated and how they are sized? Also, I'd like to ask what the center prompt means and how the model generates it.
Your excellent work will be a great help to my research!