Closed ChunmengLiu1 closed 10 months ago
I found this work very impressive, especially the combination of SAM and CLIP, but unfortunately there is no clear description of how this fusion is done. I kindly ask the maintainers to provide some information. Thank you so much in advance!
Sorry for the late reply.
For @ChunmengLiu1's questions:
Q1: In the CLIP with Segment Anything demo, did you simply replace MaskFormer with SAM? A: Yes, we treat MaskFormer and SAM in the same way: both are used to generate class-agnostic proposals.
Q2: In SAM, what kind of prompt do you use? Find everything in the image? A: Yes, that's correct. Our demo uses the same settings as SamAutomaticMaskGenerator, with one modification: we set points_per_batch=16 to save memory.
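For readers who want to reproduce the Q2 setting, here is a minimal sketch using the public segment_anything package. Only points_per_batch=16 comes from the answer above; the helper name build_mask_generator, the "vit_h" model type, and the checkpoint path are assumptions made for illustration, not necessarily what the demo uses.

```python
# Only points_per_batch=16 is taken from the reply above; every other
# option is left at the library default.
GENERATOR_KWARGS = {"points_per_batch": 16}


def build_mask_generator(checkpoint_path, model_type="vit_h"):
    """Hypothetical helper: build SAM's automatic ("segment everything") generator.

    model_type and checkpoint_path are assumptions; use whichever SAM
    checkpoint you have downloaded.
    """
    # Imported here so the module can be loaded without segment_anything installed.
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    return SamAutomaticMaskGenerator(sam, **GENERATOR_KWARGS)
```

Calling build_mask_generator(path).generate(image) on an HxWx3 uint8 RGB array returns a list of dicts, each with a boolean "segmentation" mask. A smaller points_per_batch lowers peak GPU memory at the cost of more forward passes.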
Q3: After that, do you do something to select the better-scored proposals in an image based on the number of classes? A: After obtaining the score of each proposal for each class, we initially tried keeping only the best-scored proposal, as in MaskFormer. However, we found that the SAM masks are very fine-grained: given a chair class, SAM can return a single leg or a single arm of the chair. We therefore use a granularity variable to merge the proposals. By setting granularity to a lower value, we can merge the legs and arms into one chair object. Note that the granularity setting is experimental and does not guarantee the granularity we want.
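To make the merging idea in Q3 concrete, here is a rough NumPy sketch. The reply does not spell out the exact rule, so the rule below is an assumption: for each class, keep every proposal whose score is at least granularity times the best score for that class, then take the union of the kept masks. The function name merge_proposals and the array layout are hypothetical.

```python
import numpy as np


def merge_proposals(masks, scores, granularity=0.8):
    """Merge class-agnostic proposals into one mask per class.

    masks:  (N, H, W) boolean proposal masks (e.g. from SAM).
    scores: (N, C) score of each proposal for each class (e.g. from CLIP).
    Assumed rule (not confirmed by the thread): a proposal is kept for a
    class when its score >= granularity * best score for that class.
    """
    masks = np.asarray(masks, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    num_classes = scores.shape[1]
    merged = np.zeros((num_classes,) + masks.shape[1:], dtype=bool)
    for c in range(num_classes):
        threshold = granularity * scores[:, c].max()
        keep = scores[:, c] >= threshold
        # Union of all kept proposals for this class.
        merged[c] = masks[keep].any(axis=0)
    return merged


# Toy example: two "chair part" proposals and one "floor" proposal.
masks = np.array([
    [[1, 0], [0, 0]],   # chair leg
    [[0, 1], [0, 0]],   # chair arm
    [[0, 0], [1, 1]],   # floor
], dtype=bool)
scores = np.array([
    [0.90, 0.05],
    [0.80, 0.10],
    [0.10, 0.85],
])
merged_loose = merge_proposals(masks, scores, granularity=0.8)
merged_strict = merge_proposals(masks, scores, granularity=1.0)
```

Under this assumed rule, granularity=1.0 keeps only the best-scored proposal per class, while a lower value pulls in neighbouring parts such as the leg and the arm.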
For @halqadasi's question:
Q4: There is no clear way to do this fusion. A: Please see the answer to Q3.
Hi! Thank you for the interesting work! In the CLIP with Segment Anything demo, did you simply replace MaskFormer with SAM? In SAM, what kind of prompt do you use? Find everything in the image? After that, do you do something to select the better-scored proposals in an image based on the number of classes, or do you use another prompt? Thank you for your reply! Best