facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Multi-objects' masks from SAM predictor #463

Open MLDeS opened 1 year ago

MLDeS commented 1 year ago

Thanks for the wonderful work. I have a question, apologies if I have missed the information.

As I understand it, SAM is class-agnostic. Is it correct to assume that SAM does not perform multi-object segmentation as such, but rather binary segmentation, and that the "different" masks the SAM predictor produces for different objects in an image (I am not talking about the multiple masks returned with confidence scores) are essentially generated sequentially? In other words, does the SAM predictor loop over the different objects' prompts (point prompts, bbox prompts, etc.), treat each one as a separate entity, and generate a binary mask (foreground vs. background) for each object individually?

swframe commented 1 year ago

There are several really good videos on YouTube that explain how it works. I liked this one: https://www.youtube.com/watch?v=OhxJkqD1vuE. He explains it so well that I won't try to repeat it here.

MLDeS commented 1 year ago

Thanks for pointing to the video; the explanation is indeed great. However, my question was not exactly about how the binary segmentation masks are generated; it goes beyond what the video covers. The video is confined to how SAM produces masks for a single object. My question is about the multi-object masks we see: are they generated sequentially, by looping over the different objects' prompts and producing a binary segmentation mask for each object separately (which I presume is the case, since SAM is class-agnostic)? This leads me to wonder what the SAM model actually learns, because it does not look like it learns to separate different instances, does it? Would be nice to have an explanation.

heyoeyo commented 1 year ago

For a single prompt, if you give both points and a bounding box, they are combined into a single input to the model (i.e. they are not handled sequentially). You can see this happening inside the predict_torch function, which itself calls the prompt encoder, where the point/box encodings are concatenated together into the sparse_embeddings. These embeddings go on to the mask decoder for further processing into tokens, but they are treated as a single block of data, not as separate inputs.
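
As a rough illustration (an untested sketch using the public SamPredictor API; the checkpoint path, image path and coordinates below are just placeholders), a point and a box for the same object get passed together in one predict call:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder model type / checkpoint path.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# One foreground point and one box for the SAME object; both are encoded by the
# prompt encoder and concatenated into a single sparse embedding, so the mask
# decoder sees them as one combined prompt.
point_coords = np.array([[350, 220]])   # (N, 2) in (x, y) pixel coordinates
point_labels = np.array([1])            # 1 = foreground, 0 = background
box = np.array([300, 180, 420, 300])    # (x1, y1, x2, y2)

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    box=box,
    multimask_output=False,  # a single mask for the combined prompt
)
# masks has shape (1, H, W): one binary mask for the prompted object.
```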

On the other hand, if you're talking about the automatic mask generator stuff, that does seem to process the image using a grid of point prompts (sequentially), and then does some extra processing to clean up the combined results, though I'm not really familiar with the details there.
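
If it helps, the usage looks roughly like this (again just a sketch, reusing the image from the snippet above; the thresholds shown are roughly the library defaults, not a recommendation):

```python
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path

# The generator prompts the model with a regular grid of single-point prompts
# (points_per_side x points_per_side) and then filters and deduplicates the
# resulting masks (predicted-IoU threshold, stability score, NMS on boxes).
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # density of the point grid
    pred_iou_thresh=0.88,         # drop masks the model itself rates poorly
    stability_score_thresh=0.95,  # drop masks that are unstable to thresholding
)

masks = mask_generator.generate(image)  # image: HxWx3 uint8 RGB array
# Each entry comes from an independent single-point prompt, e.g.:
# {'segmentation': HxW bool array, 'area': ..., 'bbox': ..., 'predicted_iou': ..., ...}
```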

As for what it learns, I think it's fair to say it's trained to behave like a very good 'magic wand tool'. It learns to guess what part of an image you're interested in, given a minimal hint (e.g. single point on an object in the image, or bounding box around something in the image).
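
So if you want masks for several different objects, you (or the automatic mask generator) just prompt the model once per object, and each prediction is an independent foreground-vs-background mask. A rough sketch, reusing the predictor from the snippet above, with made-up point coordinates:

```python
import numpy as np

# Hypothetical prompts: one foreground point per object of interest.
object_points = [
    np.array([[120,  80]]),
    np.array([[400, 260]]),
    np.array([[530, 310]]),
]

per_object_masks = []
for pts in object_points:
    # Each call is a separate foreground-vs-background prediction;
    # the model never sees the other objects' prompts.
    masks, scores, _ = predictor.predict(
        point_coords=pts,
        point_labels=np.array([1]),
        multimask_output=True,   # 3 candidate masks at different granularities
    )
    per_object_masks.append(masks[np.argmax(scores)])  # keep the best-scoring one
```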

madhu-basavanna commented 10 months ago

I'm also looking for an answer to the same question. If you find anything else, please share.