facebookresearch / segment-anything

The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

How do I get the single most accurate foreground mask? #668

Open hanjoonwon opened 7 months ago

hanjoonwon commented 7 months ago

I want to use this script: python scripts/amg.py --checkpoint /home/joonwon/segment-anything/checkpoint/sam_vit_h_4b8939.pth --model-type vit_h. I want to get a foreground mask for each of my 52 images, but when I run the existing code on Ubuntu (Anaconda), it generates too many masks and it is hard to combine them into a single one. How can I get the single most accurate mask per image?

heyoeyo commented 7 months ago

It's generally tricky to get something like 'the most accurate mask' since that's subjective. Taken very literally, you could modify the amg code, right after line 221:

masks = generator.generate(image)
# Keep only the mask the model itself scores as most accurate
masks = [max(masks, key=lambda m: m["predicted_iou"])]

That should give you a single mask with the highest IoU prediction (and therefore the one considered 'most accurate' by the model itself). However, this may not match your idea of which mask is best/most accurate. If you're specifically looking for foreground elements, you may prefer the largest mask, which you can get by changing the code to something like:

masks = generator.generate(image)
# Keep only the mask covering the largest pixel area
masks = [max(masks, key=lambda m: m["area"])]

Again though, this may not match up with your preferences.
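If you'd rather not modify amg.py itself, the same idea can be run as a small standalone script. Here's a minimal sketch (the checkpoint and image paths are placeholders for your own, and you can swap the "predicted_iou" key for "area"):

import cv2

from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Placeholder paths - point these at your own checkpoint and image
CHECKPOINT = "checkpoint/sam_vit_h_4b8939.pth"
IMAGE_PATH = "images/example.jpg"

# Build the ViT-H model and the automatic mask generator
sam = sam_model_registry["vit_h"](checkpoint=CHECKPOINT)
# sam.to("cuda")  # uncomment if a GPU is available
generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 image (OpenCV loads BGR, so convert)
image = cv2.cvtColor(cv2.imread(IMAGE_PATH), cv2.COLOR_BGR2RGB)

# Generate all candidate masks, then keep the one the model scores highest
masks = generator.generate(image)
best = max(masks, key=lambda m: m["predicted_iou"])  # or m["area"] for the largest

# best["segmentation"] is a boolean HxW array; save it as a black/white PNG
cv2.imwrite("best_mask.png", best["segmentation"].astype("uint8") * 255)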

For 52 images, I think it would only take a few minutes to manually generate good masks by giving box/point prompts (as opposed to using the automatic mask generator) using a UI (like this one? I haven't actually tried it, but it looks like it could work).
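If you'd rather prompt from a script than a UI, the SamPredictor interface in this repo accepts box/point prompts directly. A rough sketch (the box coordinates and paths here are made-up placeholders):

import cv2
import numpy as np

from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="checkpoint/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("images/example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Rough bounding box around the foreground object, in XYXY pixel coordinates
# (placeholder numbers - adjust per image)
box = np.array([100, 50, 400, 380])

# multimask_output=False asks for a single mask for this prompt
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
cv2.imwrite("prompted_mask.png", masks[0].astype("uint8") * 255)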

If it needs to be automated, then it might be better to try something like grounded SAM, which lets you provide a text prompt to specify what you want segmented. Or otherwise, for foreground stuff specifically, maybe using a depth-prediction model like MiDaS or Zoe (and thresholding the depth map to only get elements 'close to the camera') could work?
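For the depth idea, here's one way it could be combined with the automatic masks; this is only a sketch under my own assumptions (the helper name, the threshold, and the convention that larger depth values mean 'closer' are made up here, not anything from this repo): predict a depth map however you like, threshold it to get a 'near the camera' region, then keep the SAM mask that overlaps that region the most.

import numpy as np

def pick_foreground_mask(masks, depth_map, near_fraction=0.25):
    # masks: list of dicts from SamAutomaticMaskGenerator.generate()
    # depth_map: HxW float array where larger values mean closer to the camera
    #            (flip the comparison below if your depth model is the opposite)
    # near_fraction: fraction of the depth range treated as foreground

    # Threshold the depth map to get a rough foreground region
    depth_range = depth_map.max() - depth_map.min()
    foreground = depth_map >= depth_map.max() - near_fraction * depth_range

    # Keep the mask whose segmentation overlaps that region the most
    return max(masks, key=lambda m: np.logical_and(m["segmentation"], foreground).sum())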