facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Different behaviour in hosted demo than local #252

Open kevinpl07 opened 3 weeks ago

kevinpl07 commented 3 weeks ago

I'm working on an application similar to your hosted demo. When I use the same objectId for two different objects, it still works great in the demo, but not at all when I run the model myself.

It looks good for a single point:

[Screenshot 2024-08-22 at 17 34 29]

But for one point on the human and one point on the car it looks like this:

[Screenshot 2024-08-22 at 17 33 07]

Meanwhile in your demo:

[Screenshot 2024-08-22 at 17 33 14]

What is the reason for that behaviour? I understand that we should just use separate objectIds, but I'm nevertheless interested in why it looks so much better in your demo.

EDIT: Actually, there is a very valid use case for merging the objects. If you want to classify everything as either foreground or background, using two separate objects leaves gaps between them; with a single object, those gaps are automatically closed.
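For concreteness, this is roughly what my two-clicks-one-objectId setup looks like (a simplified sketch of the video predictor flow; the paths, frame index, and click coordinates are placeholders):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./frames")  # directory of JPEG frames

    # one positive click on the human and one on the car,
    # both passed under the SAME object id
    points = np.array([[210, 350], [480, 300]], dtype=np.float32)  # placeholder coords
    labels = np.array([1, 1], dtype=np.int32)  # 1 = positive click
    frame_idx, obj_ids, masks = predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
    )
```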

Any insight is greatly appreciated!

heyoeyo commented 3 weeks ago

From what I've seen, there are significant differences between the base/tiny vs. large/small models. If you were using the large model, you may find the base model will work better for this sort of thing (in general the base model seems to more easily segment 'whole object' masks). Of course the web demo may be using yet another (unreleased) model, which could also explain the differences in behavior.
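Switching between them is just a matter of loading a different config + checkpoint pair, e.g. (assuming the released checkpoints from this repo's download script):

```python
from sam2.build_sam import build_sam2_video_predictor

# the released checkpoints (tiny / small / base_plus / large) each pair
# with a matching config; swap both to try a different model size
checkpoint = "./checkpoints/sam2_hiera_base_plus.pt"  # instead of sam2_hiera_large.pt
model_cfg = "sam2_hiera_b+.yaml"                      # instead of sam2_hiera_l.yaml
predictor = build_sam2_video_predictor(model_cfg, checkpoint)
```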

Also, the masking can be very sensitive when using multiple prompts, so if the prompts aren't identically placed, you can get very different masks. For example, here's the mask (using the large model) as you adjust the second point prompt (the mouse cursor):

[Animation: promptexample]

You can see there's a small region just left of the shadow 'corner' in the truck bed that gives the full segmentation, but further left or right is a bit of a mess.
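If you want to reproduce this kind of sensitivity check, one way is to fix the first click and sweep the second one across the image (a rough sketch using the image predictor; the image path and coordinates are placeholders):

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(np.array(Image.open("truck.jpg").convert("RGB")))  # placeholder image
    first_point = [210, 350]  # fixed click on the first object (placeholder)
    for x in range(400, 560, 20):  # slide the second click horizontally
        points = np.array([first_point, [x, 300]], dtype=np.float32)
        labels = np.array([1, 1], dtype=np.int32)
        masks, scores, _ = predictor.predict(
            point_coords=points, point_labels=labels, multimask_output=False
        )
        # the mask area jumps around as the second point moves
        print(x, masks[0].sum())
```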

kevinpl07 commented 3 weeks ago

@heyoeyo

Thanks for your input!

I will play around with the other models. I see what you mean about the sensitivity, but I tried it about five times with different sets of two points, and the demo had consistently good results while the raw model always had issues. So I don't think that's the problem here.

heyoeyo commented 3 weeks ago

the demo had consistently good results while the raw model had always issues

Yes, this does seem to be the case! The web demo eventually shows similar problems when adding more points, which I'd guess means it's just a better model rather than some kind of clever pre-processing. There seemed to be a similar sentiment for the SAMv1 models vs. the demo, so maybe it's just a policy thing that they reserve the best model for internal use.

chayryali commented 2 weeks ago

Hi @kevinpl07, thanks for sharing your observations! When clicks are placed across multiple objects, the expected behavior is not well-defined.

You mentioned that one use case of such a feature could be to capture the gaps between the objects - in this example, it's not clear to me which gap should be closed. And how would we determine the "span" of the gaps to be closed, e.g. is the entirety of the rest of the image a valid closure?

Consider a different example: say there are two groups of people in an image, and we place two clicks on two individuals in one group. Are we selecting just the two individuals? Are we selecting two shirts? Are we selecting the whole group? Are we selecting both groups? Are we selecting the whole image? The prompt is ambiguous, and what the "correct" mask should be is not well-defined.

Can you try visualizing the masks from multimask output? One of those three masks might be what you are looking for.
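With the image predictor, that would look roughly like this (a minimal sketch; the image path and click coordinates are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(np.array(Image.open("scene.jpg").convert("RGB")))  # placeholder image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[210, 350], [480, 300]], dtype=np.float32),  # placeholder clicks
        point_labels=np.array([1, 1], dtype=np.int32),
        multimask_output=True,  # return all 3 candidate masks
    )

# show the three candidates with their predicted IoU scores
for i, (mask, score) in enumerate(zip(masks, scores)):
    plt.subplot(1, 3, i + 1)
    plt.imshow(mask)
    plt.title(f"mask {i}: {score:.2f}")
plt.show()
```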

kevinpl07 commented 2 weeks ago

Hi @chayryali, thanks for getting back to me!

I think this image describes the issue pretty well. Suppose I want to segment the foreground from the background, where the foreground in this case is the human and the car.

Using two objects, the model struggles to decide where one ends and the other starts - creating gaps in between:

[Screenshot 2024-08-30 at 13 38 06]

If, however, I just treat them as one object, that problem vanishes:

[Screenshot 2024-08-30 at 13 40 57]

What I'm trying to say is that there is a valid use case for "merging" objects into one if you want a consistent mask despite occlusion.

divineSix commented 2 weeks ago

@kevinpl07 I'm trying to track multiple objects using the large model, and I just wanted to confirm the method for passing points for two different objects. Do you just call predictor.add_new_points_or_box multiple times on the same inference state, passing a different object id and the corresponding points each time?
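In code, I mean something like this (a rough sketch; the paths, frame index, and click coordinates are placeholders):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./frames")  # directory of JPEG frames

    # one call per object, each with its own obj_id
    for obj_id, (x, y) in enumerate([(210, 350), (480, 300)], start=1):  # placeholder clicks
        predictor.add_new_points_or_box(
            inference_state=state,
            frame_idx=0,
            obj_id=obj_id,
            points=np.array([[x, y]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),
        )

    # then propagate all objects through the video together
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one boolean mask per obj_id
```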

If so, I had the same issue of the demo performing much better than the large model on my local machine. I have a 3090 that I am experimenting with.

[Screenshot from the demo page]

On my local machine, I instead get poor results when trying to track multiple objects, like the one below.

[Screenshot: brk-s1]

Any advice on properly tracking multiple objects?

scchess commented 1 day ago

I don't think Meta has published their best model + workflow.