kevinpl07 opened 3 weeks ago
From what I've seen, there are significant differences between the base/tiny vs. large/small models. If you were using the large model, you may find the base model will work better for this sort of thing (in general the base model seems to more easily segment 'whole object' masks). Of course the web demo may be using yet another (unreleased) model, which could also explain the differences in behavior.
Also, the masking can be very sensitive when using multiple prompts, so if the prompts aren't identically placed, you can get very different masks. For example, here's the mask (using the large model) as you adjust the second point prompt (the mouse cursor). You can see there's a small region just left of the shadow 'corner' in the truck bed that gives the full segmentation, but further left or right is a bit of a mess.
@heyoeyo
Thanks for your input!
I will play around with the other models. I see what you mean about the sensitivity, but I tried it about 5 times with different sets of two points, and the demo had consistently good results while the raw model always had issues. So I don't think that's the issue here.
> the demo had consistently good results while the raw model had always issues
Yes, this does seem to be the case! The web demo eventually shows similar problems when adding more points, which I'd guess means it's just a better model rather than some kind of clever pre-processing. There seemed to be a similar sentiment for the SAMv1 models vs. demo, so maybe it's just a policy thing that they reserve the best model internally.
Hi @kevinpl07, Thanks for sharing your observations! When clicks are across multiple objects, the expected behavior is not well-defined.
You mentioned that one use case of such a feature could be to capture the gaps between the objects. In this example, it's not clear to me which gap should be closed. And how would we determine the "span" of the gaps to be closed, e.g. is the entirety of the rest of the image a valid closure?
Consider a different example: say there are two groups of people in an image, and we place two clicks on two individuals in a group. Are we selecting just the two individuals? Are we selecting two shirts? Are we selecting the whole group? Are we selecting both groups of individuals? Are we selecting the whole image? There is ambiguity, and the "correct mask" is not well-defined.
Can you try visualizing the masks from multimask? One of those 3 masks might be what you are looking for.
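To illustrate the multimask suggestion: when the predictor is asked for multiple candidates (e.g. with `multimask_output=True`), it returns several masks together with predicted quality scores, and you can inspect them all rather than taking the first. A minimal sketch of sorting and inspecting the candidates, using random numpy stand-ins for the real model outputs:

```python
import numpy as np

# Stand-ins for the three candidate masks and their predicted IoU scores,
# as a multimask prediction would return them.
# Shapes: masks (3, H, W) boolean, scores (3,).
rng = np.random.default_rng(0)
masks = rng.random((3, 4, 4)) > 0.5
scores = np.array([0.71, 0.93, 0.64], dtype=np.float32)

# Sort candidates from highest to lowest predicted quality and inspect each.
order = np.argsort(scores)[::-1]
for rank, i in enumerate(order):
    print(f"rank {rank}: candidate {i}, score={scores[i]:.2f}, area={int(masks[i].sum())} px")

best_mask = masks[order[0]]  # the highest-scoring candidate
```

Visualizing all three (not just `best_mask`) is the point here: one of the lower-scoring candidates may be the "whole object" interpretation you want.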
Hi @chayryali, thanks for getting back to me!
I think this image describes the issue pretty well. Suppose I want to segment the foreground from the background. Foreground in this case is the human and the car.
Using two objects, the model struggles to decide where one ends and the other starts, creating gaps in between:
If however, I just treat them as one object, that problem vanishes:
What I'm trying to say is that there is a valid use case for "merging" objects into one if you want a consistent mask despite occlusion.
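The gap effect described above can be reproduced with toy masks: taking the union of two separately predicted per-object masks can leave unlabeled pixels along their shared boundary, while a single merged-object mask covers everything. A minimal sketch with made-up masks:

```python
import numpy as np

# Toy 1x7 'image': two objects segmented separately, each stopping short
# of the shared boundary and leaving a gap column between them.
person = np.array([[1, 1, 1, 0, 0, 0, 0]], dtype=bool)
car    = np.array([[0, 0, 0, 0, 1, 1, 1]], dtype=bool)

union = person | car  # per-object masks merged after the fact

# Mask you'd get by prompting both objects as one (hypothetical ideal).
merged_prompt = np.ones((1, 7), dtype=bool)

gap = merged_prompt & ~union  # foreground pixels lost to the inter-object gap
print(int(gap.sum()))         # → 1
```

The single-object prompt has no such seam, which is exactly why merging works better for a foreground/background split.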
@kevinpl07 I'm trying to track multiple objects using the large model, and just wanted to confirm the method of actually passing two points of two different objects. Do you just call predictor.add_new_points_or_box
multiple times on the same state, passing a different object id and corresponding points to it?
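For reference, here is how I understand the per-object workflow being asked about: one call per object id on the same inference state, then a single propagation pass. The SAM2 call names and signatures below are assumptions based on the repo and are left as comments (unverified); only the prompt bookkeeping actually runs. The coordinates are made up for illustration.

```python
import numpy as np

# One positive click per object; obj_id -> (points, labels).
prompts = {
    1: (np.array([[210.0, 350.0]], dtype=np.float32), np.array([1], dtype=np.int32)),
    2: (np.array([[460.0, 280.0]], dtype=np.float32), np.array([1], dtype=np.int32)),
}

# With the SAM2 video predictor (API names assumed, not verified here):
# state = predictor.init_state(video_path="video.mp4")
# for obj_id, (points, labels) in prompts.items():
#     predictor.add_new_points_or_box(
#         inference_state=state,
#         frame_idx=0,
#         obj_id=obj_id,   # one track per object id
#         points=points,
#         labels=labels,   # 1 = positive click, 0 = negative
#     )
# for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
#     ...  # per-frame mask logits for every tracked object

for obj_id, (points, labels) in prompts.items():
    print(obj_id, points.shape, labels.shape)
```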
If so, I had the same issue of the demo performing much better than the large model on my local machine (I'm experimenting on a 3090). Here's the result from the demo page:
On my local machine, I instead get poor results when trying to track multiple objects. Like the one below.
Any advice on properly tracking multiple objects?
I don't think Meta has published their best model + workflow.
I'm working on an application similar to your hosted demo. When I use the same objectId for two different objects, it still works great in the demo, but not at all when running it myself.
It looks good for a single point:
But for one point on the human and one point on the car it looks like this:
Meanwhile in your demo:
What is the reason for that behaviour? I understand that we should just use separate objectIds, but I'm nevertheless interested in why it looks so much better in your demo.
EDIT: Actually, there is a very valid use case for merging the objects. If you want to classify something as foreground and background, using two objects leaves gaps between them; merging them into one object closes those gaps automatically.
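Concretely, "merging" here just means passing both clicks under a single prompt instead of two separate object ids. A sketch of building such a combined prompt (coordinates made up; the predictor call is an assumption, kept as a comment):

```python
import numpy as np

# One click on the human, one on the car, combined into a single prompt
# so the model treats them as one foreground object.
point_human = [210.0, 350.0]  # made-up (x, y) pixel coordinates
point_car   = [460.0, 280.0]

points = np.array([point_human, point_car], dtype=np.float32)  # shape (2, 2)
labels = np.ones(len(points), dtype=np.int32)                  # all positive clicks

# With a SAM2 image predictor (call assumed, not verified here):
# masks, scores, _ = predictor.predict(
#     point_coords=points, point_labels=labels, multimask_output=False)

print(points.shape, labels.tolist())
```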
Any insight is greatly appreciated!