facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Prompts "from the future" #253

Open PierreId opened 3 weeks ago

PierreId commented 3 weeks ago

The paper states that "It is possible for prompted frames to also 'come from the future' relative to the current frame". I found a "reverse" parameter in predictor.propagate_in_video() that seems to only reverse the frame order. Is that it, or did I miss something more complex?

I guess the highest performance can be achieved by adding a prompt on an object in the frame where it is biggest (highest resolution) and least occluded/truncated, then propagating forward (reverse=False) and backward (reverse=True) to track it through the whole video.
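As a minimal sketch of that forward + backward workflow, the helper below computes the frame-visit order for the two passes. This is plain Python with no SAM 2 dependency, and the visiting order is my reading of how propagate_in_video walks frames from the prompted frame, not the actual implementation:

```python
def propagation_schedule(num_frames, prompt_frame_idx):
    """Frame-visit order for a forward pass (reverse=False) followed by a
    backward pass (reverse=True), both starting at the prompted frame.

    Illustrative only: this mirrors my understanding of how
    propagate_in_video walks frames, not the real SAM 2 code.
    """
    forward = list(range(prompt_frame_idx, num_frames))   # t, t+1, ..., N-1
    backward = list(range(prompt_frame_idx, -1, -1))      # t, t-1, ..., 0
    return forward, backward


# Prompt on frame 4 of a 10-frame clip: the two passes together cover
# every frame exactly once (the prompted frame is the start of both).
fwd, bwd = propagation_schedule(num_frames=10, prompt_frame_idx=4)
print(fwd)  # [4, 5, 6, 7, 8, 9]
print(bwd)  # [4, 3, 2, 1, 0]
```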

Joeycho commented 2 weeks ago

Hi, I also tried this "reverse" option, and it works fine. E.g., with 50 total slices, starting from slice 7, it propagates 7 -> 0.

At first I thought the order of the forward and backward passes for whole-video inference would not matter (forward+backward vs. backward+forward), but I observed weird behavior that I still cannot explain: if I ran forward first, the subsequent backward pass failed to store proper inference values. The other order did not have this problem (there, the passes were bidirectional and interchangeable).

I'm trying to figure out what exactly happened there.

heyoeyo commented 2 weeks ago

In terms of code, I believe the 'prompts from the future' support shows up in the select_closest_cond_frames function, whose docstring says stored prompts are taken in this order:

a) the closest conditioning frame before frame_idx (if any);
b) the closest conditioning frame after frame_idx (if any);
c) any other temporally closest conditioning frames until reaching a total of max_cond_frame_num conditioning frames.

So specifically point (b): frames after the current frame index are used if available. As far as I can tell, the relative 'position' (in time) of a prompt isn't used computationally; it only affects which prompts are preferred when many are available.
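For reference, that selection rule can be sketched in plain Python. This is a simplified re-implementation of the behaviour the docstring describes, working on bare frame indices only; the real SAM 2 function also handles an unlimited budget and carries the stored frame outputs along, so treat this as an illustration rather than the actual code:

```python
def select_closest_cond_frames(frame_idx, cond_frame_indices, max_cond_frame_num):
    """Pick up to max_cond_frame_num conditioning frames for frame_idx:
    (a) the closest one before, (b) the closest one after ("from the
    future"), then (c) the temporally closest remaining ones.

    Simplified sketch; not the actual SAM 2 implementation.
    """
    if len(cond_frame_indices) <= max_cond_frame_num:
        return sorted(cond_frame_indices)

    selected = set()
    before = [t for t in cond_frame_indices if t < frame_idx]
    if before:                      # (a) closest conditioning frame before
        selected.add(max(before))
    after = [t for t in cond_frame_indices if t >= frame_idx]
    if after:                       # (b) closest conditioning frame after
        selected.add(min(after))
    remaining = sorted(
        (t for t in cond_frame_indices if t not in selected),
        key=lambda t: abs(t - frame_idx),
    )                               # (c) temporally closest of the rest
    selected.update(remaining[: max_cond_frame_num - len(selected)])
    return sorted(selected)


# With prompts at frames 0, 2, 8, 9 and 20, tracking frame 5 on a budget
# of 3 keeps frame 2 (before), frame 8 (after) and frame 9 (next closest).
print(select_closest_cond_frames(5, [0, 2, 8, 9, 20], 3))  # [2, 8, 9]
```

Note how frame 8 (from the future) is kept ahead of frame 0, even though frame 0 would be the "natural" choice for purely causal tracking.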

That being said, from what I've seen the 'conditioned' memory is quite robust: it can track objects even when the prompted frame doesn't closely match the frame being tracked (there is some discussion of this in issue #210). So the relative timing really doesn't matter much.

PierreId commented 2 weeks ago

Thanks @heyoeyo! I was actually misled by the fact that, by default, propagate_in_video() starts processing from the first frame with a prompt. If you set the "start_frame_idx" parameter (e.g. to 0), SAM 2 uses the memory bank and detects objects with prompts from the future.

In my tests, with a crowd of similar objects, using the "reverse=True" parameter gives better performance (fewer false positives).

heyoeyo commented 2 weeks ago

using the "reverse=True" parameter gives better performance

That's interesting! I was under the impression the reverse parameter just ran the video tracking backwards (i.e. as if the video were played in reverse), but from what you and @Joeycho say, it sounds like there may be more to it?

PierreId commented 2 weeks ago

No, it's exactly that: reverse=True only runs the video backward.

For example, consider a scenario where you only prompt once an object becomes entirely visible (not occluded/truncated), and it takes N frames to get there. From what I have seen, you will get better performance using reverse=True on the first N-1 frames than by relying on the memory bank (prompts from the future). This is especially true when the scene contains many similar objects, which can lead to false positives.

heyoeyo commented 2 weeks ago

you will get better performance using reverse=True on the first N-1 frames than by relying on the memory bank (prompts from the future).

Oh cool, that's a helpful tip!