PierreId opened this issue 3 weeks ago
Hi, I also tried this "reverse" option, and it works fine. E.g., with 50 total slices, starting from frame 7, it runs 7 -> 0.
At first I thought the order of the forward and backward passes for whole-video inference would not matter (forward+backward vs. backward+forward), but I observed weird behavior that I still cannot figure out: if I ran the forward pass first, the subsequent backward pass failed to store proper inference values. This does not happen in the other order (backward first, then forward), where the passes are bidirectional and interchangeable.
I'm still trying to figure out what exactly happens there.
In terms of code, I believe the 'prompts from the future' support shows up in the `select_closest_cond_frames` function, whose docstring says the stored prompts are taken in this order:
a) the closest conditioning frame before `frame_idx` (if any);
b) the closest conditioning frame after `frame_idx` (if any);
c) any other temporally closest conditioning frames until reaching a total of `max_cond_frame_num` conditioning frames.
So specifically point (b), where they'll take frames after the current frame index if possible. As far as I can tell, the relative 'position' (in time) of the prompts isn't used computationally (it only affects the preference of which prompts get used if there are lots available).
That being said, from what I've seen the 'conditioned' memory is quite robust in that it can track onto things even when the prompted frame doesn't closely match the frame being tracked (some discussion of this in issue #210). So the relative timing really doesn't matter much.
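For reference, that selection rule can be sketched as a small standalone function. This is a hypothetical re-implementation of the rule as the docstring describes it, not the actual SAM2 code; the names `frame_idx`, `cond_frame_indices`, and `max_cond_frame_num` just mirror the docstring:

```python
def select_closest_cond_frames(frame_idx, cond_frame_indices, max_cond_frame_num):
    """Sketch of the selection rule described above (hypothetical):
    a) the closest conditioning frame before frame_idx,
    b) the closest conditioning frame after frame_idx,
    c) other temporally closest frames up to max_cond_frame_num."""
    if max_cond_frame_num == -1 or len(cond_frame_indices) <= max_cond_frame_num:
        return sorted(cond_frame_indices)

    selected = set()
    before = [t for t in cond_frame_indices if t < frame_idx]
    if before:
        selected.add(max(before))  # (a) closest frame before frame_idx
    after = [t for t in cond_frame_indices if t >= frame_idx]
    if after:
        selected.add(min(after))   # (b) closest frame after frame_idx
    # (c) fill up with the temporally closest of the remaining frames
    remaining = sorted(
        (t for t in cond_frame_indices if t not in selected),
        key=lambda t: abs(t - frame_idx),
    )
    selected.update(remaining[: max_cond_frame_num - len(selected)])
    return sorted(selected)

# With a budget of 3: picks 8 (closest before 10), 15 (closest after),
# then 2 (nearest of the rest)
assert select_closest_cond_frames(10, [2, 8, 15, 30, 50], 3) == [2, 8, 15]
```

Note that relative position only drives *which* prompts are kept when the budget is exceeded, matching the point above that timing isn't used computationally.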
Thanks @heyoeyo ! I was actually misled by the fact that, by default, the propagate_in_video() function starts processing from the first frame with a prompt. If you set the "start_frame_idx" parameter (e.g. to 0), SAM2 uses the memory bank and detects objects with prompts from the future.
In my tests, with a crowd of similar objects, using the "reverse=True" parameter gives better performance (i.e. fewer false positives).
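The behaviour described above can be illustrated with a tiny helper that computes which frames a single propagation pass would visit. This is a sketch under the assumed semantics from this thread (default start at the first prompted frame, `start_frame_idx` overriding it, `reverse=True` walking toward frame 0), not SAM2's own code:

```python
def propagation_frames(num_frames, first_prompt_frame,
                       start_frame_idx=None, reverse=False):
    """Hypothetical helper: the frame indices a single
    propagate_in_video-style pass would visit, under the
    behaviour described in this thread."""
    start = first_prompt_frame if start_frame_idx is None else start_frame_idx
    if reverse:
        # Backward pass: start -> 0 (e.g. 7 -> 0 in the example above)
        return list(range(start, -1, -1))
    # Forward pass: start -> last frame
    return list(range(start, num_frames))

# Default: processing begins at the prompted frame, so frames 0-6 are skipped
assert propagation_frames(50, 7) == list(range(7, 50))
# With start_frame_idx=0, the earlier frames are processed too,
# relying on the memory bank and the prompt "from the future"
assert propagation_frames(50, 7, start_frame_idx=0) == list(range(0, 50))
```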
> using the "reverse=True" parameter gives better performance
That's interesting! I was under the impression the `reverse` parameter just ran the video tracking backwards (i.e. as if the video was played in reverse), but from what you and @Joeycho say, it sounds like there may be more to it?
No, it's exactly that: `reverse=True` only runs the video backward.
For example, consider a scenario where you only prompt once an object becomes entirely visible (i.e. not occluded/truncated), which takes N frames. From what I have seen, you will get better performance using `reverse=True` on the first N-1 frames rather than relying on the memory bank (from the future). This is especially true if the scene contains many similar objects, which could lead to false positives.
> you will get better performance using `reverse=True` on the first N-1 frames rather than relying on the memory bank (from the future).
Oh cool, that's a helpful tip!
The paper states that "It is possible for prompted frames to also 'come from the future' relative to the current frame". I found a "reverse" parameter in predictor.propagate_in_video() that seems to only reverse the frame order. Is that it, or did I miss something more complex?
I guess the highest performance can be achieved when we add a prompt on an object while it is at its largest (more resolution) and least occluded/truncated. Then we propagate forward (reverse=False) + backward (reverse=True) to track it across the whole video.
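That forward + backward scheme amounts to two passes from the prompted frame (in SAM2 terms, two calls to propagate_in_video, first with reverse=False and then with reverse=True). The helper below is a hypothetical illustration of the resulting frame coverage, not the predictor API itself:

```python
def bidirectional_coverage(num_frames, prompt_frame):
    """Frames visited by a forward pass (reverse=False) followed by a
    backward pass (reverse=True), both starting at the prompted frame.
    Hypothetical sketch of the scheme described above."""
    forward = list(range(prompt_frame, num_frames))  # prompt_frame -> end
    backward = list(range(prompt_frame, -1, -1))     # prompt_frame -> 0
    return forward, backward

fwd, bwd = bidirectional_coverage(50, 7)
# Together the two passes cover every frame of the video
assert sorted(set(fwd) | set(bwd)) == list(range(50))
```

Prompting at the frame where the object is largest and unoccluded then lets both passes start from the strongest conditioning frame, instead of one pass depending on memory from the other direction.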