facebookresearch / co-tracker

CoTracker is a model for tracking any point (pixel) on a video.
https://co-tracker.github.io/

The result of tracking point from the middle of video is not precise #33

Closed ernestchu closed 9 months ago

ernestchu commented 9 months ago

Hi, thanks for your great work. When I tried your notebook demo, I ran into some ambiguity when tracking manually selected points.

import torch

# Each row is one query point: (frame index, x, y)
queries = torch.tensor([
    [0., 400., 350.],  # point tracked from the first frame
    [10., 600., 500.], # frame number 10
    [20., 750., 600.], # ...
    [30., 900., 200.]
])

Let's say we are interested in queries[1], which refers to a point in frame number 10. I would expect the model to output a trajectory of all (0, 0) and a visibility of False for timestamps 0 to 9. However, when inspecting pred_visibility, this holds only for the first four timestamps (the same problem also appears in pred_tracks).

pred_visibility[:, :, 1]

tensor([[False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         False, False, False,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True]], device='cuda:0')

Why is that? Thanks!

nikitakaraevv commented 9 months ago

Hi @ernestchu, thank you for your question! The model works with sliding windows. As soon as the frame of interest (in this case, frame number 10) falls within a sliding window, the model begins providing visibility predictions for that point throughout the entire window. The sliding window has a size of 8 frames with an overlap of 4 frames, so the windows cover frames 0-7, 4-11, 8-15, and so on. Frame number 10 first falls within the second window (frames 4-11), so predictions for that point start at frame 4. This explains why the visibility is set to "False" only for the first four timestamps in this case (the same is true for trajectories). You can simply discard these predictions if you don't need them.
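To make the window arithmetic concrete, here is a minimal sketch in plain Python. It assumes only what is stated above (window size 8, overlap 4, windows starting at frame 0) and is not taken from the CoTracker code. The helper names `first_predicted_frame` and `valid_mask` are hypothetical, and the frame count of 82 is simply the length of the tensor printed in the question.

```python
import math


def first_predicted_frame(query_frame, window_len=8, overlap=4):
    """Start frame of the first sliding window that contains query_frame.

    Assumes windows start at 0, stride, 2*stride, ... with
    stride = window_len - overlap (an assumption based on the description
    in this thread, not on the model's source code).
    """
    stride = window_len - overlap
    # Smallest window index k such that k*stride + window_len - 1 >= query_frame.
    k = max(0, math.ceil((query_frame - window_len + 1) / stride))
    return k * stride


def valid_mask(query_frames, num_frames):
    """mask[t][n] is True only from the query frame of point n onward."""
    return [[t >= q for q in query_frames] for t in range(num_frames)]


query_frames = [0, 10, 20, 30]  # frame indices from the queries in the question
starts = [first_predicted_frame(q) for q in query_frames]
print(starts)

# Hypothetical frame count, taken from the length of the printed tensor.
mask = valid_mask(query_frames, num_frames=82)
```

For the point queried at frame 10, the computed start is frame 4, which matches the four leading False values in the question. To discard the early predictions, a mask like this could be turned into a tensor and combined with the model output, for example `pred_visibility & torch.tensor(mask, device=pred_visibility.device)`, assuming pred_visibility has shape (batch, frames, points).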

ernestchu commented 9 months ago

Thanks for your detailed response!