facebookresearch / co-tracker

CoTracker is a model for tracking any point (pixel) on a video.
https://co-tracker.github.io/

Efficiency of V2 and adding more points? #67

Open pulkitkumar95 opened 4 months ago

pulkitkumar95 commented 4 months ago

Hey,

First of all, great work. I really enjoyed reading the paper and going over the code, too. There are two things that I want to understand:

  1. You mentioned that cotracker2 is more efficient than cotracker1. Could you help me understand where that efficiency comes from? The one place I found a difference is the transformer that computes the deltas. Looking closely, the transformers in both v1 and v2 are based on space-time attention; v2 additionally uses virtual tracks. Since that increases the number of tokens, the new block should use more memory, not less. If that's true, how is it more efficient? Am I missing something?

  2. In my current use case, I want to add more points as the video progresses, but only in regions where no visible points remain. One option is to add new query points at fixed intervals and remove the redundant ones in post-processing (roughly what I sketch below), but that seems highly inefficient. Is there a smarter way of going about it?
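To make it concrete, here is roughly what I mean. This is just a sketch: the `(t, x, y)` query format and the `torch.hub` entry point follow the README, but the grid size, the 16-frame interval, and the random video tensor are placeholders I made up:

```python
import torch

# Query points in CoTracker are (t, x, y): tracking starts at frame t,
# so later points can be injected simply by querying at later frames.
device = "cuda" if torch.cuda.is_available() else "cpu"
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2").to(device)

video = torch.randn(1, 48, 3, 384, 512, device=device)  # B, T, C, H, W (placeholder)

def grid_queries(t, h, w, step=64):
    # Uniform grid of query points, all starting at frame t.
    ys, xs = torch.meshgrid(
        torch.arange(0, h, step), torch.arange(0, w, step), indexing="ij"
    )
    pts = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()
    t_col = torch.full((pts.shape[0], 1), float(t))
    return torch.cat([t_col, pts], dim=1)  # N x 3: (t, x, y)

# New grid every 16 frames; redundant points would then be pruned in
# post-processing (e.g. dropping queries that start too close to an
# already visible track), which is the inefficient part.
queries = torch.cat([grid_queries(t, 384, 512) for t in range(0, 48, 16)], dim=0)
pred_tracks, pred_visibility = cotracker(video, queries=queries[None].to(device))
```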

nikitakaraevv commented 4 months ago

Hi @pulkitkumar95, thank you!

  1. Even though there are more tokens in the new version, the virtual tracks serve as an attention bottleneck that reduces memory complexity. Instead of quadratic attention between all the tracks (memory cost N*N), we do cross-attention between K virtual tracks and N real tracks (memory cost K*N, where K = 64 is much smaller than N). This allows us to track up to N = 70,000 points. See the sketch after this list for an illustration.
  2. We are currently trying to figure out a good way of doing it ourselves. I don't really have a solution here. If you come up with something, please let us know :)
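Here is a minimal, Perceiver-style sketch of the bottleneck idea. It is illustrative only, not the actual CoTracker2 block; the dimensions, head count, and module names are made up:

```python
import torch
import torch.nn as nn

class TrackBottleneckAttention(nn.Module):
    """Cross-attention bottleneck over tracks (illustrative sketch).

    Instead of N x N self-attention over all real tracks, K learned
    virtual tracks attend to the N real tracks and back, so attention
    memory scales as O(K*N) with K << N rather than O(N*N).
    """

    def __init__(self, dim=256, num_virtual=64, num_heads=8):
        super().__init__()
        self.virtual = nn.Parameter(torch.randn(num_virtual, dim))
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mix = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tracks):                            # tracks: (B, N, dim)
        B = tracks.shape[0]
        v = self.virtual.unsqueeze(0).expand(B, -1, -1)   # (B, K, dim)
        v, _ = self.read(v, tracks, tracks)               # virtual <- real: O(K*N)
        v, _ = self.mix(v, v, v)                          # virtual self-attn: O(K*K)
        out, _ = self.write(tracks, v, v)                 # real <- virtual: O(N*K)
        return tracks + out

# With K fixed at 64, attention memory grows linearly in N, which is
# what makes tracking tens of thousands of points feasible.
blk = TrackBottleneckAttention()
x = torch.randn(2, 1000, 256)
print(blk(x).shape)  # torch.Size([2, 1000, 256])
```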