Open aartykov opened 1 month ago
Hello! Thanks for the cool work! How does the method handle the case where we freely move the camera through the scene? I ask because point-tracking methods such as CoTracker are negatively affected by camera movement.
Hello, thanks for the question! We can handle a freely moving camera, although fast and extensive camera motion will certainly cause our method to do worse. Our ability to find correspondences through time largely relies on CoTracker and TrackAnything (although the RGB and depth rendering losses likely help a little as well). Furthermore, our rigidity priors can help aggregate noisy CoTracker information into cleaner motion.
Regarding CoTracker though, we did find that it frequently fails to estimate good tracks (both with a moving camera and without). One way we mitigate the effect of its errors is by using "shorter" CoTracker predictions (which are usually higher quality) - that is, we supervise with CoTracker predictions that only span 12 frames, because we found tracking quality can degrade significantly after ~12 frames.
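For illustration, here is a rough sketch (not our actual training code) of what running CoTracker over short, overlapping windows can look like. The `torch.hub` entry point and the `(video, grid_size=...)` call follow CoTracker's public demo; the window and stride values are just placeholders:

```python
import torch

# Sketch: track points only within short 12-frame windows, so that errors
# cannot accumulate over long time spans. Assumes CoTracker2 via torch.hub.
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2").eval()

def short_window_tracks(video, window=12, stride=6, grid_size=30):
    """video: (1, T, 3, H, W) float tensor (as in CoTracker's demo).
    Returns a list of (start_frame, tracks, visibility) tuples, one per window."""
    T = video.shape[1]
    results = []
    for start in range(0, max(T - window, 0) + 1, stride):
        clip = video[:, start:start + window]
        # Track a fresh grid of points within this short clip only.
        tracks, visibility = cotracker(clip, grid_size=grid_size)
        results.append((start, tracks, visibility))  # tracks: (1, window, grid_size**2, 2)
    return results
```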
Hello again. Thanks for the detailed answer. Does splitting the video into 12-frame chunks help to handle extensive camera motion as well?
Only supervising with CoTracker predictions that span 12 frames helps provide more accurate tracking supervision. However, the length of the CoTracker supervision is independent of how we divide the video into "chunks". For instance, we could have Gaussian trajectories that span 128 frames, but apply our tracking loss to various length-12 subsequences of those trajectories.
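As a concrete (hypothetical) sketch of "long trajectories, short supervision": `pred_tracks_2d` below stands in for the Gaussians' trajectories projected into the image, and `cotracker_tracks`/`visibility` for the short-range CoTracker predictions aligned to the same frames. Only length-12 slices are ever compared; the names and the L1 form of the loss are illustrative, not our exact implementation.

```python
import torch

def windowed_tracking_loss(pred_tracks_2d, cotracker_tracks, visibility, span=12):
    """
    pred_tracks_2d:   (T, N, 2) 2D projections of the Gaussian trajectories.
    cotracker_tracks: (T, N, 2) CoTracker predictions for the same points/frames.
    visibility:       (T, N)    CoTracker visibility mask.
    Supervises only over length-`span` subsequences, so a 128-frame trajectory
    is never compared against a 128-frame CoTracker track.
    """
    T = pred_tracks_2d.shape[0]
    loss, n = pred_tracks_2d.new_zeros(()), 0
    for start in range(0, T - span + 1, span):
        sl = slice(start, start + span)
        diff = (pred_tracks_2d[sl] - cotracker_tracks[sl]).abs()
        mask = visibility[sl].unsqueeze(-1).float()
        loss = loss + (mask * diff).sum() / mask.sum().clamp(min=1.0)
        n += 1
    return loss / max(n, 1)
```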
However, as you mention, it is also important to split the video up into "chunks". In fact, with highly dynamic content (such as the DyCheck dataset), we find it best to learn Gaussian trajectories that only span chunks of 8 frames for the foreground and 32 frames for the background.
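To make the chunking concrete, a trivial sketch (the 8/32 chunk lengths are the values mentioned above; everything else is illustrative):

```python
def chunk_ranges(num_frames, chunk_len):
    """Split [0, num_frames) into consecutive chunks of `chunk_len` frames
    (the last chunk may be shorter)."""
    return [(s, min(s + chunk_len, num_frames)) for s in range(0, num_frames, chunk_len)]

# For a 128-frame DyCheck-style sequence:
fg_chunks = chunk_ranges(128, 8)   # foreground trajectories span 8-frame chunks
bg_chunks = chunk_ranges(128, 32)  # background trajectories span 32-frame chunks
```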
Hello! Thanks for your quick response. Let me ask a question that is a bit out of scope. In the case of a freely moving monocular camera, when I backproject the CoTracker predictions into 3D using depth maps, I suspiciously get varying 3D coordinates for the same static point over time. However, I would expect a static point to have the same 3D coordinates over time. Can you please comment on this? Thanks!
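For concreteness, the backprojection I am doing is essentially the standard unprojection with each frame's depth map and intrinsics `K` (rough sketch, the variable names are just illustrative):

```python
import torch

def backproject(track_xy, depth, K):
    """
    track_xy: (N, 2) pixel coordinates of the tracked points in one frame.
    depth:    (H, W) depth map of that frame.
    K:        (3, 3) camera intrinsics.
    Returns (N, 3) points expressed in *that frame's* camera coordinates.
    """
    x = track_xy[:, 0].round().long().clamp(0, depth.shape[1] - 1)
    y = track_xy[:, 1].round().long().clamp(0, depth.shape[0] - 1)
    z = depth[y, x]                                                        # depth per track
    pix = torch.cat([track_xy, torch.ones_like(z).unsqueeze(-1)], dim=-1)  # homogeneous pixels (N, 3)
    rays = (torch.linalg.inv(K) @ pix.T).T                                 # K^{-1} [u, v, 1]^T
    return rays * z.unsqueeze(-1)                                          # scale by depth
```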
Hello again,
Thanks for the follow-up! This is to be expected, because we do not have camera poses (we pretend the camera is stationary). As an intuitive analogy, think of it like being on a train. From your perspective, the landscape is moving. If you look at a tree ahead of you, it will be close at one moment and far away the next (even though the tree is not actually moving). Similarly, we model everything from the camera's perspective, as if the camera were the train: the "stationary" background appears to move.
We could instead estimate and use camera poses, thereby reasoning in a reference frame where static content is in fact static. However, estimating camera poses can be challenging and can sometimes fail outright, so we opted for the camera's frame of reference. Nevertheless, I am currently working on code to estimate camera poses when possible, since it is much easier to learn motion when we don't have to worry about the "static" background moving.
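To make the geometry concrete (this is not part of our released code, just a sketch): the backprojected points in your snippet live in each frame's camera coordinates, which is why static content appears to move. If we had a per-frame camera-to-world pose `T_wc`, transforming the points into a shared world frame would make static points line up (up to depth and pose noise):

```python
import torch

def camera_to_world(points_cam, T_wc):
    """
    points_cam: (N, 3) points in one frame's camera coordinates.
    T_wc:       (4, 4) camera-to-world pose of that frame.
    Returns (N, 3) points in a shared world frame. For truly static content,
    these should roughly coincide across frames, unlike the camera-frame points.
    """
    homo = torch.cat([points_cam, torch.ones_like(points_cam[:, :1])], dim=-1)  # (N, 4)
    return (T_wc @ homo.T).T[:, :3]
```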