coltonstearns / dynamic-gaussian-marbles


Training with In-the-Wild Videos #6

Open sauradip opened 1 month ago

sauradip commented 1 month ago

Hi,

Thanks for the awesome work! I am curious to know a few things:

a) How does your code estimate the camera parameters for in-the-wild videos that don't have any camera information?
b) How do you lift the trajectory to 3D? Are you using metric depth to lift? If so, isn't that inaccurate?

coltonstearns commented 1 month ago

Hello, thanks for the questions!

a) In this release of the code, we assume a simple pinhole camera model: a single focal length for both fx and fy, the principal point (cx, cy) at the center of the image, and no skew or distortion. By default, we assume an 80 degree FOV, from which we then compute an appropriate focal length. The code for this is in lines 47-57 of preprocess/01_format_directory.py. For camera extrinsics, we assume the camera is stationary and looking forward, and we learn dynamics for the background to account for camera motion.
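
For reference, here is a minimal sketch of that intrinsics setup (assuming the 80 degree FOV is measured across the image width; this is just an illustration, so please consult lines 47-57 of preprocess/01_format_directory.py for the actual code):

```python
import numpy as np

def intrinsics_from_fov(width: int, height: int, fov_degrees: float = 80.0) -> np.ndarray:
    """Build a 3x3 pinhole intrinsics matrix with fx == fy, the principal
    point at the image center, and no skew or distortion."""
    fov_radians = np.deg2rad(fov_degrees)
    # Focal length such that half the image width subtends half the FOV.
    focal = (width / 2.0) / np.tan(fov_radians / 2.0)
    cx, cy = width / 2.0, height / 2.0
    return np.array([[focal, 0.0, cx],
                     [0.0, focal, cy],
                     [0.0, 0.0, 1.0]])

K = intrinsics_from_fov(1280, 720)  # e.g. a 720p frame
```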

b) We use monocular metric depth from DepthAnythingV2. It is usually impressively accurate, although it at times exhibits errors (which can cause our method to give a bad reconstruction). Also, instead of directly lifting trajectories to 3D, we initialize per-frame and progressively expand trajectories during our optimization - this makes the optimization slower, but we found it to be a bit more robust to depth and tracking errors.
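
For illustration, here is a minimal sketch of lifting a metric depth map into 3D camera-space points with the pinhole intrinsics above (the function name and array shapes are just assumptions for this example, not the repository's API):

```python
import numpy as np

def backproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift an (H, W) metric depth map to (H, W, 3) points in the camera frame."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel coordinate grids: u = column index, v = row index.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```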

sauradip commented 1 month ago

Thanks for your detailed response! Just a query: is it possible to optimize the trajectory using a sliding window instead of frame by frame, using a fixed camera trajectory as you mentioned in (a)?

coltonstearns commented 1 month ago

Hi, sorry about my delayed response. I'm not sure I quite understand. Are you asking if we could initialize with better camera poses (instead of all the camera poses just being stationary and looking forward)?

If so, then yes! You can initialize the camera poses to be whatever you want, and the foreground/background dynamics will attempt to compensate. In some experiments (not yet released in this version of the code), we initialize camera poses with COLMAP estimates, and can achieve even better results. By default though, we use the stationary "identity" camera matrices because it is simpler (and COLMAP often fails in highly dynamic videos).
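
For concreteness, the default "identity" initialization amounts to something like the following sketch (a hypothetical helper for illustration, not the repository's code):

```python
import numpy as np

def identity_extrinsics(num_frames: int) -> np.ndarray:
    """Return (num_frames, 4, 4) identity camera poses: every frame gets the
    same stationary, forward-looking camera, and the learned foreground/
    background dynamics absorb any real camera motion."""
    return np.tile(np.eye(4), (num_frames, 1, 1))
```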