Velocity Supervision Clarification

TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository

MIT License


Hello,

Thank you for sharing this amazing work. I have a quick question for clarification on velocity supervision.

From velocity_loss.py line 34 (gt_trans = [pose[:, :3, -1].norm(dim=-1) for pose in gt_pose_context]), it appears that the velocity loss uses the difference in Euclidean distance between the predicted and ground truth pose between images. Is this correct?

On the other hand, Equation (6) from "3D Packing for Self-Supervised Monocular Depth Estimation" appears to suggest that this supervision comes from "the measured instantaneous velocity scalar v multiplied by the time difference between target and source frames..." Is this calculation performed somewhere else? Are these equivalent? Did you try velocity and it didn't work as well as the GT pose? Any clarification is very much appreciated.

Thanks again! Brent

In my humble opinion, the instantaneous velocity is ultimately multiplied by the time difference between the frames to get the relative displacement between them.
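As a rough illustration of that relationship, here is a minimal sketch in plain Python. The function name and all numeric values are hypothetical and not taken from the repository; it only assumes the speed is constant between the two frames.

```python
import math


def displacement_from_velocity(speed_mps, t_target_s, t_source_s):
    """Distance travelled between two frames, assuming constant speed.

    Hypothetical helper for illustration; not part of packnet-sfm.
    """
    return abs(speed_mps) * abs(t_source_s - t_target_s)


# Hypothetical measurement: 10 m/s, frames 0.1 s apart.
distance_from_speed = displacement_from_velocity(10.0, 0.0, 0.1)

# Hypothetical translation vector between the same two frames (metres).
translation = (0.6, 0.0, 0.8)
distance_from_translation = math.sqrt(sum(c * c for c in translation))

# Under the constant-speed assumption, both give the same distance.
print(math.isclose(distance_from_speed, distance_from_translation))
```

If the translation really comes from motion at that speed, its Euclidean norm and the speed-times-time product describe the same quantity, so either can serve as the scale reference.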

The reason they do this,

"From velocity_loss.py line 34 (gt_trans = [pose[:, :3, -1].norm(dim=-1) for pose in gt_pose_context]), it appears that the velocity loss uses the difference in Euclidean distance between the predicted and ground truth pose between images. Is this correct?"

is explained in this comment:

Hi, thank you for the interest in our repository! You don't need to provide ground-truth pose, only instantaneous velocity. We provide the full 4x4 matrix because that is available, but we don't use it. Our pred_trans and gt_trans only take the last column, which contains translation, and that's what is being used to calculate the loss.

Originally posted by @VitorGuizilini-TRI in https://github.com/TRI-ML/packnet-sfm/issues/91#issuecomment-731652275
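To make the quoted explanation concrete, here is a minimal sketch of a loss that uses only the translation column of each 4x4 pose. It is written with NumPy rather than PyTorch, the function names are hypothetical, and the L1 penalty averaged over context frames is an assumption for illustration, not a copy of velocity_loss.py.

```python
import numpy as np


def translation_magnitude(pose):
    """Euclidean norm of the translation column for a batch of (B, 4, 4) poses."""
    return np.linalg.norm(pose[:, :3, -1], axis=-1)


def velocity_loss_sketch(pred_poses, gt_poses):
    """Mean absolute difference of translation magnitudes, averaged over context frames.

    Assumption: an L1 penalty. The rotation part of each pose is ignored.
    """
    per_context = [
        np.abs(translation_magnitude(p) - translation_magnitude(g)).mean()
        for p, g in zip(pred_poses, gt_poses)
    ]
    return sum(per_context) / len(per_context)


def make_pose(translation):
    """Hypothetical helper: a batch of one identity pose with the given translation."""
    pose = np.eye(4)[None].copy()  # shape (1, 4, 4)
    pose[0, :3, -1] = translation
    return pose


# One context frame; all values are made up for illustration.
pred = [make_pose([0.0, 0.0, 0.9])]  # predicted translation, norm 0.9
gt = [make_pose([0.6, 0.0, 0.8])]    # reference translation, norm 1.0
loss = velocity_loss_sketch(pred, gt)
print(loss)
```

Because only the norm of the last column enters the computation, a reference distance obtained from speed multiplied by the time difference would play the same role as the norm of a ground-truth translation.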


Velocity Supervision Clarification #135