ClementPinard / SfmLearner-Pytorch

Pytorch version of SfmLearner from Tinghui Zhou et al.
MIT License

Questions regarding pose evaluation and depth scale ambiguity #104

Closed · scott89 closed this issue 4 years ago

scott89 commented 4 years ago

Hi Clément,

Really great work! I have the following two questions.

  1. In my understanding, the ground-truth poses of KITTI Odometry are measured in the camera_00 coordinate frame, while PoseNet is tested on the image sequences captured by camera_02. Shouldn't the predicted poses first be transformed to the camera_00 frame before computing the metrics?
  2. The scale ambiguity of self-supervised learning is mentioned in #45 (i.e., a scaled inverse depth together with a correspondingly scaled pose leads to the same image warping). Although the translation may be scaled, the rotation matrix is converted from Euler angles predicted by PoseNet and cannot be scaled. I am wondering whether this Euler-angle parameterization can partially resolve the scale ambiguity.

Thanks!

ClementPinard commented 4 years ago

Hi, thanks for your interest in this repo.

  1. Yes, you are right, good catch! However, this would only be significant for large rotation values, and since the snippets are only 5 frames long, the results we get are still valid IMO. It's also worth remembering that the KITTI authors themselves said the pose measurements are not very reliable over small snippets. But in principle it's completely true that we should convert the output of PoseNet to the camera_00 coordinate frame (see the first sketch after this list).

  2. Not sure what you mean by that. Rotation has no scale ambiguity, contrary to translation, so even if both translation and rotation come from the same network, they have to be treated differently when it comes to deducing the right depth. The only way I found to address this problem with PoseNet is to measure the actual translation magnitude from the KITTI OXTS data, using either the GPS or the speed from the wheels. More generally, to solve the translational ambiguity you will need, at some point, an anchor to reality, because everything looks the same to a camera whether you are driving a real car or, e.g., a toy car in a scale model of a city. The original author chose the LiDAR, by taking the median of the ground-truth depth map, and this repo proposes either that (which is not ideal for a system that tries to estimate depth without supervision) or the car's speed (ubiquitous and far more reasonable; see the second sketch after this list).
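
For point 1, here is a minimal sketch of the frame conversion, assuming you have already assembled the 4x4 rigid transform from camera_00 to camera_02 out of the KITTI cam-to-cam calibration files (the function and argument names below are illustrative, not part of this repo):

```python
import numpy as np

def cam2_pose_to_cam0(pose_cam2, T_cam0_to_cam2):
    """Express a 4x4 relative pose estimated in the camera_02 frame
    in the camera_00 frame, by conjugating with the inter-camera
    extrinsic (convention here: x_cam2 = T_cam0_to_cam2 @ x_cam0)."""
    T = T_cam0_to_cam2
    return np.linalg.inv(T) @ pose_cam2 @ T
```

Since the rectified KITTI cameras differ almost only by a translation along the baseline, the rotation part is barely affected, which is why the effect is small for short snippets.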
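And for point 2, a sketch of the speed-based anchoring, assuming relative poses between consecutive frames and KITTI's ~10 Hz capture rate (again, illustrative names only; the actual rescaling lives in this repo's evaluation code):

```python
import numpy as np

def rescale_with_speed(poses, speed, dt=0.1):
    """Rescale up-to-scale [R|t] predictions with the measured car speed.

    poses: (N, 3, 4) relative poses between consecutive frames
    speed: vehicle speed in m/s, e.g. read from the KITTI OXTS files
    dt:    inter-frame time (KITTI sequences are captured at ~10 Hz)
    """
    poses = poses.copy()
    # mean predicted inter-frame translation magnitude (arbitrary scale)
    pred_step = np.linalg.norm(poses[:, :, 3], axis=-1).mean()
    # distance actually travelled between frames, from the speed anchor
    real_step = speed * dt
    poses[:, :, 3] *= real_step / pred_step  # rotation is left untouched
    return poses
```

Note that only the translation column is rescaled: as said above, rotation is unambiguous, so it must not be touched.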

scott89 commented 4 years ago

Thanks for your reply. That answers my questions!