TongjiZhaohb / StereoCalibrator


Why can we just use the first row of the right rotation matrix as our translation vector? #3

Closed · chrisoffner3d closed this issue 5 months ago

chrisoffner3d commented 5 months ago

Equation (7) in the paper states that $\boldsymbol{R} = \boldsymbol{R}_r^{-1} \boldsymbol{R}_l$ and $\boldsymbol{t} = -\boldsymbol{r}_{r,1}.$

Taking the x-axis of the camera as the translation vector seems appropriate for a rectified and fronto-parallel camera setup, but here we are trying to predict the pose of an arbitrary camera configuration. In the general setting, the cameras are not necessarily offset only along the x-axis of the right camera.

Below you can see some plots from an experiment I ran using synthetic data with known ground-truth extrinsics ("poses"). In my experiments with this method, the translation error $e_t$ is significantly greater than the rotation error $e_R$ in most frames. By "translation error" and "rotation error" I mean the angular difference (geodesic distance) between the ground-truth and the estimated translation and rotation, respectively.

Following the convention popularised for the annual Image Matching Challenge (IMC) by [Jin et al., 2020], the pose error would be the maximum of these angular rotation and translation errors. In my experiments, this pose error is determined almost always by the translation error because $\max(e_R, e_t) = e_t$ in most cases.
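
For concreteness, here is a minimal sketch of how I compute these metrics (NumPy-based; the function names are my own, not from the paper):

```python
import numpy as np

def rotation_error_deg(R_gt: np.ndarray, R_est: np.ndarray) -> float:
    """Geodesic distance between two rotation matrices, in degrees."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error_deg(t_gt: np.ndarray, t_est: np.ndarray) -> float:
    """Angle between the translation directions, in degrees (scale-invariant)."""
    cos_angle = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# IMC-style pose error (Jin et al., 2020): the max of the two angular errors.
# pose_error = max(rotation_error_deg(R_gt, R_est),
#                  translation_error_deg(t_gt, t_est))
```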

It appears to me that this method produces highly accurate rotation estimates but does not estimate a translation vector $\boldsymbol{t}$ that closely matches the ground-truth translation $\boldsymbol{t}_\text{gt}$, because it assumes that the cameras are offset only along the x-axis of the right camera. However, cameras can be (and often are) also offset along the optical (z-)axis of either camera, and/or along the up/down (y-)axis.

I don't think that I have a good understanding of this though, and I may be missing some crucial aspects here. If you could provide some clarification, I would highly appreciate it. Thank you and best regards!

[Screenshot: per-frame plots of the rotation error $e_R$ and translation error $e_t$ from the synthetic-data experiment]
TongjiZhaohb commented 5 months ago

Our paper focuses exclusively on the online calibration of stereo cameras. For stereo cameras, the aim of rectification is to satisfy Equation (6). Consequently, we can derive Expression (7) from Equations (1), (5), and (6). Regarding the other scenarios you mentioned, our algorithm does encounter limitations. I am currently investigating this issue further and appreciate your constructive feedback!

chrisoffner3d commented 5 months ago

Thank you for your response. :) I'm not entirely sure what "other scenarios" you're referring to, as I do believe that I am also and exclusively talking about online calibration of stereo cameras.

I'm very impressed and intrigued by your technique, and the pose error (the maximum of the respective angular estimation errors for rotation and translation, i.e. $\max(e_R, e_t) $) is considerably lower than for other online pose calibration methods I've tested – most of which estimate the fundamental matrix $\mathbf{F}$ instead of directly optimising for the pose $(\mathbf{R}, \mathbf{t})$ that provides the optimal rectifying homography.

I understand and agree that the camera coordinates ${\boldsymbol{p}_l^C}'$ and ${\boldsymbol{p}_r^C}'$ of the rectified points need to satisfy equation (6), i.e. $${\boldsymbol{p}_l^C}' = {\boldsymbol{p}_r^C}' + \boldsymbol{i}_1.$$

However, I unfortunately cannot quite follow how combining equations (5), (6), and (1) results in equation (7).

Unless I'm mistaken, the first row $\boldsymbol{r}_{r,1} = \left[ r_{11}\ \ r_{12}\ \ r_{13} \right]$ of the right rotation matrix

$$\boldsymbol{R}_r = \begin{bmatrix}
  r_{11} & r_{12} & r_{13} \\
  r_{21} & r_{22} & r_{23} \\
  r_{31} & r_{32} & r_{33}
\end{bmatrix} = \begin{bmatrix} \boldsymbol{r}_{r,1} \\ \boldsymbol{r}_{r,2} \\ \boldsymbol{r}_{r,3} \end{bmatrix}$$

represents the direction of the right camera's x-axis in the world coordinate system. However, the true relative pose $(\mathbf{R}, \mathbf{t})$ of the cameras may be such that the left camera does not lie along the x-axis of the right camera (or vice versa).
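
To make this concrete, here is a small numeric sketch of my reading of the situation (my own construction, not from the paper; I assume the rows of $\boldsymbol{R}_r$ are the camera axes expressed in world coordinates):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Right camera: rows of R_r are its x-, y-, z-axes in world coordinates.
R_r = Rotation.from_euler("xyz", [10, -25, 5], degrees=True).as_matrix()
r_r1 = R_r[0]                      # first row: the camera's x-axis direction

C_r = np.zeros(3)                  # right camera centre (world frame)
baseline = 0.12

# Case 1: left camera offset purely along the right camera's x-axis.
C_l = C_r - baseline * r_r1
t_dir = (C_l - C_r) / np.linalg.norm(C_l - C_r)
print(np.allclose(t_dir, -r_r1))   # True: t = -r_{r,1} holds

# Case 2: additional offset along the optical (z-)axis.
C_l = C_l - 0.05 * R_r[2]
t_dir = (C_l - C_r) / np.linalg.norm(C_l - C_r)
print(np.allclose(t_dir, -r_r1))   # False: the x-axis assumption breaks
```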

I've created the video below as a visual demonstration. Starting with a single camera, I duplicate it and offset the copy along the camera's x-axis (the bright red line through the right camera's optical centre). In this configuration, the left camera does lie on the line defined by $\boldsymbol{t} = -\boldsymbol{r}_{r,1}.$ However, I then additionally offset the left camera along its z-axis, i.e. its optical axis (the blue line). Now the relative translation between the cameras can no longer be described by $\boldsymbol{t} = -\boldsymbol{r}_{r,1}.$

Please let me know if I'm just missing something elementary here, or if this is indeed something that the method currently does not account for.

Thank you again, both for your nice paper and for investigating this issue.

https://github.com/TongjiZhaohb/StereoCalibrator/assets/18167754/8058b233-2ae8-46f7-8ba4-61fc4924a48e

TongjiZhaohb commented 5 months ago

I have provided an explanation for Equation (7) in the image below. Please take a look to see if it resolves your confusion.

[Image: derivation of Equation (7)]

chrisoffner3d commented 5 months ago

Thank you very much. That does indeed help me follow the derivation of equation (7) and $\boldsymbol{t} = - \boldsymbol{r}_{r, 1}.$

I'll have to think about this some more because I still cannot quite reconcile this formal derivation with the geometric intuition I have outlined above.

Thank you again for your helpful response, I greatly appreciate it!

chrisoffner3d commented 5 months ago

https://github.com/TongjiZhaohb/StereoCalibrator/assets/18167754/07d65673-706d-4fae-8a9e-b46aac8ed2dd

Here is another attempt for me to make sense of this. In the video above I perform the following steps:

  1. Show two cameras with arbitrary rotations.
  2. The epipolar lines are not horizontal and parallel as the y-coordinates of the red, green and blue points clearly do not line up between the left and right image.
  3. I "undo" the rotation, which is equivalent to computing $\boldsymbol{R}_l \boldsymbol{p}_l^C$ and $\boldsymbol{R}_r \boldsymbol{p}_r^C.$
  4. Now the y-coordinates of the red, green and blue points clearly do line up between the left and right image, and equation (6) ${\boldsymbol{p}_l^C}' = {\boldsymbol{p}_r^C}' + \boldsymbol{i}_1$ holds.
  5. However, I then offset the left camera along the world z-axis (up) and along the world y-axis (forward).
  6. The equation ${\boldsymbol{p}_l^C}' = {\boldsymbol{p}_r^C}' + \boldsymbol{i}_1$ now no longer holds because ${\boldsymbol{p}_l^C}'$ is not just ${\boldsymbol{p}_r^C}'$ shifted along the x-axis, but shifted along all three axes.

To clarify: this is because ${\boldsymbol{p}_l^C}'$ and ${\boldsymbol{p}_r^C}'$ are expressed in the (rectified) left and right camera coordinate systems, respectively, and the left camera (i.e. the origin of the left rectified camera coordinate system) is now higher up and closer to the points than the right camera (i.e. the origin of the right rectified camera coordinate system).

My thinking is now that equation (6) may be making an assumption that does not always hold in the general case (i.e. for arbitrary camera configurations). While it is valid to treat $\boldsymbol{t}$ as a vector of unit length due to the inherent scale ambiguity, we cannot, in the general case, know its direction, and so simply setting $\boldsymbol{t} = \boldsymbol{i}_1 = [1, 0, 0]^\top$ may misrepresent the direction of $\boldsymbol{t}.$
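
To check steps 3 to 6 numerically, here is a toy example (my own construction; after "undoing" the rotations, both rectified cameras have identity rotation, so the camera coordinates of a world point $\boldsymbol{P}$ reduce to $\boldsymbol{P} - \boldsymbol{C}$):

```python
import numpy as np

P   = np.array([0.3, -0.2, 4.0])   # an arbitrary world point (made-up values)
i1  = np.array([1.0, 0.0, 0.0])
C_r = np.zeros(3)                  # right camera centre

# Pure x-offset with unit baseline: equation (6) holds.
C_l = C_r - i1
print(np.allclose(P - C_l, (P - C_r) + i1))   # True

# Additional offset along the other axes (steps 5-6): equation (6) fails,
# because p_l' is p_r' shifted along all three axes, not just the x-axis.
C_l = C_l + np.array([0.0, 0.3, 0.2])
print(np.allclose(P - C_l, (P - C_r) + i1))   # False
```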

TongjiZhaohb commented 5 months ago

Your point is correct. Our work mainly focuses on the online calibration of stereo cameras, which always consist of a pair of cameras, one on the left and one on the right. Therefore, we made the assumption in Equation (6). In other situations, such as when a single camera moves only along its z-axis and captures images in consecutive frames, our method will not be able to perform the calibration. This is a limitation of our approach, and we appreciate your suggestion!

chrisoffner3d commented 5 months ago

Thank you for the clarification.

> Our work mainly focuses on the online calibration of stereo cameras, which always consist of a pair of cameras, one on the left and one on the right.

I don't fully agree with this as a general statement because "always" is a dangerous word. :) To give just one example, OpenCV's `stereoRectify` function distinguishes between the two cases "Horizontal Stereo" and "Vertical Stereo" (see the screenshot of the linked documentation below). I have encountered several stereo camera rigs where the cameras were offset not only along the x-axis (left/right) but also, to a lesser extent, along the y-axis (up/down).
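
For illustration, a minimal sketch of a `stereoRectify` call (the intrinsics and extrinsics below are placeholder values; as I understand the documentation, OpenCV selects the horizontal or vertical case internally from whether $|T_x|$ or $|T_y|$ dominates):

```python
import numpy as np
import cv2

# Placeholder intrinsics: identical pinhole cameras, no distortion.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist = np.zeros(5)

R = np.eye(3)                                    # relative rotation
T = np.array([-0.12, -0.04, 0.0]).reshape(3, 1)  # mostly x-offset, some y-offset

# OpenCV decides between "Horizontal Stereo" and "Vertical Stereo" from the
# dominant component of T; it does not force T onto the x-axis.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K, dist, K, dist, (640, 480), R, T, flags=cv2.CALIB_ZERO_DISPARITY
)
```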

But I agree that the most common and canonical setup is as you describe: the cameras are offset primarily along the x-axis, and so your assumption is reasonable. If you plan to do a revision of the paper at any point, I would only propose to make this assumption more explicit.

Thank you again for your paper, and for helping clarify my questions here. I appreciate it!

[Screenshot: OpenCV `stereoRectify` documentation describing the "Horizontal Stereo" and "Vertical Stereo" cases]