Closed: sriramsk1999 closed this issue 2 years ago
The steps you wrote here are definitely the right ones, and I think all of these operations are correct and necessary for what we are trying to achieve. Combined, they form a single big operation called inverse mapping. In short, and in our context: for each pixel location in the current time-step (in the ConvLSTM hidden state after warping), we sample the values from the previous hidden state. Hence, we are mapping values from previous to current, but in an inverse fashion. Inverse mapping is an important general concept in image processing: it achieves completeness and correctness at the destination image with the help of well-defined, efficient sampling methods such as bilinear sampling.
The steps above constitute the inverse mapping; it is complete.
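For concreteness, the bilinear-sampling half of this inverse mapping can be sketched as follows. This is a minimal single-channel NumPy illustration; `bilinear_sample` is a hypothetical helper, not the repository's actual code:

```python
import numpy as np

def bilinear_sample(src, xs, ys):
    """Sample a single-channel image src (H, W) at float coordinates
    (xs, ys) with bilinear interpolation. This is the core of inverse
    mapping: each *destination* pixel pulls a value from a (generally
    sub-pixel) *source* location. Coordinates are clamped to the border."""
    h, w = src.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx = np.clip(xs - x0, 0.0, 1.0)
    dy = np.clip(ys - y0, 0.0, 1.0)
    # Interpolate horizontally on the two neighboring rows, then vertically.
    top = src[y0, x0] * (1 - dx) + src[y0, x0 + 1] * dx
    bot = src[y0 + 1, x0] * (1 - dx) + src[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy
```

In the actual model this role is typically played by a differentiable sampler such as `torch.nn.functional.grid_sample`, so gradients flow through the warp.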
Now there is another thing that may be the source of confusion: during testing we do not have $D_t$ at all; we are trying to predict it. So we do a similar operation to the equation above to get $\tilde{D}_t$.
Now, instead of getting $Q_{t-1}$ from $D_t$, we get (estimate) $\tilde{D}_t$ from $D_{t-1}$, by also considering occlusions between 3D points.
This is the preparation step for the inverse sampling. All in all, it may look like we are going back and forth, but it is necessary to stick to the inverse mapping paradigm. Otherwise, we would have to deal with differentiable point cloud rendering: unproject the hidden state at t-1 to a 3D point cloud and render it from the viewpoint at t. This is a highly complex and approximate operation. Keep in mind that we don't want to break the gradient flow through the unrolled states of the ConvLSTM.
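The preparation step (forward-projecting $D_{t-1}$ into the current view with a z-buffer, so that occluded points do not overwrite nearer ones) can be sketched roughly like this. This is my own illustration, assuming a simple pinhole camera with intrinsics `K` and a 4x4 relative pose `T_prev_to_curr`, not the repository's code:

```python
import numpy as np

def estimate_current_depth(depth_prev, K, T_prev_to_curr):
    """Forward-splat depth_prev (H, W) into the current view, keeping only
    the nearest 3D point per pixel (z-buffer) to account for occlusions.
    Returns an estimate of D~_t; pixels hit by no point stay at +inf."""
    h, w = depth_prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Unproject every previous pixel to a 3D point in the previous camera frame.
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    pts_prev = np.linalg.inv(K) @ pix * depth_prev.reshape(1, -1)
    # Move the points into the current camera frame.
    pts_h = np.vstack([pts_prev, np.ones((1, h * w))])
    pts_curr = (T_prev_to_curr @ pts_h)[:3]
    # Project into the current image and round to the nearest pixel.
    proj = K @ pts_curr
    z = proj[2]
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    depth_est = np.full((h, w), np.inf)
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[ok], v[ok], z[ok]):
        if zi < depth_est[vi, ui]:  # z-buffer: keep the closest surface
            depth_est[vi, ui] = zi
    return depth_est
```

Note the rounding and the unfilled (`inf`) pixels: this direction of mapping is exactly what leaves holes and why the final sampling is done in the inverse direction instead.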
Note: I am not sure what you mean by the following, "This does not seem correct to me. If we are using the current depth D_t, then there isn't any need for transforming the point cloud and we can directly sample the hidden state."
Hope it's a bit clearer.
I think my confusion stemmed from a misunderstanding of inverse mapping; I didn't realize projecting $D_t$ to $D_{t-1}$ was necessary before sampling. The explanation cleared it up. Thank you!
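Putting the two halves together, the inverse warp of the hidden state can be sketched end to end: for every pixel of the current frame, unproject it with the (estimated) current depth, transform it into the previous camera, project, and bilinearly sample the previous hidden state at the resulting sub-pixel location. A minimal single-channel NumPy/SciPy sketch under a pinhole model; `warp_hidden_state`, `T_curr_to_prev`, and the use of `map_coordinates` are illustrative assumptions, not the repository's code:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_hidden_state(hidden_prev, depth_curr, K, T_curr_to_prev):
    """Inverse-map one channel of the previous hidden state (H, W) into the
    current frame. Every destination pixel gets a value, so there are no
    holes, unlike forward splatting."""
    h, w = hidden_prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    # Unproject current pixels to 3D using the current depth estimate.
    pts = np.linalg.inv(K) @ pix * depth_curr.reshape(1, -1)
    # Transform into the previous camera frame.
    pts = (T_curr_to_prev @ np.vstack([pts, np.ones((1, h * w))]))[:3]
    # Project into the previous image plane: sub-pixel sampling locations.
    proj = K @ pts
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    coords = np.stack([v.reshape(h, w), u.reshape(h, w)])  # (row, col) order
    # Bilinear sampling (order=1) keeps the warp differentiable in spirit;
    # the real model would use torch grid_sample to preserve gradient flow.
    return map_coordinates(hidden_prev, coords, order=1, mode="nearest")
```

With an identity pose and identity intrinsics the warp reduces to the identity, which is a handy sanity check.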
Hello, thanks for releasing your work. I had a question regarding the implementation of the depth warping:
As I understand it, this is the current flow of the code during training, with $D_t$ as the current estimate. As mentioned in the paper, this is done for stabilization of training.

This does not seem correct to me. If we are using the current depth $D_t$, then there isn't any need for transforming the point cloud, and we can directly sample the hidden state. Additionally, why transform the current depth by using a transformation of previous to current?

If we used $D_{t-1}$ as the depth estimate, i.e. `depth_estimation = depths_cuda[measurement_index]` instead of `depth_estimation = depths_cuda[reference_index]` over here, then it would make sense.

It would be great if you could shed some light on this and correct me if I've gotten anything wrong!