Closed: sriramsk1999 closed this issue 2 years ago
The steps you wrote here are definitely the right ones, and I think all of these operations are correct and necessary for what we are trying to achieve. Combined, they form a single big operation called inverse mapping. In short, and in our context: for each pixel location in the current time-step (in the ConvLSTM hidden state after warping), we sample the values from the previous hidden state. Hence, we are mapping values from previous to current, but in an inverse fashion. Inverse mapping is an important general concept in image processing: it achieves completeness and correctness at the destination image with the help of well-defined, efficient sampling methods such as bilinear sampling.
The steps above constitute the inverse mapping; it is complete.
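For concreteness, the bilinear-sampling half of this inverse mapping can be sketched as follows. This is a minimal single-channel NumPy illustration; `bilinear_sample` is a hypothetical helper, not the repository's actual code:

```python
import numpy as np

def bilinear_sample(src, xs, ys):
    """Sample a single-channel image src (H, W) at float coordinates
    (xs, ys) with bilinear interpolation. This is the core of inverse
    mapping: each *destination* pixel pulls a value from a (generally
    sub-pixel) *source* location. Coordinates are clamped to the border."""
    h, w = src.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx = np.clip(xs - x0, 0.0, 1.0)
    dy = np.clip(ys - y0, 0.0, 1.0)
    # Interpolate horizontally on the two neighboring rows, then vertically.
    top = src[y0, x0] * (1 - dx) + src[y0, x0 + 1] * dx
    bot = src[y0 + 1, x0] * (1 - dx) + src[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy
```

In the actual model this role is typically played by a differentiable sampler such as `torch.nn.functional.grid_sample`, so gradients flow through the warp.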
Now there is another thing that may be the source of confusion: during testing we do not have $D_t$ at all; we are trying to predict it. So we do a similar operation to the equation above to get $\tilde{D}_t$.
Now, instead of getting $Q_{t-1}$ from $D_t$, we get (estimate) $\tilde{D}_t$ from $D_{t-1}$, by also considering occlusions between 3D points.
This is the preparation step for the inverse sampling. All in all, it may look like we are going back and forth, but it is necessary to stick to the inverse mapping paradigm. Otherwise, we would have to deal with differentiable point cloud rendering: unproject the hidden state at t-1 to a 3D point cloud and render it from the viewpoint at t. This is a highly complex and approximate operation. Keep in mind that we don't want to break the gradient flow through the unrolled states of the ConvLSTM.
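The preparation step (forward-projecting $D_{t-1}$ into the current view with a z-buffer, so that occluded points do not overwrite nearer ones) can be sketched roughly like this. This is my own illustration, assuming a simple pinhole camera with intrinsics `K` and a 4x4 relative pose `T_prev_to_curr`, not the repository's code:

```python
import numpy as np

def estimate_current_depth(depth_prev, K, T_prev_to_curr):
    """Forward-splat depth_prev (H, W) into the current view, keeping only
    the nearest 3D point per pixel (z-buffer) to account for occlusions.
    Returns an estimate of D~_t; pixels hit by no point stay at +inf."""
    h, w = depth_prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Unproject every previous pixel to a 3D point in the previous camera frame.
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    pts_prev = np.linalg.inv(K) @ pix * depth_prev.reshape(1, -1)
    # Move the points into the current camera frame.
    pts_h = np.vstack([pts_prev, np.ones((1, h * w))])
    pts_curr = (T_prev_to_curr @ pts_h)[:3]
    # Project into the current image and round to the nearest pixel.
    proj = K @ pts_curr
    z = proj[2]
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    depth_est = np.full((h, w), np.inf)
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[ok], v[ok], z[ok]):
        if zi < depth_est[vi, ui]:  # z-buffer: keep the closest surface
            depth_est[vi, ui] = zi
    return depth_est
```

Note the rounding and the unfilled (`inf`) pixels: this direction of mapping is exactly what leaves holes and why the final sampling is done in the inverse direction instead.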
Note: I am not sure what you mean by the following, "This does not seem correct to me. If we are using the current depth D_t, then there isn't any need for transforming the point cloud and we can directly sample the hidden state."
Hope it's a bit clearer.
I think my confusion stemmed from a misunderstanding of inverse mapping; I didn't realize projecting $D_t$ to $D_{t-1}$ was necessary before sampling. The explanation cleared it up. Thank you!
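Putting the two halves together, the inverse warp of the hidden state can be sketched end to end: for every pixel of the current frame, unproject it with the (estimated) current depth, transform it into the previous camera, project, and bilinearly sample the previous hidden state at the resulting sub-pixel location. A minimal single-channel NumPy/SciPy sketch under a pinhole model; `warp_hidden_state`, `T_curr_to_prev`, and the use of `map_coordinates` are illustrative assumptions, not the repository's code:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_hidden_state(hidden_prev, depth_curr, K, T_curr_to_prev):
    """Inverse-map one channel of the previous hidden state (H, W) into the
    current frame. Every destination pixel gets a value, so there are no
    holes, unlike forward splatting."""
    h, w = hidden_prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    # Unproject current pixels to 3D using the current depth estimate.
    pts = np.linalg.inv(K) @ pix * depth_curr.reshape(1, -1)
    # Transform into the previous camera frame.
    pts = (T_curr_to_prev @ np.vstack([pts, np.ones((1, h * w))]))[:3]
    # Project into the previous image plane: sub-pixel sampling locations.
    proj = K @ pts
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    coords = np.stack([v.reshape(h, w), u.reshape(h, w)])  # (row, col) order
    # Bilinear sampling (order=1) keeps the warp differentiable in spirit;
    # the real model would use torch grid_sample to preserve gradient flow.
    return map_coordinates(hidden_prev, coords, order=1, mode="nearest")
```

With an identity pose and identity intrinsics the warp reduces to the identity, which is a handy sanity check.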
Hello, thanks for releasing your work. I had a question regarding the implementation of the depth warping:
As I understand it, this is the current flow of the code during training, with $D_t$ as the current estimate. As mentioned in the paper, this is done for stabilization of training.

This does not seem correct to me. If we are using the current depth $D_t$, then there isn't any need for transforming the point cloud, and we can directly sample the hidden state. Additionally, why transform the current depth by using a transformation of previous to current?

If we used $D_{t-1}$ as the depth estimate, i.e. `depth_estimation = depths_cuda[measurement_index]` instead of `depth_estimation = depths_cuda[reference_index]` over here, then it would make sense.

It would be great if you could shed some light on this and correct me if I've gotten anything wrong!