Possibility to train the motion network with only 2 images

YihongSun / Dynamo-Depth

[NeurIPS 2023] Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes

https://dynamo-depth.github.io

MIT License

68 stars 5 forks source link

Possibility to train the motion network with only 2 images #8

Closed dung-ng179 closed 2 months ago

dung-ng179 commented 4 months ago

Hello, Thank you for this awesome work. I see that you use 3 images as input to train your optical flow and motion mask networks. Is it possible to train these networks with only 2 images (previous frame + current frame) and (next frame + current frame) as input. If it is possible, can you give me the change in your code to do it? Thank you so much!

YihongSun commented 4 months ago

Thank you for your interest in our work!

Training with two frames can be achieved by making the following changes to model.py:

Modify num_input_images=2 for self.motion_enc, self.motion_dec, and self.motion_mask .
Set f_gap to be bi-directional by removing abs().
Compose motion_input using 2 frames and update the final output for both directions independently.

However, this may be suboptimal compared to the proposed three-frame method:

By predicting once for both directions, we explicitly force the forward and backward predictions to be the same up to a sign switch, which regularize training.
The model naturally has more information from three frames to better learn and predict independent motion.

huydung179 commented 4 months ago

Thank you for your answer!

For the flow output, we don't inverse the flow (line 140-141 of the model file), right?

outputs.update({(k[0], -1, k[1]) :  1 * v for k,v in motion_out_-1.items()})
outputs.update({(k[0], 1, k[1]) :  1 * v for k,v in motion_out_1.items()})

Yeah I see the advantage of using 3 frames for predicting the forward and backward flow. However, in my project, I have only 2 frames at test time :((

I have other questions. Have you tried using 2 frames for motion networks or even using a shared encoder for pose, flow and mask networks? I love to have your opinion on this.

Thank you very much!

YihongSun commented 3 months ago

Glad to help!

For the flow output, we don't inverse the flow (line 140-141 of the model file), right?

Correct. If the target frame is always placed last, then with a similar logic as computing camTcam in L95, the computation would be the same for both fwd and bwd direction.

Have you tried using 2 frames for motion networks

During development, we tried with two frames first, but found that due to the min operation here, the motion network just need to compute the independent motion correctly for one direction to satisfy the reconstruction optimization. Therefore, to enforce consistency, we choose to explicitly predict one flow for both direction (i.e. 3 frame approach), but I could imagine an auxiliary regularization on fwd/bwd consistency would be sufficient as well.

using a shared encoder for pose, flow and mask networks?

That may be possible, but I am not sure how well the single encoder can work for all three tasks.

Predicting independent motion (flow+mask) would rely on dynamical regions, but predicting ego motion mostly rely on the static regions.