YihongSun / Dynamo-Depth

[NeurIPS 2023] Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes
https://dynamo-depth.github.io
MIT License

Maybe a bug in warping flow calculation #9

Closed · RickyYXY closed this 4 months ago

RickyYXY commented 4 months ago

In Trainer.py, lines 265-267, your code seems to use independ_flow and camera/depth to get the sample points. According to the code in lines 250-253, the logic should be: independ_flow = complete_flow - ego_flow.

So in lines 265-267, the right code should be:

cam_points = Trans(cam_points) # adding ego flow
sample = proj(cam_points + independ_flow)

Your current code's logic is:

cam_points = cam_points + independ_flow # the independ_flow should not be transform
sample = proj(Trans(cam_points))

The order seems wrong, and it actually causes wrong transform results. Would you mind checking this part? Thanks!

YihongSun commented 4 months ago

Thank you for your interest in our work!

By compressing the two code snippets, we get the following two lines.

sample = proj(Trans(cam_points) + independ_flow)

versus

sample = proj(Trans(cam_points + independ_flow))

Essentially, we would like to apply two types of motions to each of the cam_points: one is ego-motion (represented via T) and one is independent motion (represented via independ_flow). Thus, the two lines above represent the two possible orders.

Since we compute one independent flow from three frames (with frame_index=0 in the middle and opposite signs for opposite directions), the prediction needs to be accurate in both the forward and reverse directions.

In other words, imagine a crossing object with a certain speed. If we apply the object's motion after changing the frame of reference, the apparent speed would be smaller if we are closer and larger if we are farther away. This is precisely what is handled by Trans(independ_motion) in proj(Trans(cam_points + independ_flow)).
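
For concreteness, here is a minimal numeric sketch of how the two orders differ (not the repository's code; Trans, proj, and all values are purely illustrative, assuming a rigid transform and a simple pinhole projection):

import numpy as np

R = np.array([[0.0, -1.0, 0.0],             # a 90-degree rotation about the z-axis
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 0.0])

Trans = lambda p: R @ p + t                 # hypothetical ego-motion (rigid transform)
proj  = lambda p: p[:2] / p[2]              # hypothetical pinhole projection (focal length 1)

cam_point     = np.array([1.0, 0.0, 5.0])   # a 3D point in the source camera frame
independ_flow = np.array([0.2, 0.0, 0.0])   # independent (object) motion in that frame

a = proj(Trans(cam_point) + independ_flow)  # add the flow after the frame change
b = proj(Trans(cam_point + independ_flow))  # add the flow before the frame change
# b equals proj(Trans(cam_point) + R @ independ_flow): the independent motion is
# rotated into the target frame of reference, so a and b differ whenever R != I
print(a, b)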

Hope this helps and feel free to reply if you have any other questions or concerns!

RickyYXY commented 4 months ago

Thanks for your reply, I get the point. But the more important thing is that when you predict complete_flow and calculate independ_flow, the assumption is the former one. I think it is odd to use the former setting when computing independ_flow but the latter setting for the loss calculation.

However the process goes, the final sample should be the same as complete_flow projected onto the image, right? The current code does not seem to fit this assumption, as independ_flow is additionally transformed by T.

RickyYXY commented 4 months ago

I may have found a better explanation: when first calculating independ_flow = complete_flow - ego_flow, all variables live in the camera_i to camera_j transform (different from camera to world or world to camera), and this transform attribute should stay the same. But under the latter assumption, the transform attribute of independ_flow is changed into camera to world. The core meaning of the variable is different.

YihongSun commented 4 months ago

Thanks for clarifying! Let me see if I am understanding your point correctly:

Assuming the definition of complete_flow to be Trans(cam_points + independ_motion) = cam_points + complete_flow

which means

Trans(cam_points) + Trans(independ_motion) = cam_points + complete_flow
cam_points + ego_motion + Trans(independ_motion) = cam_points + complete_flow
independ_motion = Trans^-1(complete_flow-ego_motion)

In contrast, assuming the definition of complete_flow to be Trans(cam_points) + independ_motion = cam_points + complete_flow then

cam_points + ego_motion + independ_motion = cam_points + complete_flow
independ_motion = complete_flow - ego_motion

Is this along the lines of what you mean?

RickyYXY commented 4 months ago

Thanks for the reply! Yes, that is the point. According to lines 252-253 in Trainer.py, independ_motion is calculated based on the definition Trans(cam_points) + independ_motion = cam_points + complete_flow, which gives complete_flow - ego_flow = complete_flow - (Trans(cam_points) - cam_points) = residual_flow (which becomes independ_motion/independ_flow after masking).

But in lines 265-267 it is based on the other definition: Trans(cam_points + independ_motion) = cam_points + complete_flow.
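
To make the mismatch concrete, here is a minimal numeric sketch (purely illustrative values, not the repository's code): deriving independ_flow under the first definition and then warping under the second one no longer lands on cam_points + complete_flow unless the rotation is the identity.

import numpy as np

R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 0.0])
Trans = lambda p: R @ p + t                       # hypothetical ego-motion

cam_points    = np.array([1.0, 0.0, 5.0])
complete_flow = np.array([0.3, 0.9, 0.1])         # hypothetical total scene flow

ego_flow      = Trans(cam_points) - cam_points    # as in the first definition
independ_flow = complete_flow - ego_flow          # so that Trans(p) + f = p + complete_flow

lhs = Trans(cam_points + independ_flow)           # warp under the second definition
rhs = cam_points + complete_flow                  # the target the warp should reach
print(np.allclose(lhs, rhs))                      # False unless R is the identity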

YihongSun commented 4 months ago

Got it. I need to think about this a bit further to make sure, but right now, I think you are right and thanks for the catch!

Actually, if we consider the network output to be approximating Trans^-1(complete_flow-ego_motion) + ego_motion instead of complete_flow directly, then the E2E fine-tuning stage of the code should work as intended (shown below).

output = Trans^-1(complete_flow-ego_motion) + ego_motion
residual_motion = output - ego_motion
residual_motion = Trans^-1(complete_flow-ego_motion)

and thus

Trans(cam_points + residual_motion) = cam_points + complete_flow
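
A minimal numeric check of this reinterpretation (illustrative values only; it also assumes that Trans^-1 applied to a displacement means applying the inverse rotation only):

import numpy as np

R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 0.0])
Trans = lambda p: R @ p + t                       # hypothetical ego-motion

cam_points    = np.array([1.0, 0.0, 5.0])
complete_flow = np.array([0.3, 0.9, 0.1])
ego_motion    = Trans(cam_points) - cam_points

# assume Trans^-1 on a displacement means the inverse rotation R^-1 only
output          = np.linalg.inv(R) @ (complete_flow - ego_motion) + ego_motion
residual_motion = output - ego_motion             # = Trans^-1(complete_flow - ego_motion)

lhs = Trans(cam_points + residual_motion)
rhs = cam_points + complete_flow
print(np.allclose(lhs, rhs))                      # True: the warp reaches complete_flow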

RickyYXY commented 4 months ago

Happy to have this discussion! In my opinion, if you directly consider Trans^-1(complete_flow - ego_motion) + ego_motion instead of complete_flow, then the output is actually not the complete_flow we defined before. In that case, the proj part in lines 270-271 needs to be changed too, as its projection is based on the old definition. Directly using it will give wrong projection results.

YihongSun commented 4 months ago

Yep, I agree, and thanks for the discussion!

BTW, L270-271 is intended as a quick warm-up; actual convergence is reached during mask learning (5 epochs) and E2E fine-tuning (20 epochs).

RickyYXY commented 4 months ago

Got it!