Closed: RickyYXY closed this issue 4 months ago.
Thank you for your interest in our work!
By compressing the two code snippets, we get the following two lines:
`sample = proj(Trans(cam_points) + independ_flow)`
versus
`sample = proj(Trans(cam_points + independ_flow))`
Essentially, we would like to apply two types of motion to each of the `cam_points`: one is ego-motion (represented via `T`) and one is independent motion (represented via `independ_flow`). Thus, the two lines above represent the two orders:

1. Transform `cam_points` via ego-motion, and then apply independent motion in the transformed frame of reference.
2. Transform `cam_points` via independent motion in the original frame of reference, and then apply the ego-motion transformation.

Since we compute one independent flow from three frames (with `frame_index=0` in the middle and opposite signs for opposite directions), the prediction needs to be accurate for both the forward and reverse directions. Under order 1, `independ_flow` has to work with multiple frames of reference due to the different `T`'s for the forward and reverse directions. Under order 2, `independ_flow` just needs to work well in the original frame of reference.

In other words, imagine a crossing object with a certain speed. If we apply the object's motion after changing the frame of reference, the speed would be smaller if we are closer and larger if we are farther away. This is precisely what is handled by `Trans(independ_motion)` in `proj(Trans(cam_points + independ_flow))`.
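To make the difference concrete, here is a tiny standalone sketch, not the repo's code: the rotation `R`, translation `t`, and the toy pinhole `proj` below are made-up values, only meant to show that the two orders give different samples once the ego-motion contains rotation.

```python
import numpy as np

# Toy ego-motion T: a 10-degree rotation about the y-axis plus a translation.
theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.5])

def Trans(p):
    # Apply the rigid ego-motion to a 3D point.
    return R @ p + t

def proj(p, focal=500.0):
    # Toy pinhole projection of a 3D point onto the image plane.
    return focal * p[:2] / p[2]

cam_point = np.array([1.0, 0.5, 10.0])      # a back-projected pixel
independ_flow = np.array([0.3, 0.0, 0.0])   # a crossing object's own motion

sample_order1 = proj(Trans(cam_point) + independ_flow)   # ego-motion first
sample_order2 = proj(Trans(cam_point + independ_flow))   # independent motion first

# Trans(p) + f and Trans(p + f) differ by (I - R) @ f, so the samples differ when R != I.
print(sample_order1, sample_order2)
```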
Hope this helps and feel free to reply if you have any other questions or concerns!
Thanks for your reply, I get your point.
But a more important thing is that when you predict `complete_flow` and calculate `independ_flow`, the assumption is the former one. I think it is odd to use the former setting when computing `independ_flow` but the latter setting for the loss calculation.
Regardless of how the process goes, the final sample should be the same as `complete_flow` projected onto the image, right? The current code does not seem to fit this assumption, as `independ_flow` is additionally transformed by `T`.
I may have found a better explanation:
When first calculating `independ_flow = complete_flow - ego_flow`, all variables live in the camera_i-to-camera_j transform (different from camera-to-world or world-to-camera), and this transform attribute should stay the same.
But under the latter assumption, the transform attribute of `independ_flow` is changed into camera-to-world. The core meaning of the variable becomes different.
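If it helps, the point can be checked numerically. The following is only a sketch with made-up `R`, `t`, and flow values (none of it comes from Trainer.py); it shows that a residual defined as `complete_flow - ego_flow` composes correctly only when added after the ego-motion transform.

```python
import numpy as np

# Toy camera_i -> camera_j ego-motion with rotation and translation.
theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.5])
Trans = lambda p: R @ p + t

cam_points = np.array([1.0, 0.5, 10.0])
complete_flow = np.array([0.4, -0.1, 0.6])   # assumed total scene flow (made up)

ego_flow = Trans(cam_points) - cam_points    # flow induced by ego-motion alone
independ_flow = complete_flow - ego_flow     # residual, defined in the camera_i -> camera_j sense

# Consistent with that definition: add the residual after the ego-motion transform.
print(np.allclose(Trans(cam_points) + independ_flow, cam_points + complete_flow))   # True

# Transforming the residual again rotates a vector that already lives in the
# camera_i -> camera_j sense, so the result no longer matches complete_flow.
print(np.allclose(Trans(cam_points + independ_flow), cam_points + complete_flow))   # False
```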
Thanks for clarifying! Let me see if I am understanding your point correctly:
Assuming the definition of `complete_flow` to be
`Trans(cam_points + independ_motion) = cam_points + complete_flow`
which means
`Trans(cam_points) + Trans(independ_motion) = cam_points + complete_flow`
`cam_points + ego_motion + Trans(independ_motion) = cam_points + complete_flow`
`independ_motion = Trans^-1(complete_flow - ego_motion)`

In contrast, assuming the definition of `complete_flow` to be
`Trans(cam_points) + independ_motion = cam_points + complete_flow`
then
`cam_points + ego_motion + independ_motion = cam_points + complete_flow`
`independ_motion = complete_flow - ego_motion`
Is this along the lines of what you mean?
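For concreteness, both derivations can be sanity-checked with arbitrary numbers. This is only a rough sketch, and it reads `Trans^-1` acting on a flow vector as the inverse rotation alone, which is an assumption on my part about the notation:

```python
import numpy as np

theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.5])
Trans = lambda p: R @ p + t

cam_points = np.array([1.0, 0.5, 10.0])
complete_flow = np.array([0.4, -0.1, 0.6])
ego_motion = Trans(cam_points) - cam_points

# Definition 1: Trans(cam_points + independ_motion) = cam_points + complete_flow
independ_1 = np.linalg.inv(R) @ (complete_flow - ego_motion)
print(np.allclose(Trans(cam_points + independ_1), cam_points + complete_flow))   # True

# Definition 2: Trans(cam_points) + independ_motion = cam_points + complete_flow
independ_2 = complete_flow - ego_motion
print(np.allclose(Trans(cam_points) + independ_2, cam_points + complete_flow))   # True
```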
Thanks for the reply!
Yes, that is the point. According to lines 252-253 in Trainer.py, `independ_motion` is calculated based on the definition
`Trans(cam_points) + independ_motion = cam_points + complete_flow`
in which
`complete_flow - ego_flow = complete_flow - (Trans(cam_points) - cam_points) = residual_flow` (this can be turned into `independ_motion`/`independ_flow` with a mask).
But lines 265-267 are based on the other definition: `Trans(cam_points + independ_motion) = cam_points + complete_flow`.
Got it. I need to think about this a bit further to make sure, but right now, I think you are right and thanks for the catch!
Actually, if we consider the network output to be approximating `Trans^-1(complete_flow - ego_motion) + ego_motion` instead of `complete_flow` directly, then the E2E fine-tuning stage of the code should work as intended (shown below).
`output = Trans^-1(complete_flow - ego_motion) + ego_motion`
`residual_motion = output - ego_motion`
`residual_motion = Trans^-1(complete_flow - ego_motion)`
and thus
`Trans(cam_points + residual_motion) = cam_points + complete_flow`
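Using the same kind of toy setup as in the sketches above (made-up `R`, `t`, and flows; this only checks the algebra, it is not the training code), the reinterpretation works out numerically:

```python
import numpy as np

theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.5])
Trans = lambda p: R @ p + t

cam_points = np.array([1.0, 0.5, 10.0])
complete_flow = np.array([0.4, -0.1, 0.6])
ego_motion = Trans(cam_points) - cam_points

# Network output read as Trans^-1(complete_flow - ego_motion) + ego_motion,
# where only the rotation part of Trans acts on a flow vector.
output = np.linalg.inv(R) @ (complete_flow - ego_motion) + ego_motion

residual_motion = output - ego_motion
# The E2E stage's sampling then recovers the complete flow exactly.
print(np.allclose(Trans(cam_points + residual_motion), cam_points + complete_flow))   # True
```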
Happy to have this discussion!
In my opinion, if you directly consider `Trans^-1(complete_flow - ego_motion) + ego_motion` instead of `complete_flow`, then the output is actually not the `complete_flow` we defined before. In that case, in lines 270-271 the `proj` part needs to be changed too, since its projection is based on our old definition. Directly using it will give wrong projection results.
Yep, I agree, and thanks for the discussion!
BTW, L270-271 are intended as a quick warm-up; actual convergence is reached in mask learning (5 epochs) and E2E fine-tuning (20 epochs).
Got it!
In Trainer.py, lines 265-267, your code seems to use `independ_flow` and camera/depth to get sample points. According to the code in lines 250-253, the logic should be `independ_flow = complete_flow - ego_flow`.
So in lines 265-267, the right code should be (in the compressed form discussed above):
`sample = proj(Trans(cam_points) + independ_flow)`
Your current code's logic is:
`sample = proj(Trans(cam_points + independ_flow))`
The order seems wrong, and it actually causes wrong transform results. Would you mind checking this part? Thanks!