TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

About the Equation 5 for Full Surround Monodepth from Multiple Cameras #218

Open haoweiz23 opened 2 years ago

haoweiz23 commented 2 years ago

Hi, thank you for your work. I am trying to reproduce your pose consistency loss. This loss constrains the poses predicted for the other cameras to be consistent with the front camera's pose after transformation. However, it is hard to understand how Eq. 5 transforms a pose from one camera's coordinate frame to another's. Could you please provide more explanation or detailed code? Thanks.

hjxwhy commented 2 years ago

This is my implementation; I can't promise it's correct:

```python
def pose_consistency_loss(self, poses, extrinsics):  # signature added to make the snippet self-contained
    """
    Calculate the pose consistency loss.
    :param poses: list of Pose [B, 4, 4], predicted poses, to be transformed into the
        coordinate frame of the canonical (front) camera.
    :param extrinsics: torch.Tensor [B, 4, 4], extrinsics for all cameras, used to
        transform the poses.
    :return: weighted pose consistency loss
    """
    rot_loss = 0
    trans_loss = 0
    extrinsics = extrinsics.to(poses[0].item().dtype)
    # Repeat the canonical (front, index 0) camera extrinsics for all cameras
    canonical_extrinsic = extrinsics[0].repeat([extrinsics.shape[0], 1, 1])  # [B, 4, 4]
    canonical_extrinsic = Pose(canonical_extrinsic)
    extrinsics = Pose(extrinsics)

    for pose in poses:
        # Relative extrinsics between the canonical camera and every camera
        X_i2j = canonical_extrinsic.inverse() @ extrinsics
        # Conjugate each predicted pose so all poses are expressed in a common frame
        X_ba = X_i2j @ pose @ X_i2j.inverse()
        # Penalize deviation of the rotation and translation components from the
        # canonical camera's prediction (batch index 0)
        rot_loss += torch.sum((X_ba.mat2vec()[:, :3] - pose.mat2vec()[0, :3]).pow(2))
        trans_loss += torch.sum((X_ba.mat2vec()[:, 3:] - pose.mat2vec()[0, 3:]).pow(2))
    loss = self.rotation_weight * rot_loss + self.translation_weight * trans_loss
    return loss
```
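For reference, a hedged reading of what the snippet above computes (my notation, not necessarily the paper's): with $E_1$ the canonical (front) camera extrinsics and $E_i$ camera $i$'s extrinsics, the code forms the relative extrinsics $X_{1i} = E_1^{-1} E_i$ and conjugates each predicted pose $X_i$ into a common frame,

$$\tilde{X}_i = X_{1i}\, X_i\, X_{1i}^{-1},$$

then penalizes the squared differences between the rotation and translation components of $\tilde{X}_i$ and those of the canonical camera's predicted pose $X_1$. This is how I read Eq. 5 of the FSM paper, but the ordering of the conjugation depends on the extrinsics convention, so treat it as a sketch rather than a definitive statement.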

haoweiz23 commented 2 years ago

@hjxwhy Thanks a lot. I believe this is right. By the way, have you evaluated your implementation of the spatio-temporal loss in FSM? I cannot achieve the same improvement as Table 3 in the FSM paper (the metrics even decrease). Maybe there is some problem in my implementation.

I implemented the spatial-wise photometric loss as below, following Equation 3 in the FSM paper.

```python
def spatial_wise_pe_loss(self, batch, output, return_logs=False, progress=0.0):
    # Calculate spatial contexts: for each camera, the indices of its two
    # overlapping neighbor cameras (6 cameras, as in DDAD)
    spatial_contexts_indices = np.array([[1, 2], [0, 3], [0, 4], [1, 5], [2, 5], [3, 4]])
    spatial_contexts_rgb = [batch['rgb_original'][spatial_contexts_indices[:, 0]],
                            batch['rgb_original'][spatial_contexts_indices[:, 1]]]
    poses = torch.Tensor(batch['extrinsics']) if isinstance(batch['extrinsics'], list) else batch['extrinsics']
    intrinsics = torch.Tensor(batch['intrinsics']) if isinstance(batch['intrinsics'], list) else batch['intrinsics']
    spatial_context_intrinsics = [intrinsics[spatial_contexts_indices[:, 0]],
                                  intrinsics[spatial_contexts_indices[:, 1]]]
    spatial_context_masks = [batch['mask'][spatial_contexts_indices[:, 0]],
                             batch['mask'][spatial_contexts_indices[:, 1]]]

    source_poses = Pose(poses)
    reference_poses = [Pose(poses[spatial_contexts_indices[:, 0]]),
                       Pose(poses[spatial_contexts_indices[:, 1]])]
    # Fixed relative poses between each camera and its spatial-context cameras,
    # computed from the known extrinsics (no pose network involved)
    relative_poses = [Pose(torch.bmm(reference_poses[0].inverse().item(), source_poses.item())),
                      Pose(torch.bmm(reference_poses[1].inverse().item(), source_poses.item()))]
    # Photometric loss between each camera and its spatial-context views,
    # warped with the predicted inverse depths and the fixed relative poses
    spatial_output = self.self_supervised_loss(
        batch['rgb_original'], spatial_contexts_rgb,
        output['inv_depths'], relative_poses, intrinsics, spatial_context_intrinsics,
        return_logs=return_logs, progress=progress, mask=batch['mask'], ref_mask=spatial_context_masks)
    return spatial_output
```
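For completeness, a hedged sketch of how such a spatial term might be added to the temporal loss already computed by the self-supervised model; `spatial_loss_weight` is an assumed hyperparameter, not a packnet-sfm option:

```python
# Hypothetical combination of the temporal and spatial photometric terms.
# Assumes output['loss'] holds the usual temporal photometric loss and that
# self.spatial_loss_weight is a hyperparameter you add yourself.
spatial_output = self.spatial_wise_pe_loss(batch, output,
                                           return_logs=return_logs, progress=progress)
total_loss = output['loss'] + self.spatial_loss_weight * spatial_output['loss']
```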
VitorGuizilini-TRI commented 2 years ago

The implementation looks alright to me. Some things that have helped other people achieve similar results:

hurjunhwa commented 2 years ago

Hi, what do you mean by focal length scaling? Would you mind providing more details about it? Instead of training the depth decoder to handle different intrinsics, is it about using a constant to rescale the depth values for the front view?

Thank you!

haoweiz23 commented 2 years ago

@VitorGuizilini-TRI Thanks a lot! Your suggestion is very helpful. I tried focal length scaling and it works. I am now trying to start from a model pretrained without the spatio-temporal constraints. I don't quite understand your second suggestion, though: why does a larger value for the minimum depth help? Is it because a larger depth produces more overlapping area when performing the projective transformation between different cameras? If so, do you have a recommended minimum depth? Thank you again for your timely suggestions.

@hurjunhwa Hi, I implement focal length scaling by scaling the output depth by a constant, namely the focal length taken from the input intrinsics. Because I do not have the physical camera parameters (e.g., dx and dy), I simply take f_x from the intrinsics as the focal length to scale the depth. I tried this trick on DDAD and it works. Hope this is helpful.
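To make this concrete, a minimal sketch of the kind of focal-length scaling described above; the helper name and shapes are assumptions (per-camera intrinsics [B, 3, 3], depth [B, 1, H, W]), and whether to multiply or divide by the focal length is exactly what is debated in the following comments:

```python
import torch

def scale_depth_by_focal_length(depth, intrinsics, reference_focal=1.0):
    """Hypothetical helper: rescale each camera's depth map by its focal length.

    depth:           [B, 1, H, W] predicted depth (not inverse depth)
    intrinsics:      [B, 3, 3] per-camera intrinsics; f_x is taken as the focal length
    reference_focal: constant used to keep the scaled depths in a reasonable range
    """
    fx = intrinsics[:, 0, 0].view(-1, 1, 1, 1)  # [B, 1, 1, 1]
    # Multiplying by f_x follows "scale the output depth by the focal length";
    # dividing instead is the alternative raised below.
    return depth * fx / reference_focal
```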

hjxwhy commented 2 years ago

@LionRoarRoar My STC implementation is the same as yours, but the result also degrades. You have tried scaling the depth by the focal length; does that mean each camera's output is multiplied by its focal length, or divided by it? By the way, in my tests, the input images with self-occlusion make the RMSE larger than training with the front camera only. Have you faced this problem?

haoweiz23 commented 2 years ago

I scale each camera's output with its corresponding focal length. In my experiments, all the other cameras get worse results than the front camera. Only the RMSE being larger than the front camera seems unreasonable? Maybe you have a wrong normalization layer as the last output layer.

hjxwhy commented 2 years ago

@LionRoarRoar Thanks for your reply. I ran an experiment training the front camera and CAMERA_8 separately; CAMERA_8 is worse than the front camera on all metrics, so I guess it is caused by the self-occlusion in CAMERA_8's images. But I'm not sure, because the paper does not seem to have this problem. Do you plan to do this experiment? Sorry to ask again: does scaling the depth mean multiplying the inverse depth by the focal length?

haoweiz23 commented 2 years ago

@hjxwhy A1: Maybe your hypothesis is right. I noticed that the self-occlusion shifts slightly across frames, which means it is hard to pre-define an accurate self-occlusion mask. Images from the front camera are clean and without occlusion, so it should get better results than the other cameras.

A2: You should scale the depth map instead of the inverse depth.
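As a side note on the self-occlusion masks discussed above, here is a minimal sketch of how such a pre-defined mask is usually applied when averaging the photometric error; the helper name and shapes are assumptions, not the packnet-sfm API:

```python
import torch

def masked_photometric_mean(photometric_error, self_occlusion_mask):
    """Hypothetical helper: average the photometric error over non-occluded pixels only.

    photometric_error:   [B, 1, H, W] per-pixel photometric error
    self_occlusion_mask: [B, 1, H, W] binary mask, 1 = valid pixel, 0 = ego-vehicle occlusion
    """
    valid = self_occlusion_mask.float()
    return (photometric_error * valid).sum() / valid.sum().clamp(min=1.0)
```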

hjxwhy commented 2 years ago

@LionRoarRoar Thanks, I will try again. If I get new results I will share them with you here. Best wishes!

haoweiz23 commented 2 years ago

Updates:

1. I tried the spatial-wise constraint starting from a model pretrained without the spatio-temporal constraints. It is indeed better than without pretraining, but still worse than the baseline model. Besides, I am afraid this trick means the spatial-wise constraint cannot be compared fairly with the baseline.

2. I also tried the spatial-wise loss with a larger min_depth, starting from a model pretrained without the spatio-temporal constraints, and the performance drops.

abing222 commented 2 years ago

Regarding the line `rot_loss += torch.sum((X_ba.mat2vec()[:, :3] - pose.mat2vec()[0, :3]).pow(2))`: I think the pose here should be supervised by the camera-1 (front camera) pose.

abing222 commented 2 years ago

> Updates: 1. I tried the spatial-wise constraint starting from a model pretrained without the spatio-temporal constraints... 2. I also tried the spatial-wise loss with a larger min_depth... the performance drops.

Have you reached the accuracy reported in the paper? I can't reproduce it.

haoweiz23 commented 2 years ago

@abing222 No. Only the self-occlusion mask works. STC and the pose consistency loss do not work.

abing222 commented 2 years ago

> @abing222 No. Only the self-occlusion mask works. STC and the pose consistency loss do not work.

In my experiment, with only the self-occlusion mask, the Abs Rel did not decrease as much as in the paper.

abing222 commented 2 years ago

> @abing222 No. Only the self-occlusion mask works. STC and the pose consistency loss do not work.

At present, I can obtain the absolute scale through the spatial loss, though the accuracy decreases slightly. After adding STC, the accuracy increases a little.

haoweiz23 commented 2 years ago


> At present, I can obtain the absolute scale through the spatial loss, though the accuracy decreases slightly. After adding STC, the accuracy increases a little.

You mean the spatial-wise constraints do not work but STC works? That is interesting. Could you please provide more implementation details about your STC, such as the loss weight and how you warp the spatio-temporal images?

abing222 commented 2 years ago


> You mean the spatial-wise constraints do not work but STC works? That is interesting. Could you please provide more implementation details about your STC, such as the loss weight and how you warp the spatio-temporal images?

The spatial-wise constraint is useful: it provides the absolute scale, but the accuracy decreased. I changed the code on the basis of the monodepth2 repo, without using the packnet repo.

weiyithu commented 2 years ago


> At present, I can obtain the absolute scale through the spatial loss, though the accuracy decreases slightly. After adding STC, the accuracy increases a little.

I also cannot obtain the absolute scale with the spatial photometric loss. Do you use any pretrained model, or change the min_depth parameter in the monodepth2 repo?

haoweiz23 commented 2 years ago


> I also cannot obtain the absolute scale with the spatial photometric loss. Do you use any pretrained model, or change the min_depth parameter in the monodepth2 repo?

Hi, weiyi. I am also trying to implement this work. Maybe we can add each other on WeChat for discussion. My WeChat: zhuhaow_