autonomousvision / monosdf

[NeurIPS'22] MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction
MIT License

depth loss #22

Closed LiXinghui-666 closed 1 year ago

LiXinghui-666 commented 1 year ago

> Hi, the monocular depth output is small (in the range of 0.01 to 0.04), so the loss would be very small; we simply scale it to some larger value.

I have a related question about this scaling. Since a scale and shift are already computed when the depth loss is calculated, what is the purpose of the following lines in the code?

# we should use unnormalized ray direction for depth
ray_dirs_tmp, _ = rend_util.get_camera_params(uv, torch.eye(4).to(pose.device)[None], intrinsics)
depth_scale = ray_dirs_tmp[0, :, 2:]
depth_values = depth_scale * depth_values

Originally posted by @LiXinghui-666 in https://github.com/autonomousvision/monosdf/issues/18#issuecomment-1275725230

niujinshuchong commented 1 year ago

Hi, it's used to convert distance to depth: the ray directions are normalized, so z_vals is the Euclidean distance along the ray, not the z-depth.

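To make the conversion concrete, here is a minimal illustrative sketch (not the repository's code) of why multiplying the Euclidean distance along a normalized camera-frame ray by the ray direction's z-component gives the z-depth:

```python
import torch

# A point at Euclidean distance t along a unit-length camera-frame ray direction d
# has z-depth t * d[2], because d[2] = cos(angle between the ray and the optical axis).
d = torch.tensor([0.3, 0.1, 1.0])
d = d / d.norm()            # normalized ray direction (as used when sampling z_vals)
t = torch.tensor(2.0)       # Euclidean distance along the ray (a z_vals sample)
point_cam = t * d           # 3D point in camera coordinates
depth = d[2] * t            # this is what depth_scale * depth_values computes above
assert torch.isclose(depth, point_cam[2])
```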

Wjt-shift commented 1 year ago

I also have some problems with the depth loss. As you said, the monocular depth output is in the range 0.01 to 0.04; should I normalize the GT depth to this range? I used nice_slam_apartment_to_monosdf.py to get depth and RGB for the Apartment dataset, and I want to use only depth and RGB (no normals), but I found that the GT depth is not in the 0.01-0.04 range, so the depth loss is large and can't converge.

niujinshuchong commented 1 year ago

@Wjt-shift If you use gt-depth, you don't need to use scale-invariant loss as it is designed to handle scale ambiguity in monocular depth. You could simply use L1 or L2 loss for the gt-depth but you need to scale the gt-depth accordingly if you normalize the camera poses.
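For illustration, a masked L1 loss on GT depth could look like the following minimal sketch (not code from this repository; names and shapes are assumptions):

```python
import torch

def gt_depth_l1_loss(depth_pred, depth_gt, mask):
    """L1 loss between rendered and ground-truth depth, ignoring invalid pixels.

    `mask` is a boolean tensor marking pixels with valid GT depth.
    """
    valid = mask & (depth_gt > 0)  # also drop pixels with missing depth readings
    return (depth_pred[valid] - depth_gt[valid]).abs().mean()
```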

Wjt-shift commented 1 year ago

> @Wjt-shift If you use gt-depth, you don't need to use scale-invariant loss as it is designed to handle scale ambiguity in monocular depth. You could simply use L1 or L2 loss for the gt-depth but you need to scale the gt-depth accordingly if you normalize the camera poses.

Thanks for your reply! I also have questions about your code. In nice_slam_apartment_to_monosdf.py and scannet_to_monosdf.py, why is the gt_depth divided by 1000? Is 1000 a scale factor? And how did you produce your Replica dataset format for MonoSDF? I found that in your Replica dataset the depth is in the range 0.01 to 0.04.

niujinshuchong commented 1 year ago

@Wjt-shift gt_depth / 1000 because that is how the GT depths are stored (in millimeters). The Replica dataset is converted from NICE-SLAM, and we run Omnidata to get monocular depth and normals; the Omnidata depth output is in the range of 0.01 to 0.04.
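For example, a sketch of that step (the filename is illustrative; the assumption is that the PNGs store 16-bit depth in millimeters, as in ScanNet/NICE-SLAM exports):

```python
import cv2

depth_mm = cv2.imread("depth/000000.png", cv2.IMREAD_UNCHANGED)  # uint16, millimeters
depth_m = depth_mm.astype("float32") / 1000.0                    # meters
```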

Wjt-shift commented 1 year ago

> @Wjt-shift If you use gt-depth, you don't need to use scale-invariant loss as it is designed to handle scale ambiguity in monocular depth. You could simply use L1 or L2 loss for the gt-depth but you need to scale the gt-depth accordingly if you normalize the camera poses.

I am sorry to bother you again. I use gt_depth with an L1 loss, but the depth L1 loss doesn't converge. As you said, I scaled the gt_depth: for the Replica dataset I took the scale from the camera.json file and divided the gt_depth by it, which puts the gt_depth in the range 0 to 6, but the network output is not in this range, so the depth L1 loss can't converge. I also compared training with only the RGB loss against RGB + depth L1 loss, and RGB-only gives a better result. What causes this problem, and by what factor should I scale my depth? (Screenshots: the depth L1 loss during training; the loss curves comparing RGB-only, in green, against RGB + depth L1; and the corresponding PSNR comparison.)

niujinshuchong commented 1 year ago

@Wjt-shift The GT depth in Replica is in the range [0, 6], so after your normalization it should be roughly in [0, 2], because the scene is normalized to [-1, 1]. Maybe you should multiply the depth by the scale instead of dividing by it?
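To make the numbers concrete (a rough illustration using the ranges mentioned above): if the GT depth spans roughly [0, 6] meters and the scene is normalized into the [-1, 1] cube, the normalization factor is on the order of 6 m / 2 units = 3, so the metric depth divided by that factor lands roughly in [0, 2].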

Wjt-shift commented 1 year ago

> @Wjt-shift The GT depth in Replica is in the range [0, 6], so after your normalization it should be roughly in [0, 2], because the scene is normalized to [-1, 1]. Maybe you should multiply the depth by the scale instead of dividing by it?

Thanks for your quick reply! When loading the gt_depth from the depth image I now divide by 6553.5 (the scale in camera.json) instead of by 1000, and use that as input. Should I normalize both the input gt_depth and the output depth to [0, 1] with (depth - depth.min) / (depth.max - depth.min)? (Screenshot: camera.json.) I tried normalizing both the input and output depth to [0, 1], but I can't reproduce the good result of your Replica dataset (the blue curve is the result on your Replica dataset).

niujinshuchong commented 1 year ago

Hi, after you divide the depth by 6553.5 you get the depth value in meters. But we further normalize the scene, where the normalization factor is computed from the GT mesh and saved as scale_mat in cameras.npz, so you also need to divide by this scale. For example:

import cv2
import numpy as np
depth_gt = cv2.imread("depth.png", -1) / 6553.5  # 16-bit PNG value -> meters
depth_gt_for_training = depth_gt / np.load('cameras.npz')['scale_mat_0'][0, 0]  # meters -> normalized scene units
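As a quick sanity check (a sketch; the filename is a placeholder), the GT depth after both divisions should end up on the same order as the rendered depth in the normalized scene, roughly [0, 2] for Replica:

```python
import cv2
import numpy as np

scale = np.load('cameras.npz')['scale_mat_0'][0, 0]    # IDR-style scene normalization factor
depth = cv2.imread("depth.png", -1) / 6553.5 / scale   # raw PNG -> meters -> scene units
print(depth.min(), depth.max())                        # expect values roughly within [0, 2]
```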
Wjt-shift commented 1 year ago

> Hi, after you divide the depth by 6553.5 you get the depth value in meters. But we further normalize the scene, where the normalization factor is computed from the GT mesh and saved as scale_mat in cameras.npz, so you also need to divide by this scale. For example:
>
> depth_gt = cv2.imread("depth.png", -1) / 6553.5
> depth_gt_for_training = depth_gt / np.load('cameras.npz')['scale_mat_0'][0, 0]

Thanks for your reply! It seems to be solved after your suggestions. I have another question: if I get depth without a known scale (e.g., from a monocular SLAM system), should I still scale the depth with scale_mat (i.e., divide the depth by it)? It seems scale_mat normalizes the poses to a unit cube. And in this situation, should I use your depth loss? Thanks again, your suggestions are very helpful.

niujinshuchong commented 1 year ago

Hi, depth from a monocular SLAM system is also only defined up to scale, so in this case you should use the scale-invariant loss in MonoSDF.
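For reference, here is a minimal sketch of a scale-and-shift-invariant depth loss in the MiDaS style that MonoSDF builds on; it is an illustrative reimplementation, not the repository's exact code, and the (B, H, W) tensor shapes are assumptions:

```python
import torch

def scale_shift_invariant_loss(pred, target, mask):
    """Align `pred` to `target` with a per-image least-squares scale and shift,
    then return the masked mean squared residual (MiDaS-style)."""
    m = mask.float()
    # Normal equations for (w, q) minimizing sum m * (w * pred + q - target)^2 per image.
    a00 = (m * pred * pred).sum(dim=(1, 2))
    a01 = (m * pred).sum(dim=(1, 2))
    a11 = m.sum(dim=(1, 2))
    b0 = (m * pred * target).sum(dim=(1, 2))
    b1 = (m * target).sum(dim=(1, 2))
    det = (a00 * a11 - a01 * a01).clamp(min=1e-8)
    w = (a11 * b0 - a01 * b1) / det
    q = (-a01 * b0 + a00 * b1) / det
    aligned = w.view(-1, 1, 1) * pred + q.view(-1, 1, 1)
    return ((m * (aligned - target) ** 2).sum(dim=(1, 2)) / a11.clamp(min=1.0)).mean()
```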

Wjt-shift commented 1 year ago

> Hi, after you divide the depth by 6553.5 you get the depth value in meters. But we further normalize the scene, where the normalization factor is computed from the GT mesh and saved as scale_mat in cameras.npz, so you also need to divide by this scale. For example:
>
> depth_gt = cv2.imread("depth.png", -1) / 6553.5
> depth_gt_for_training = depth_gt / np.load('cameras.npz')['scale_mat_0'][0, 0]
>
> Thanks for your reply! It seems to be solved after your suggestions. I have another question: if I get depth without a known scale (e.g., from a monocular SLAM system), should I still scale the depth with scale_mat (i.e., divide the depth by it)? It seems scale_mat normalizes the poses to a unit cube. And in this situation, should I use your depth loss? Thanks again, your suggestions are very helpful.

Hi, I thought the problem was solved, but as training goes on the results do not meet my expectations. I compared three situations:

i) my Replica dataset with only the RGB and eikonal losses (red curve);
ii) my Replica dataset with RGB, eikonal, and depth L1 losses, with gt_depth processed exactly as you suggested and no other normalization:
    depth_gt = cv2.imread("depth.png", -1) / 6553.5
    depth_gt_for_training = depth_gt / np.load('cameras.npz')['scale_mat_0'][0, 0]
iii) your Replica dataset with RGB, eikonal, depth, and normal losses (blue curve).

(Screenshots: loss and PSNR curves for the three situations.) The result seems unreasonable: RGB + eikonal alone converges faster than RGB + eikonal + depth L1, and the depth L1 variant does not reach a better PSNR within the same number of iterations. When using gt_depth, the mean output depth over the iterations is shown in another screenshot. Should I process gt_depth differently, e.g., change its range by scaling it, as in your earlier suggestion:

> The GT depth in Replica is in the range [0, 6], so after your normalization it should be roughly in [0, 2], because the scene is normalized to [-1, 1]. Maybe you should multiply the depth by the scale instead of dividing by it?

Wjt-shift commented 1 year ago

> Hi, depth from a monocular SLAM system is also only defined up to scale, so in this case you should use the scale-invariant loss in MonoSDF.

And if I use a different dataset, should the constants 50 and 0.5 in the code be changed to suit the output depth range?

self.depth_loss(depth_pred.reshape(1, 32, 32), (depth_gt * 50 + 0.5).reshape(1, 32, 32), mask.reshape(1, 32, 32))

niujinshuchong commented 1 year ago

I would suggest you first make it work on our preprocessed dataset with GT depth supervision. The scaling constant 50 is chosen heuristically (see the explanation at the very beginning of this issue) and needs to be changed if you use other datasets or want to tune the weight of the depth loss.
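To make the heuristic concrete (a rough back-of-the-envelope check based on the ranges mentioned in this thread, not an official derivation): with the Omnidata monocular depth roughly in [0.01, 0.04], the transform depth_gt * 50 + 0.5 maps it to about [1.0, 2.5], which is on the same order as the rendered depth of a scene normalized to the [-1, 1] cube, so the depth term ends up with a reasonable magnitude relative to the other losses.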

niujinshuchong commented 1 year ago

Hi, we added support for rgbd data in SDFStudio. Maybe you could try it out if interested.