Potential bug in depth estimation training code

EPFL-VILAB / omnidata

A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]

Potential bug in depth estimation training code #41

Closed JasonQSY closed 6 months ago

JasonQSY commented 1 year ago

Hi,

It looks like the depth estimation training code has some issues.

1. It seems that the depth prediction and the ground truth are passed in reversed order to the virtual normal loss. https://github.com/EPFL-VILAB/omnidata/blob/main/omnidata_tools/torch/train_depth.py#L272

vn_loss = self.vnl_loss(depth_preds, depth_gt)

However in the forward function https://github.com/EPFL-VILAB/omnidata/blob/main/omnidata_tools/torch/losses/virtual_normal_loss.py#L151

def forward(self, gt_depth, pred_depth, select=True):
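To illustrate why the argument order matters, here is a minimal sketch using a toy masked loss. `masked_l1` is a hypothetical stand-in, not the repository's `VNL_Loss`; it only mirrors the `(gt_depth, pred_depth)` order of the `forward` signature above and assumes that invalid pixels are marked by a ground-truth depth of 0.

```python
def masked_l1(gt_depth, pred_depth, eps=1e-6):
    """Toy depth loss: mean |gt - pred| over pixels with valid ground truth.

    Hypothetical helper for illustration only. The mask is computed from the
    first argument, so swapping the arguments changes which pixels count.
    """
    valid = [(g, p) for g, p in zip(gt_depth, pred_depth) if g > eps]
    if not valid:
        return 0.0
    return sum(abs(g - p) for g, p in valid) / len(valid)


gt = [2.0, 0.0, 4.0]    # 0.0 marks a pixel with missing ground truth (assumed convention)
pred = [2.5, 3.0, 4.0]

# Positional call in the wrong order: the mask is built from the predictions,
# so the pixel with missing ground truth is no longer excluded.
swapped = masked_l1(pred, gt)

# Keyword arguments make the call independent of the parameter order.
intended = masked_l1(gt_depth=gt, pred_depth=pred)
```

Passing the tensors by keyword at the call site is one way to guard against this kind of swap.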

2. I think the camera intrinsics in the virtual normal loss are also wrong. In https://github.com/EPFL-VILAB/omnidata/blob/main/omnidata_tools/torch/train_depth.py#L80,

self.vnl_loss = virtual_normal_loss.VNL_Loss(1.0, 1.0, (self.image_size, self.image_size))

The focal length of 1.0 seems to assume that the image width is 1, or that the value is a FOV. However, the focal length is actually expected in pixel space. If you look at the code in the virtual normal loss that back-projects depth to a point cloud, it works in screen space. https://github.com/EPFL-VILAB/omnidata/blob/main/omnidata_tools/torch/losses/virtual_normal_loss.py#L44
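The concern can be sketched with a standard pinhole back-projection. This is not the repository's code: `focal_from_fov` and `backproject` are hypothetical helpers, and the 512-pixel image size and 90 degree field of view are assumed values chosen only to make the numbers easy to follow.

```python
import math


def focal_from_fov(fov_deg, image_size):
    """Focal length in pixels for a pinhole camera with the given field of view."""
    return 0.5 * image_size / math.tan(math.radians(fov_deg) / 2.0)


def backproject(u, v, depth, fx, fy, u0, v0):
    """Back-project pixel (u, v) with the given depth to a 3D camera-space point."""
    x = (u - u0) * depth / fx
    y = (v - v0) * depth / fy
    return (x, y, depth)


image_size = 512                 # assumed image size
u0 = v0 = image_size / 2.0       # principal point at the image centre

# Pixel on the right image border, at depth 2.
fx_pixels = focal_from_fov(90.0, image_size)
point_ok = backproject(512.0, 256.0, 2.0, fx_pixels, fx_pixels, u0, v0)

# Same pixel, but with a focal length of 1.0 applied to pixel coordinates.
point_f1 = backproject(512.0, 256.0, 2.0, 1.0, 1.0, u0, v0)
```

With pixel coordinates, a focal length of 1.0 stretches the x and y extent of the point cloud by a factor of roughly 256 relative to the depth in this example, which would change the directions of the normals computed from it.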

alexsax commented 1 year ago

Hi! Apologies for the late reply.

For 1. It does look like the GT and preds are swapped -- @Ainaz99 can you confirm? I think the effect here is simply that the loss is applied even on pixels that should be masked. Otherwise the VN loss should be correct, since it is symmetric except for the masking based on GT depth. My guess is that this didn't affect training much, since we only selected the quartile of points with the smallest VNL on which to apply the loss (same as the MiDaS paper).

For 2. I believe the fx and fy correspond to NDC and not screen space and so this should be correct. The naming is perhaps a bit confusing here and please correct me (or @Ainaz99 correct me) if I'm wrong.
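For reference, the relation between an NDC focal length and a pixel focal length can be sketched as follows. This assumes the common convention in which NDC coordinates span [-1, 1] across the image; whether the loss code actually receives NDC or pixel coordinates is the open question in this thread, and the sketch does not settle it. The helper names and the 512-pixel image size are hypothetical.

```python
def pixel_to_ndc(u, size):
    """Map a pixel coordinate in [0, size] to NDC in [-1, 1] (assumed convention)."""
    return 2.0 * u / size - 1.0


def ndc_focal_to_pixels(f_ndc, size):
    """Convert a focal length expressed in NDC units to pixels."""
    return f_ndc * size / 2.0


size = 512      # assumed image size
depth = 2.0
u = 384.0       # pixel column to back-project

# Back-projection with NDC coordinates and fx = 1.0.
x_ndc = pixel_to_ndc(u, size) * depth / 1.0

# Equivalent back-projection with pixel coordinates and the converted focal length.
fx_px = ndc_focal_to_pixels(1.0, size)
x_px = (u - size / 2.0) * depth / fx_px
```

Under this convention the two give the same x coordinate, so `fx = 1.0` is consistent only if the coordinates fed into the back-projection are also in NDC.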