FelTris / durf

Code release for my thesis 'Neural Rendering for Dynamic Urban Scenes'. We use Neural Radiance Fields to perform novel-view-synthesis in unbounded outdoor scenes and jointly regress 3D bounding box poses.

Question about the depth loss when using inverse depth sampling #1


YZsZY commented 1 year ago

Hello! I recently tried to add depth supervision to MipNeRF360 and wanted to ask a question.

I noticed that you didn't use inverse depth sampling, but I feel that inverse depth sampling helps place more samples near the camera, which is especially useful when the sampling range (near and far) is not known in advance.

So I would like to ask: if I use inverse depth sampling instead of linear sampling, do I also need to convert the depth to inverse depth space when calculating the depth loss, so that the loss over the distribution is computed in the same space? (The range of the Gaussian distribution then becomes 0 to 1; I noticed that MipNeRF360 converts all distance-related losses to inverse depth space.)
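For concreteness, here is a minimal NumPy sketch of what uniform-in-inverse-depth sampling looks like; the near/far bounds and sample count are illustrative, not values from this repo:

```python
import numpy as np

# Uniform samples s in normalised [0, 1] space, mapped to metric depth t
# via the inverse-depth warp g(x) = 1/x (as in MipNeRF360).
near, far = 0.1, 100.0                   # illustrative bounds
s = np.linspace(0.0, 1.0, 64)
t = 1.0 / (s / far + (1.0 - s) / near)   # t runs from near to far,
                                         # densely near the camera
```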

Looking forward to your reply! I've recently been trying to port your code to PyTorch and would appreciate your guidance!

FelTris commented 1 year ago

Hi!

Yes, using inverse sampling in these unbounded settings is a good idea; in my case it didn't really work, though. I guess I had a mistake somewhere, probably in converting the camera positions to normalised space. In case you haven't seen it, the official code release for MipNeRF360 probably has the inverse sampling and the correct procedure for converting the cameras to this normalised space.
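For reference, the normalised space meant here is MipNeRF360's scene contraction; a minimal sketch, not this repo's exact code:

```python
import numpy as np

def contract(x):
    # MipNeRF360 scene contraction: points inside the unit ball are kept
    # unchanged, everything outside is squashed into the ball of radius 2.
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-9)
    return np.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * x / norm)
```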

If you want to supervise the depth in this inverse depth space, you have to either convert the predicted depth back to true depth or convert your ground-truth depth to inverse depth. For the distribution/near-surface loss to behave more intuitively, I would suggest the first option: convert the predicted inverse depth back to true/linear depth and apply the depth losses at that scale. At the inverse scale, the Gaussian placed around the depth measurement could be a bit tricky, since the distribution of samples is logarithmic between 0 and 1.
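Sketching the two options with hypothetical helpers (the g(x) = 1/x warp with known near/far bounds is assumed):

```python
def s_to_t(s, near, far):
    # normalised inverse depth s in [0, 1] -> metric depth t in [near, far]
    return 1.0 / (s / far + (1.0 - s) / near)

def t_to_s(t, near, far):
    # metric depth t -> normalised inverse depth s in [0, 1]
    return (1.0 / t - 1.0 / near) / (1.0 / far - 1.0 / near)

# First option (suggested above): convert the prediction, keep GT as-is, e.g.
# depth_pred = s_to_t(depth_pred_s, near, far)
# and compute the depth losses in metric space.
```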

Also, the near-surface loss might not work that well with inverse depth sampling, since samples for far-away points can be spaced very far apart. Say your ground-truth depth is at 60, but your nearest samples are at 55 and 70; then neither sample is included in the near-surface loss. It would be interesting to see if it works, though!
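To make that concrete (the +/- 2 window below is an illustrative choice, not the repo's actual epsilon):

```python
import numpy as np

tdist = np.array([55.0, 70.0])      # nearest samples around the surface
depth_gt, eps = 60.0, 2.0           # GT depth and an illustrative window
mask_near = np.abs(tdist - depth_gt) < eps
print(mask_near)                    # [False False] -> no sample contributes
```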

So to summarise: yes, you should convert the predicted inverse depth to linear depth if you want to use the near-surface/Gaussian losses. Let me know if you have any other questions!

YZsZY commented 1 year ago

Thank you for your reply! In fact, the inverse depth is only used for sampling; the actual network input is the corresponding Euclidean-space coordinates (the network input is tdist rather than sdist), so the final rendered depth is also at the Euclidean scale. But inverse depth sampling has a serious problem: as you mentioned, it is unlikely to pick the exact point. I tried changing 1/x to log(x), which gives relatively uniform sampling.

[Figure: sample spacing comparison; red = log sampling, blue = inverse (1/x) sampling]
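A sketch of the two warps being compared (values illustrative): uniform samples in warped space, mapped back to metric depth.

```python
import numpy as np

near, far, n = 1.0, 100.0, 8
s = np.linspace(0.0, 1.0, n)

# inverse-depth warp g(x) = 1/x: spacing grows rapidly with distance
t_inv = 1.0 / (s / far + (1.0 - s) / near)

# log warp g(x) = log(x): spacing grows geometrically, much more even
t_log = np.exp(s * np.log(far) + (1.0 - s) * np.log(near))

print(np.round(t_inv, 1))  # [  1.   1.2  1.4  1.7  2.3  3.4  6.6 100.]
print(np.round(t_log, 1))  # [  1.   1.9  3.7  7.2 13.9 26.8 51.8 100.]
```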

Besides the sampling problem, I have another question. I noticed that when supervising with depth, the method computes an MSE loss, over the sample segments, between an ideal normal distribution, y = 1/sqrt(2 * pi * sigma^2) * exp(-(tdist - depth_GT)^2 / (2 * sigma^2)), and the weight distribution learned by NeRF. Here I would like to raise an issue.

The ideal normal distribution integrates to 1, i.e. ∑ y * delta = 1, whereas the weights are summed directly (weights.sum(dim=-1) = 1, with no delta factor). So the scales of the weights and the ideal normal distribution may not match. I personally feel the ideal distribution should be normalised so that it sums to 1, i.e. y /= y.sum(), so that the two distributions are on a consistent scale. Otherwise, if you visualise them directly, you will see that the two distributions differ greatly in scale:

[Figure: top = ideal distribution, bottom = learned weights; their scales are very different]

I noticed that you divided the ideal distribution by its maximum value, but I don't feel that solves the problem, so the above seems a more reasonable approach!
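In code, the scale mismatch looks like this (a small self-contained check with illustrative numbers):

```python
import numpy as np

# A Gaussian pdf integrates to 1, but evaluated at discrete sample locations
# its *sum* is only ~1 if each value is multiplied by the bin width delta.
# Normalising by the sum fixes the scale either way.
tdist = np.linspace(0.0, 20.0, 200)   # illustrative sample locations
depth_gt, sigma = 10.0, 0.5
y = np.exp(-(tdist - depth_gt) ** 2 / (2 * sigma ** 2))
y /= np.sqrt(2 * np.pi * sigma ** 2)

delta = tdist[1] - tdist[0]
print((y * delta).sum())              # ~1.0 (approximates the integral)
print(y.sum())                        # ~1/delta, not 1 -> scale mismatch

y_normalised = y / y.sum()            # now comparable to NeRF weights,
print(y_normalised.sum())             # which also sum to 1
```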

If anything above is unclear, please ask; I may not have described everything precisely. Looking forward to your reply!

FelTris commented 1 year ago

Yes, for the distribution I think you are correct!

I initially chose to divide by y.max() to make the weight closest to the ground-truth depth close to one, but of course the whole distribution then sums to much more than one. The weights, on the other hand, cannot sum to more than one by NeRF's convention, so your intuition is right. Your solution makes sense; just take care to add a small eps to the division for numerical stability:
y /= y.sum(axis=-1) + 1e-12

Otherwise, when the distribution is all zeros you would run into NaNs :) Also, in my code you would first apply the mask and then do this division, since otherwise the sum would be nonsense. So instead of:

y /= y.max()
y *= mask_near

you would do this:

y *= mask_near
y /= y.sum(axis=-1) + 1e-12
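Roughly, the whole corrected target construction would then look like this (a sketch with illustrative names and shapes, not the exact code from this repo):

```python
import numpy as np

def depth_target(tdist, depth_gt, sigma, eps_near):
    # tdist: (num_rays, num_samples) sample depths
    # depth_gt: (num_rays,) ground-truth depths
    y = np.exp(-(tdist - depth_gt[..., None]) ** 2 / (2 * sigma ** 2))
    y /= np.sqrt(2 * np.pi * sigma ** 2)
    mask_near = np.abs(tdist - depth_gt[..., None]) < eps_near
    y *= mask_near                                # mask first ...
    y /= y.sum(axis=-1, keepdims=True) + 1e-12    # ... then normalise
    return y

# loss = ((depth_target(tdist, depth_gt, 0.5, 2.0) - weights) ** 2).mean()
```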

Thanks for finding this!

YZsZY commented 1 year ago

I just happened to find this distribution problem while trying to use the depth loss from UrbanRF, although this loss currently doesn't work very well in my code, haha. In addition, nerfstudio also has a version of the UrbanRF depth loss implementation; maybe you can refer to: https://github.com/nerfstudio-project/nerfstudio/blob/f5f424095a05ce565249955f5919f4c06e3c7319/nerfstudio/model_components/losses.py#L268 and https://github.com/nerfstudio-project/nerfstudio/pull/1173