duanyiqun / DiffusionDepth

PyTorch implementation of a diffusion-based approach to 3D depth perception (ECCV 2024)
https://arxiv.org/abs/2303.05021
Apache License 2.0

Question about diffusion #1


XiangMochu commented 1 year ago

Hi there, great work! Really appreciate that you open source the code so soon!

I have some questions about the diffusion and denoising process.

The image shown in the README is really impressive: [image from the README]

Does this image show the denoising process? If so, why are the depth contents shown in a 'near-to-far' way?

The random Gaussian noise is $\epsilon \sim \mathcal N(0, \mathbf I)$, and the GT depth map / depth prediction should have been normalized to $[-1, 1]$; however, since the above image shows contents appearing from near to far, should I assume that the final depth map / depth prediction is not in the range $[-1, 1]$, but in a greater range (e.g., $[0, 80]$ for KITTI and $[0, 10]$ for NYU)?

If so, the diffusion and denoising steps are probably problematic, since commonly, if we choose the noise as $\mathcal N(0, \mathbf I)$, the output range is chosen as $[-1, 1]$. I have not yet found the normalization process in the code.
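For concreteness, here is a minimal sketch of the convention described above (standard DDPM practice, not code taken from this repo; `normalize_depth`, `denormalize_depth`, and `max_depth` are hypothetical names):

```python
import torch

def normalize_depth(depth, max_depth=80.0):
    # Map metric depth in [0, max_depth] to [-1, 1], the usual DDPM range,
    # so its scale matches noise drawn from N(0, I).
    return depth / max_depth * 2.0 - 1.0

def denormalize_depth(x, max_depth=80.0):
    # Inverse mapping: [-1, 1] back to metric depth in [0, max_depth].
    return (x + 1.0) / 2.0 * max_depth

x0 = normalize_depth(torch.rand(1, 1, 64, 64) * 80.0)  # toy GT depth map
eps = torch.randn_like(x0)                              # noise ~ N(0, I), matched scale
```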

Correct me if I'm wrong, I would be very happy to hear from you!

duanyiqun commented 1 year ago

Hi there, thanks very much for your good question. The observation is correct; we have the same impression that the depth contents appear in a 'near-to-far' way in this case. A more precise description would be that the initial depth map decoded from the depth latent (Gaussian noise) is on average 'near', and after denoising it becomes 'far' for many pixels, although some pixels end up even 'closer'. In some cases we observe that it initializes at an intermediate depth value.

Yes, in that case the output depth range is $[0, 88]$ for KITTI and $[0, 10]$ for NYU, but the depth latent itself is normalized.

The diffusion-denoising process is performed in a latent space of shape $h/2 \times w/2 \times \text{depth-dim}$; we set the depth dim to 16. This is realized by the depth encoder-decoder. The diffusion latent space is normalized; however, when we visualize this figure, we use the inverse depth transform to map the normalized latent feature back to a depth map with real depth values at each step. Please refer to this intermediate-layer visualization.

During a single inference, considering the inference time, we only decode the latent after the whole denoising process.
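A rough sketch of the data flow described in the last two paragraphs (all module names here are hypothetical stand-ins for the repo's actual depth encoder-decoder and denoiser, not its API): diffusion runs on an $h/2 \times w/2 \times 16$ latent, the README figure decodes every intermediate latent, and normal inference decodes only the final one.

```python
import torch

def run_denoising(model, decoder, inv_transform, steps=20,
                  h=120, w=160, depth_dim=16, visualize=False):
    # Diffusion operates on a (depth_dim, h/2, w/2) latent,
    # initialized from Gaussian noise.
    latent = torch.randn(1, depth_dim, h // 2, w // 2)
    frames = []
    for t in reversed(range(steps)):
        latent = model(latent, t)  # one reverse-diffusion step in latent space
        if visualize:
            # For the README figure: decode each intermediate latent and map it
            # back to metric depth with the inverse depth transform.
            frames.append(inv_transform(decoder(latent)))
    # Normal inference decodes only once, after the full denoising chain.
    depth = inv_transform(decoder(latent))
    return depth, frames

# Toy usage with placeholder callables, just to show the data flow:
depth, frames = run_denoising(
    model=lambda z, t: 0.9 * z,                 # placeholder "denoiser"
    decoder=lambda z: z.mean(1, keepdim=True),  # placeholder depth decoder
    inv_transform=lambda d: (d + 1.0) * 5.0,    # placeholder inverse transform
    visualize=True)
```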

As for the phenomenon itself, it is intuitively reasonable, since the model first forms shapes and edges and then refines the depth values step by step. We don't yet have a clear mathematical proof of why it presents itself this way. If you have any ideas, I would very much like to discuss this point with you.

Cheers

XiangMochu commented 1 year ago

Thank you so much for the reply! Now I understand that the diffusion process happens in the latent space, and that makes sense to me. So we cannot expect the output decoded from a latent feature corrupted by random noise to represent any reasonable content until it is fully denoised.

I'm still trying to figure out the details of the entire process, and the code is helping a lot. Thanks again for open-sourcing it, really appreciate that!

duanyiqun commented 1 year ago

Hi TangTao (that comment disappeared, which is quite strange),

Thank you for the question. I couldn't find the question under this issue, so I'm replying to it through mail. I think the real depth value is recovered from the output of conv_inv_transform as 1/conv_inv_transform(...).clamp(1/max_depth). For this case eps is 0.001, which means the largest possible output depth would be 1000 m.
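In code terms, my reading of that mapping would look roughly like this (a sketch of the description above, not a verbatim excerpt from the repo; `inverse_depth_to_metric` is a hypothetical name):

```python
import torch

def inverse_depth_to_metric(pred, eps=1e-3):
    # Convert an inverse-depth prediction to metric depth.
    # Clamping at eps = 0.001 caps the recoverable depth at 1/eps = 1000 m.
    return 1.0 / pred.clamp(min=eps)

pred = torch.tensor([0.0005, 0.0125, 0.5])  # toy inverse-depth values
print(inverse_depth_to_metric(pred))        # -> tensor([1000., 80., 2.])
```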

Best regards
