DepthAnything / Depth-Anything-V2

[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

Your code has sigmoid output for metric depth #154

Open adizhol-str opened 2 months ago

adizhol-str commented 2 months ago

https://github.com/DepthAnything/Depth-Anything-V2/blob/31dc97708961675ce6b3a8d8ffa729170a4aa273/metric_depth/depth_anything_v2/dpt.py#L113

A Sigmoid() layer is used in the metric depth architecture. This doesn't make sense to me. Can you explain? (The relative depth architecture, which should return inverse depth in [0...1], doesn't have a sigmoid on the output.)

Thank you

ZYX-MLer commented 2 months ago

I also have a question about this: why is it not a linear regression layer?

adizhol-str commented 2 months ago

> I also have a question about this, why is it not a linear regression layer?

A ReLU or sigmoid should be there to keep the output positive (for either depth or disparity).

LiheYoung commented 2 months ago

Hi @adizhol-str and @ZYX-MLer, in metric depth estimation, it's common practice to use a sigmoid function to map the output to the 0-1 range. This is because the metric depth values fall within the range of 0 to max_depth. We use a sigmoid function to map the output to 0-1, then multiply it by a pre-defined max_depth. However, for relative depth estimation, inverse depth values can range from 0 to infinity, so we use a ReLU function instead.
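A minimal numpy sketch of the two output heads described above (my own illustration, not the repo's code; the logits and the max_depth value are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw outputs (logits) from the network's final layer.
logits = np.array([-2.0, 0.0, 3.0])

# Metric head: sigmoid maps logits into (0, 1), then a pre-defined
# max_depth scales the result into (0, max_depth) meters.
max_depth = 80.0
metric_depth = sigmoid(logits) * max_depth

# Relative head: inverse depth (disparity) is unbounded above, so a
# ReLU only enforces non-negativity instead of an upper bound.
relative_inv_depth = np.maximum(logits, 0.0)

print(metric_depth)        # each value lies strictly inside (0, 80)
print(relative_inv_depth)  # non-negative, no upper bound
```

The key difference is the bound: the sigmoid head can never predict beyond max_depth, while the ReLU head leaves the top of the range open, which matches unbounded inverse depth.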

adizhol-str commented 2 months ago

@LiheYoung Thank you for clarifying. It is confusing, since both the Depth Anything paper and MiDaS mention that inverse depth is scaled to [0...1] for the relative depth training.

I3aer commented 1 month ago

I want to ask a question about the following text:

"Concretely, the depth value is first transformed into the disparity space by d=1/t and then normalized to 0∼1 on each depth map. To enable multi-dataset joint training, we adopt the affine-invariant loss to ignore the unknown scale and shift of each sample"

Are the outputs of the relative depth models the disparities d = 1/t? That is, are they normalized to 0~1 only for computing the loss function?
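The quoted recipe can be sketched in a few lines of numpy (my own illustration under assumed details, not the repo's code; the median/mean-absolute-deviation alignment follows the MiDaS-style affine-invariant loss, and the depth values are made up):

```python
import numpy as np

def to_normalized_disparity(depth):
    """Transform depth t to disparity d = 1/t, then min-max
    normalize to [0, 1] within each depth map."""
    disp = 1.0 / depth
    return (disp - disp.min()) / (disp.max() - disp.min())

def affine_invariant_loss(pred, target):
    """Align each map by its median (shift) and mean absolute
    deviation (scale) before an L1 loss, so per-sample scale and
    shift are ignored during multi-dataset training."""
    def align(x):
        shift = np.median(x)
        scale = np.mean(np.abs(x - shift))
        return (x - shift) / scale
    return np.mean(np.abs(align(pred) - align(target)))

depth = np.array([1.0, 2.0, 4.0, 10.0])  # hypothetical GT depths (m)
target = to_normalized_disparity(depth)
pred = 3.0 * target + 0.5                # prediction off by scale & shift
print(affine_invariant_loss(pred, target))  # ~0: affine offsets ignored
```

Note the loss is near zero even though the prediction differs from the target by a scale and a shift, which is exactly the invariance the paper needs to mix datasets with unknown depth calibrations.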