DepthAnything / Depth-Anything-V2

Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

Worse metric depth than V1? #26

Open BaderTim opened 1 week ago

BaderTim commented 1 week ago

Hi there! First of all, thanks for publishing your updated work. The details captured by this version are amazing.

However, on my own benchmark dataset, which is based on sparse single-distance lidar ground truth, I get far worse results on metric depth (using max depth 64) - roughly three times worse.

Is this expected because of the synthetic training data? I thought that using ViT-G as the teacher model had solved the distribution shift.

Thanks!

LiheYoung commented 6 days ago

Hi, could you please provide more details? E.g., what do you mean by '3 times as bad'?

According to our V2 results on standard benchmarks (Table 2), such as NYUv2 and KITTI, the aligned metric depth should be at least comparable with our V1 models.

BaderTim commented 6 days ago

Hi, thanks for your response.

I am using images from my own camera and resize/crop them to the KITTI format (while keeping my horizontal fov around 90°). For every image, there are a few single distance points captured, ranging from 6m to 120m in total.
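For reference, the width-first resize plus vertical crop described above can be sketched as follows. This is a minimal illustration, not code from either repo; the KITTI evaluation resolution of 1216×352 and the nearest-neighbour resize are assumptions:

```python
import numpy as np

KITTI_W, KITTI_H = 1216, 352  # common KITTI eval resolution (assumption)

def to_kitti_format(img: np.ndarray) -> np.ndarray:
    """Resize to KITTI width (nearest-neighbour), then center-crop the height.

    Scaling by width alone preserves the horizontal FOV; the vertical
    crop then discards top/bottom rows to reach the KITTI aspect ratio.
    """
    h, w = img.shape[:2]
    scale = KITTI_W / w
    new_h = int(round(h * scale))
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(KITTI_W) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    top = max((new_h - KITTI_H) // 2, 0)
    return resized[top:top + KITTI_H]
```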

When evaluating on those, V1 produces an RMSE of around 7, while V2 achieves only around 19.

I am using the inference code from run.py in my evaluation code. For V1 I could not define a maximum distance; for V2 I used 128.
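The evaluation itself is straightforward: compare the predicted depth map only at the sparse lidar pixels. A minimal sketch (the `(row, col, depth_m)` point format is illustrative, not from either repo):

```python
import numpy as np

def sparse_rmse(pred_depth: np.ndarray, points) -> float:
    """RMSE between a predicted metric depth map and sparse lidar samples.

    `points` is an iterable of (row, col, depth_m) ground-truth samples,
    so only pixels with lidar returns contribute to the error.
    """
    errs = [pred_depth[r, c] - d for r, c, d in points]
    return float(np.sqrt(np.mean(np.square(errs))))
```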

myasincifci commented 2 days ago

I have had a similar experience using the metric_depth fine-tuning code provided in the V2 repo. However, when I use the ZoeDepth code from the V1 repo with V2 weights, I get a slightly better result. I am not particularly familiar with the network architecture, so I might be mistaken, but it seems to me that the metric fine-tuning pipeline in V2 uses only a DPT decoder head trained directly for metric depth, while the ZoeDepth head used in V1 uses metric bins to convert relative depth to metric depth.

@LiheYoung I am still a little confused about the metric fine-tuning, because the V2 paper states that ZoeDepth is used, but the code in the repo seems to be missing the metric bins. Or am I missing something?
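To make the distinction concrete, here is a rough sketch of the two kinds of output head being contrasted. This is conceptual only - the function names and shapes are illustrative, and the claim that V2's metric head is a direct sigmoid-scaled regression is an assumption drawn from the discussion above, not verified against the repo:

```python
import numpy as np

def direct_metric_head(logits, max_depth=128.0):
    """Direct regression: sigmoid squashes logits into (0, 1),
    then scaling gives metric depth in (0, max_depth)."""
    return max_depth / (1.0 + np.exp(-np.asarray(logits, dtype=float)))

def bin_based_head(bin_logits, bin_centers):
    """ZoeDepth-style idea: predict a distribution over metric bin
    centers, and take the softmax-weighted sum as the metric depth."""
    z = np.asarray(bin_logits, dtype=float)
    p = np.exp(z - z.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return (p * np.asarray(bin_centers, dtype=float)).sum(axis=-1)
```

If the bin machinery is missing, the head falls back to direct regression, which could plausibly behave differently at evaluation time even with the same backbone weights.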