[Open] BaderTim opened this issue 1 week ago
Hi, could you please provide more details? E.g., what do you mean by '3 times as bad'?
According to our V2 results on standard benchmarks such as NYUv2 and KITTI (Table 2), the aligned metric depth should be at least comparable to our V1 models.
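For reference, "aligned" in this context usually means fitting a per-image scale (and optionally a shift) between prediction and ground truth before computing error, so that models with different absolute scales can be compared fairly. A minimal sketch of that protocol, assuming NumPy arrays and a hypothetical helper name (this is not code from the repo):

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Least-squares scale s and shift t so that s*pred + t best fits gt
    on the valid (mask) pixels. Hypothetical helper for illustration."""
    p = pred[mask].astype(np.float64)
    g = gt[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)   # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t, s, t
```

Computing RMSE on the aligned prediction instead of the raw output removes any global scale mismatch from the comparison.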
Hi, thanks for your response.
I am using images from my own camera and resize/crop them to the KITTI format (while keeping my horizontal FOV around 90°). For every image there are a few sparse single-distance lidar points, ranging from 6 m to 120 m overall.
When evaluating on those, V1 produces an RMSE of around 7, while V2 only reaches around 19.
I am using the inference code from run.py in my evaluation code. For V1 I could not define a maximum distance; for V2 I used 128.
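The evaluation described above can be sketched as follows: for each image, compare the predicted depth map against the handful of sparse lidar distances and accumulate an RMSE. The `((row, col), distance)` format for the lidar samples is an assumption for illustration, not the actual evaluation code:

```python
import numpy as np

def sparse_rmse(pred_depth, gt_points, max_depth=None):
    """RMSE over sparse ground-truth samples.
    pred_depth: (H, W) predicted depth map.
    gt_points:  list of ((row, col), distance_m) tuples -- assumed format
                for the single lidar distances described above."""
    preds, gts = [], []
    for (r, c), d in gt_points:
        if max_depth is not None and d > max_depth:
            continue  # skip points beyond the model's configured range
        preds.append(pred_depth[r, c])
        gts.append(d)
    preds, gts = np.asarray(preds), np.asarray(gts)
    return float(np.sqrt(np.mean((preds - gts) ** 2)))
```

One thing worth checking with such a protocol is whether the `max_depth` cap is applied consistently to both models, since points beyond the cap can dominate the RMSE.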
I have a similar experience using the metric_depth fine-tuning code provided in the v2 repo. However, when I use the ZoeDepth code from the v1 repo with v2 weights, I get slightly better results. I am not particularly familiar with the network architecture, so I might be mistaken, but it seems to me that the metric fine-tuning pipeline in v2 only uses a DPT decoder head trained directly for metric depth, while the ZoeDepth head used in v1 uses metric bins to convert relative depth to metric depth.
@LiheYoung I am still a little confused about the metric fine-tuning: the v2 paper states that ZoeDepth is used, but the code in the repo seems to be missing the metric bins. Or am I missing something?
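For context, the metric-bins mechanism being asked about can be sketched in a few lines: a ZoeDepth-style head predicts bin centers plus per-pixel bin logits, and metric depth is the softmax-weighted sum of the centers, whereas a plain DPT head regresses a depth value directly. A toy illustration of the bins conversion (not the actual ZoeDepth code):

```python
import numpy as np

def bins_to_metric(bin_centers, bin_logits):
    """Toy adaptive-bins conversion: per-pixel softmax over bin logits,
    then a weighted sum of the predicted bin centers (in metres).
    Shapes: bin_centers (K,), bin_logits (K, H, W)."""
    e = np.exp(bin_logits - bin_logits.max(axis=0, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=0, keepdims=True)
    return np.tensordot(bin_centers, probs, axes=1)  # (H, W) metric depth
```

If the v2 metric pipeline really omits this step and regresses depth directly from the DPT head, that would be a genuine architectural difference from what the paper describes, which seems to be the crux of the question.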
Hi there! First of all, thanks for publishing your updated work. The details captured by this version are amazing.
However, on my own benchmark dataset, which is based on sparse single-distance lidar ground truth, I get far worse results on metric depth (using max depth 64): roughly 3 times as bad.
Is this expected because of the synthetic training data? I thought using ViT-G as the teacher model had solved the distribution shift.
Thanks!