Open command-z-z opened 2 months ago
I want to ask about this as well, because I fine-tuned with my sparse GT labels and the results were horrible; I'm not sure if I missed something important. I tried lowering the pre-trained model's learning rate from 5e-06 to 1e-06. Don't worry about my dataset size: I tried amounts from 10,000 to 100,000 samples, but the loss was still hard to reduce. (I suspect it might have something to do with my sparse labels: I simply set valid_mask as gt > 0 to ignore pixels without depth values, but the pixels where gt > 0 cover perhaps only 10-20% of an image, so I'm worried it might be too sparse.)
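For reference, the masked-loss setup described above can be sketched like this. This is a minimal illustration in PyTorch, not the repository's actual training code; the function name and shapes are mine:

```python
import torch
import torch.nn.functional as F

def masked_l1_loss(pred, gt):
    """L1 loss computed only over pixels that have ground-truth depth (gt > 0)."""
    valid_mask = gt > 0            # sparse GT: only ~10-20% of pixels are valid
    if valid_mask.sum() == 0:      # guard against images with no labeled pixels
        return pred.sum() * 0.0
    return F.l1_loss(pred[valid_mask], gt[valid_mask])

# example: a prediction against a very sparse ground-truth map
pred = torch.rand(1, 1, 8, 8)
gt = torch.zeros(1, 1, 8, 8)
gt[0, 0, 2, 3] = 1.5               # only two labeled pixels in the whole image
gt[0, 0, 5, 6] = 2.0
loss = masked_l1_loss(pred, gt)    # averaged over the two valid pixels only
```

With this kind of masking the gradient only flows through the few valid pixels, so per-batch loss can be noisy when the labels are this sparse.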
The answer is in the ZoeDepth paper. They train the heads from scratch: "Our model first learns from a large variety of datasets in pre-training which leads to good generalization. In the second stage, we add heads for metric depth estimation to the encoder-decoder architecture and fine-tune them on metric depth datasets".
So, does it just load the encoder's pre-trained weights and randomly initialize the head network for fine-tuning?
Thank you for the awesome project! I have a question: when you fine-tune the metric depth model, do you load the pre-trained Depth Anything V2 DINOv2 backbone and randomly initialize the DPT head (or use some other initialization), or do you do any other special processing?
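The "load the backbone, leave the head randomly initialized" pattern being asked about can be sketched as below. This is a toy stand-in, not the repo's actual code: the module names and shapes are placeholders (the real model uses a DINOv2 encoder and a DPT head), and the point is just the `strict=False` partial load:

```python
import torch
import torch.nn as nn

class MetricDepthModel(nn.Module):
    """Toy model: pre-trained encoder plus a freshly initialized depth head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 8)   # placeholder for the DINOv2 backbone
        self.head = nn.Linear(8, 1)       # placeholder for the DPT head

    def forward(self, x):
        return self.head(self.encoder(x))

model = MetricDepthModel()

# Pretend this checkpoint holds only relative-depth pre-training weights
# for the encoder (a real checkpoint would come from torch.load).
ckpt = {"encoder.weight": torch.ones(8, 16), "encoder.bias": torch.zeros(8)}

# strict=False loads the keys that match (the encoder) and reports the
# head parameters as missing, so the head keeps its random initialization.
missing, unexpected = model.load_state_dict(ckpt, strict=False)
```

After this, `missing` lists the head parameters (confirming they were not overwritten), and only the encoder carries pre-trained weights into fine-tuning.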