YvanYin / Metric3D

The repo for "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" and "Metric3Dv2: A Versatile Monocular Geometric Foundation Model..."
https://jugghm.github.io/Metric3Dv2/
BSD 2-Clause "Simplified" License

iters parameter in RAFT DPT decoder #124

Open Sparsh913 opened 3 months ago

Sparsh913 commented 3 months ago

Hi Authors,

Awesome work! I've been playing around with the training pipeline and came across an interesting observation. When using the dinov2_vitl14_reg4_pretrain.pth weights for the encoder, setting cfg.model.decode_head.iters to 4 in the config file makes the losses become NaN after a few epochs, but with 8 iters the training runs fine. I understand that more iters yield finer depth refinement from the decoder head, but could you share your insights on why fewer iters might cause the losses to become NaN?
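For reference, this is roughly the knob I mean, shown as a minimal sketch of an mm-style Python config; the `type` name and surrounding keys are illustrative placeholders, and only `iters` corresponds to `cfg.model.decode_head.iters`:

```python
# Hypothetical config excerpt; only `iters` is the field discussed above,
# the other keys are placeholders and may not match the repo's actual config.
model = dict(
    decode_head=dict(
        type='RAFTDepthDPTDecoder',  # placeholder name for the RAFT DPT decoder head
        iters=8,                     # number of recurrent refinement iterations; 4 reportedly led to NaN losses
    ),
)
```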

Thanks!

JUGGHM commented 3 months ago

This is strange but interesting, and I have no clear explanation yet. We do observe that the first two iterations are the most important, and after that the refinements become incremental, but that does not explain your case.

Since all iteration outputs are supervised, the decoder parameters should not be this sensitive to a change in the number of iterations.
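To be concrete about "all outputs are supervised", here is a minimal sketch of a RAFT-style sequence loss that accumulates supervision over every intermediate prediction, not just the final one; the list name `predictions`, the L1 term, and the exponential weighting are illustrative assumptions rather than the exact loss used in this repo:

```python
import torch
import torch.nn.functional as F

def multi_iteration_loss(predictions, target, gamma: float = 0.8):
    """Supervise every refinement iteration, weighting later iterations more
    (RAFT-style weighting; the actual loss terms in Metric3D may differ)."""
    n = len(predictions)
    loss = torch.zeros((), device=target.device)
    for i, pred in enumerate(predictions):
        weight = gamma ** (n - i - 1)          # later iterations get larger weight
        loss = loss + weight * F.l1_loss(pred, target)
    return loss
```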

Do you have surface normal labels in your datasets? If not, the depth-normal consistency loss should probably be kept. If you want to remove the normal branch, I would suggest freezing the encoder-decoder in the initial stage so that only the regression layers are trained; after some steps, once the network becomes stable, you can unfreeze the other parts (see the sketch below). Another possibility is that simpler losses (e.g., L1 only) could alleviate the problem.
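A minimal sketch of that freeze-then-unfreeze schedule in plain PyTorch; the attribute names `encoder`, `decoder`, and `regression_head`, as well as the warm-up step count, are hypothetical placeholders for the actual Metric3D module names:

```python
import torch

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def apply_warmup_freeze(model: torch.nn.Module, freeze: bool) -> None:
    # Freeze the encoder/decoder and keep only the final regression layers trainable.
    set_requires_grad(model.encoder, not freeze)        # hypothetical attribute name
    set_requires_grad(model.decoder, not freeze)        # hypothetical attribute name
    set_requires_grad(model.regression_head, True)      # hypothetical attribute name

# Outline of the schedule (illustrative step count, not from the repo):
# WARMUP_STEPS = 2000
# apply_warmup_freeze(model, freeze=True)
# for step, batch in enumerate(loader):
#     if step == WARMUP_STEPS:
#         apply_warmup_freeze(model, freeze=False)   # unfreeze once training is stable
#     ...
```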