Sungmin-Woo / ProDepth

[ECCV 2024] ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion
https://sungmin-woo.github.io/prodepth/
MIT License

Regarding the monocular pre-trained model #3

Closed jiakaa closed 1 month ago

jiakaa commented 1 month ago

If one aims to learn the depth of dynamic objects, the multi-frame depth network has to learn it from a monocular network. However, the code suggests that it is best not to train the monocular network simultaneously: it should be frozen and requires a pre-trained model. For the pre-trained monocular network to produce good depth for dynamic objects, it would seem to need ground-truth supervision. Can this work therefore be understood as indirect semi-supervised training rather than fully self-supervised training? Also, if I do not have a pre-trained monocular model (i.e., I do not freeze the monocular network and I turn on “mono_losses”), would the effort on dynamic-object depth be in vain?

Sungmin-Woo commented 1 month ago

Hi @jiakaa, thanks for your interest!

The single-frame network is pretrained using only the given images, without ground-truth supervision. For dynamic objects, the single-frame network without cost volume makes significantly fewer mistakes compared to the multi-frame network, as it avoids misaligned feature matching between frames during cost volume construction. This observation has been noted in several previous methods, which use single-frame depth to compensate for the errors of multi-frame depth in dynamic regions. We recommend reading Section 4.4 of the ManyDepth paper for more details.
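
For readers unfamiliar with that compensation scheme, here is a minimal PyTorch-style sketch of the idea: the single-frame prediction flags pixels where the cost-volume-based multi-frame prediction is likely wrong (e.g., on moving objects), and only those pixels are supervised by the single-frame depth. The function names, the threshold, and the log-L1 loss form are illustrative assumptions, not the exact ProDepth or ManyDepth implementation.

```python
import torch

def consistency_mask(mono_depth, multi_depth, thresh=1.0):
    """Flag pixels where multi-frame (cost-volume) depth disagrees with
    single-frame depth, e.g. on dynamic objects.

    mono_depth, multi_depth: (B, 1, H, W) tensors. `thresh` is a placeholder.
    """
    ratio = torch.max(mono_depth / multi_depth, multi_depth / mono_depth)
    return (ratio > thresh).float()  # 1 = unreliable multi-frame pixel

def teacher_supervision_loss(mono_depth, multi_depth):
    """On flagged pixels, pull the multi-frame prediction toward the detached
    single-frame prediction instead of trusting the cost volume."""
    mask = consistency_mask(mono_depth.detach(), multi_depth)
    diff = torch.abs(torch.log(multi_depth) - torch.log(mono_depth.detach()))
    return (mask * diff).mean()
```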

However, single-frame depth still contains some errors for non-static pixels, because optimizing the photometric reprojection loss provides an incorrect self-supervision signal for moving objects. Existing methods commonly use the auto-masking proposed by MonoDepth2 to alleviate this problem, whereas we use an uncertainty-aware loss reweighting, which better excludes potential dynamic regions during training. The single-frame depth learned with these strategies shows relatively correct depth for dynamic objects.
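
As a rough illustration of such reweighting (not the exact formulation in the paper), the per-pixel photometric loss can be down-weighted by a predicted uncertainty map using the common heteroscedastic form exp(-s)·L + s; the tensor names below are hypothetical.

```python
import torch

def reweighted_photometric_loss(photo_loss, log_uncertainty):
    """Down-weight per-pixel photometric loss by predicted uncertainty.

    photo_loss:      (B, 1, H, W) per-pixel photometric reprojection error
    log_uncertainty: (B, 1, H, W) predicted log-variance; large values
                     (e.g. on moving objects) suppress the loss there.
    The exp(-s)*L + s weighting is a standard assumption, not necessarily
    the exact ProDepth loss.
    """
    return (torch.exp(-log_uncertainty) * photo_loss + log_uncertainty).mean()
```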

The pretrained single-frame network is provided to facilitate the stable and fast optimization of the multi-frame network. You can train these two networks simultaneously (with modified training hyperparameters) and achieve the same results.
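
For reference, a common way to switch between the two regimes is sketched below; the function and argument names are hypothetical stand-ins, not ProDepth's actual training flags.

```python
import itertools
import torch

def build_optimizer(mono_net, multi_net, freeze_mono=True, lr=1e-4):
    """Either freeze a pretrained single-frame network and optimize only the
    multi-frame network, or train both networks jointly."""
    if freeze_mono:
        for p in mono_net.parameters():
            p.requires_grad = False      # frozen pretrained teacher
        mono_net.eval()
        params = multi_net.parameters()
    else:
        # Joint training: single-frame losses ("mono_losses") must stay on,
        # and hyperparameters typically need retuning.
        params = itertools.chain(mono_net.parameters(), multi_net.parameters())
    return torch.optim.Adam(params, lr=lr)
```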

jiakaa commented 1 month ago

Thanks for your explanation, I get it now.