# Revealing the Dark Secrets of Masked Image Modeling (Depth Estimation) [[Paper](https://arxiv.org/abs/2205.13543)]

This is an official implementation of our CVPR 2023 paper "Revealing the Dark Secrets of Masked Image Modeling", applied to monocular depth estimation.

## Main results

### Results on NYUv2

| Backbone | d1 | d2 | d3 | abs_rel | rmse | rmse_log |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-v2-Base | 0.935 | 0.991 | 0.998 | 0.044 | 0.304 | 0.109 |
| Swin-v2-Large | 0.949 | 0.994 | 0.999 | 0.036 | 0.287 | 0.102 |

### Results on KITTI

| Backbone | d1 | d2 | d3 | abs_rel | rmse | rmse_log |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-v2-Base | 0.976 | 0.998 | 0.999 | 0.052 | 2.050 | 0.078 |
| Swin-v2-Large | 0.977 | 0.998 | 1.000 | 0.050 | 1.966 | 0.075 |
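
These are the standard monocular depth-estimation metrics: d1, d2, and d3 are the fractions of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25, 1.25², and 1.25³, while abs_rel, rmse, and rmse_log are the usual error measures (rmse is in meters). As a reference, here is a minimal NumPy sketch of how these metrics are conventionally computed; the function name is ours, not code from this repository:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels.

    pred, gt: arrays of predicted / ground-truth depth in meters.
    """
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    # Threshold accuracies: fraction of pixels with ratio below 1.25**k.
    ratio = np.maximum(pred / gt, gt / pred)
    d1 = (ratio < 1.25).mean()
    d2 = (ratio < 1.25 ** 2).mean()
    d3 = (ratio < 1.25 ** 3).mean()

    abs_rel = np.mean(np.abs(pred - gt) / gt)                       # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                       # RMSE in meters
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))   # RMSE in log-depth space

    return dict(d1=d1, d2=d2, d3=d3, abs_rel=abs_rel, rmse=rmse, rmse_log=rmse_log)
```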

## Preparation

Please refer to [GLPDepth] for configuring the environment and preparing the NYUv2 and KITTI datasets. Pretrained backbones and our well-trained models can be downloaded from the model zoo (OneDrive).
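
Once downloaded, a checkpoint can be inspected with plain PyTorch before training or evaluation. A quick sketch; the file name below is a placeholder for whichever checkpoint you fetched from the zoo:

```python
import torch

# "checkpoint.pth" is a placeholder name; use the file downloaded from the zoo.
ckpt = torch.load("checkpoint.pth", map_location="cpu")

# Checkpoints are commonly either a bare state_dict or a dict wrapping one.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")
```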

## Training
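
See the GLPDepth codebase for the training scripts and their exact arguments. For orientation only: depth networks in this family are typically fine-tuned with a scale-invariant log (SiLog) loss. The sketch below shows that loss and a hypothetical optimization step; `model`, `images`, `gt_depth`, `optimizer`, and the `lam`/`alpha` constants are common choices from the depth literature, not values taken from this repository:

```python
import torch

def silog_loss(pred, gt, lam=0.85, alpha=10.0):
    """Scale-invariant log (SiLog) loss over valid (gt > 0) pixels.

    lam and alpha are the values commonly used in the depth literature,
    not constants read out of this repository.
    """
    valid = gt > 0
    d = torch.log(pred[valid]) - torch.log(gt[valid])
    return alpha * torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2)

# One hypothetical fine-tuning step:
#   loss = silog_loss(model(images), gt_depth)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```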

## Evaluation
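
An evaluation pass amounts to running the model over the test split and averaging the metrics defined above. A sketch under the assumption of a hypothetical `model` and a loader yielding (image, ground-truth depth) batches; it reuses the `depth_metrics` function from the results section:

```python
import numpy as np
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Average depth_metrics (defined above) over a test loader.

    `model` and `loader` are assumptions: any depth network and any
    iterable of (image, ground-truth depth) batches.
    """
    model.eval()
    per_batch = []
    for images, gt in loader:
        pred = model(images.to(device)).cpu().numpy()
        per_batch.append(depth_metrics(pred, gt.numpy()))
    return {k: float(np.mean([m[k] for m in per_batch])) for k in per_batch[0]}
```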

## Citation

```bibtex
@article{xie2023darkmim,
  title={Revealing the Dark Secrets of Masked Image Modeling},
  author={Xie, Zhenda and Geng, Zigang and Hu, Jingcheng and Zhang, Zheng and Hu, Han and Cao, Yue},
  journal={arXiv preprint arXiv:2205.13543},
  year={2022}
}
```

## Acknowledgements

Our code is mainly based on GLPDepth [1]. The model code comes from Swin Transformer [2] and Simple Baseline [3].

[1] Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth. [code]

[2] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. [code]

[3] Simple Baselines for Human Pose Estimation and Tracking. [code]