Welcome to the NVDS GitHub repository!
This repository is the official PyTorch implementation of the ICCV 2023 paper "Neural Video Depth Stabilizer" (NVDS)
Authors: Yiran Wang¹, Min Shi¹, Jiaqi Li¹, Zihao Huang¹, Zhiguo Cao¹, Jianming Zhang², Ke Xian³, Guosheng Lin³
and the TPAMI 2024 paper "NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation" (NVDS+)
Authors: Yiran Wang¹, Min Shi¹, Jiaqi Li¹, Chaoyi Hong¹, Zihao Huang¹, Juewen Peng³, Zhiguo Cao¹, Jianming Zhang², Ke Xian¹, Guosheng Lin³
Institutes: ¹Huazhong University of Science and Technology, ²Adobe Research, ³Nanyang Technological University
NVDS is the first plug-and-play stabilizer: it removes flicker from the predictions of any single-image depth model without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Don't forget to star this repo if you find it interesting!
VDW dataset.
We have released the VDW dataset under strict conditions. We must ensure that the release won't violate any copyright requirements, so we will not publicly release any video frames or derived data. Instead, we provide metadata and detailed toolkits, which can be used to reproduce VDW or to generate your own data. The metadata contains IMDb numbers, start times, end times, movie durations, resolutions, and cropping areas. All the metadata and toolkits are licensed under CC BY-NC-SA 4.0 and can only be used for academic and research purposes. Please refer to our VDW official website and the VDW Toolkits for data usage.
NVDS code and model.
Following MiDaS and CVD, the NVDS code and models adopt the widely-used MIT License.
Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.
Basic environment.
Our code is based on `python=3.8.13` and `pytorch==1.9.0`. Refer to `requirements.txt` for installation:
conda create -n NVDS python=3.8.13
conda activate NVDS
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install numpy imageio opencv-python scipy tensorboard timm scikit-image tqdm h5py
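After installation, you can quickly confirm that PyTorch sees your GPU and that the versions match the ones above. This is only an illustrative sanity check, not part of the released code:

```python
# Quick environment sanity check for the NVDS setup (illustrative only).
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expected: 1.9.0
print("torchvision:", torchvision.__version__)  # expected: 0.10.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA (build):", torch.version.cuda)  # expected: 11.1
    print("GPU:", torch.cuda.get_device_name(0))
```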
Installation of GMflow.
We utilize the state-of-the-art optical flow model GMFlow for the temporal loss and the OPW metric. The temporal loss enhances consistency during training, while the OPW metric is evaluated in our demo (inference) code to quantify the improvements.
Please refer to the GMFlow Official Repo for the installation.
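For reference, OPW is an optical-flow-based warping error: the depth of frame $t+1$ is warped back to frame $t$ with the flow and compared with the depth of frame $t$ on valid (non-occluded) pixels, averaged over consecutive frame pairs. The sketch below only illustrates this idea with `grid_sample`; it assumes the flow and a validity mask have already been computed (e.g., by GMFlow), and the exact masking and normalization used in our evaluation follow the released code.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(depth_next, flow):
    """Warp the depth of frame t+1 back to frame t using the flow from t to t+1.

    depth_next: (1, 1, H, W) depth of frame t+1
    flow:       (1, 2, H, W) optical flow from frame t to frame t+1, in pixels
    """
    _, _, h, w = depth_next.shape
    xs = torch.arange(w, dtype=torch.float32).view(1, 1, w).expand(1, h, w)
    ys = torch.arange(h, dtype=torch.float32).view(1, h, 1).expand(1, h, w)
    # absolute sampling positions in frame t+1, normalized to [-1, 1] for grid_sample
    norm_x = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    norm_y = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)  # (1, H, W, 2)
    return F.grid_sample(depth_next, grid, align_corners=True)

def opw_pair(depth_t, depth_next, flow, valid_mask):
    """Mean absolute error between depth_t and the warped depth_next on valid pixels."""
    warped = warp_with_flow(depth_next, flow)
    diff = (depth_t - warped).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1)
```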
Installation of mmcv and mmseg.
Cross attention in our stabilization network contains functions based on `mmcv-full==1.3.0` and `mmseg==0.11.0`. Please refer to MMSegmentation-v0.11.0 and the official documentation for step-by-step installation instructions. The key is to match the versions of mmcv-full and mmsegmentation with the CUDA and PyTorch versions on your server. For instance, my server has `CUDA 11.1` and `PyTorch 1.9.0`, so `mmcv-full 1.3.x` and `mmseg 0.11.0` (as in our installation instructions) are compatible with my environment. Since different servers use different CUDA versions, we cannot give a single installation command that works for everyone: check the version compatibility tables in the official mmcv-full and mmseg documents and pick the versions that match your setup. Following those documents closely should make the installation straightforward. You can also refer to Issue #1 for some discussions.
Besides, we suggest installing `mmcv-full==1.x.x`, because some APIs and functions are removed in `mmcv-full==2.x.x` (you would need to adjust our code for `mmcv-full==2.x.x`).
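A quick way to verify that your mmcv-full / mmseg installation matches your PyTorch and CUDA build is to import the compiled ops and print the version information. This is an illustrative check, independent of our code; the version-reporting helpers come from mmcv-full's compiled ops:

```python
# check_mm_env.py: verify that mmcv-full / mmseg match your PyTorch + CUDA build.
import torch
import mmcv
import mmseg
from mmcv.ops import get_compiling_cuda_version, get_compiler_version

print("PyTorch:", torch.__version__, "| CUDA (build):", torch.version.cuda)
print("mmcv-full:", mmcv.__version__)   # expected: 1.3.x
print("mmseg:", mmseg.__version__)      # expected: 0.11.0
print("mmcv compiled with CUDA:", get_compiling_cuda_version())
print("mmcv compiler:", get_compiler_version())
```

If `mmcv.ops` fails to import, the precompiled mmcv-full wheel does not match your PyTorch/CUDA combination; reinstall the matching wheel following the mmcv documentation.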
Preparing Demo Videos.
We put 8 demo input videos in the `demo_videos` folder, in which `bandage_1` and `market_6` are examples from the MPI Sintel dataset and `motocross-jump` is from the DAVIS dataset. The others are examples from our VDW test set. You can also prepare your own test sequences in the same format.
Downloading checkpoints of depth predictors.
In our demo, we adopt MiDaS and DPT as different depth predictors. We use `midas_v21-f6b98070.pt` and `dpt_large-midas-2f21e586.pt`. Download these checkpoints and put them in the `dpt/checkpoints/` folder. You may need to rename the MiDaS checkpoint (`midas_v21_384.pt`) or modify our code (which expects `midas_v21-f6b98070.pt`), since the checkpoint has been renamed in the MiDaS repo.
Preparing checkpoint of NVDS Stabilizer.
Download `NVDS_Stabilizer.pth` and put it in the `NVDS_checkpoints/` folder.
Running NVDS Inference Demo.
`infer_NVDS_dpt_bi.py` and `infer_NVDS_midas_bi.py` use DPT and MiDaS as depth predictors, respectively. The scripts contain: (1) NVDS bidirectional inference; (2) OPW metric evaluation with GMFlow. The only difference between the two scripts is the depth predictor. Taking DPT as an example, the basic command is:
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir /XXX/XXX --vnum XXX --infer_w XXX --infer_h XXX
`--base_dir` is the folder in which the results are saved. `--vnum` is the video number or name. `--infer_w` and `--infer_h` are the width and height for inference. We use `--infer_h 384` by default and set `--infer_w` to maintain the aspect ratio of the original video. Both `--infer_w` and `--infer_h` should be integer multiples of 32 so that resolutions stay aligned during the up-sampling and down-sampling processes.
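For your own videos, a simple way to pick `--infer_w` is to keep the aspect ratio at `--infer_h 384` and round to the nearest multiple of 32. The helper below is only for illustration and is not part of the released scripts:

```python
def pick_infer_width(orig_w, orig_h, infer_h=384, multiple=32):
    """Choose --infer_w that keeps the aspect ratio and is a multiple of 32."""
    ideal_w = orig_w * infer_h / orig_h
    return max(multiple, int(round(ideal_w / multiple)) * multiple)

print(pick_infer_width(1024, 436))  # MPI Sintel (1024x436)  -> 896, as in the commands below
print(pick_infer_width(854, 480))   # DAVIS 480p (854x480)   -> 672, as in the commands below
```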
Specifically, for videos of the VDW test set (`000423` as an example):
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/000423/ --vnum 000423 --infer_w 896 --infer_h 384
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/000423/ --vnum 000423 --infer_w 896 --infer_h 384
For videos of the Sintel dataset (`market_6` as an example):
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/market_6/ --vnum market_6 --infer_w 896 --infer_h 384
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/market_6/ --vnum market_6 --infer_w 896 --infer_h 384
For videos of the DAVIS dataset (`motocross-jump` as an example):
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/motocross-jump/ --vnum motocross-jump --infer_w 672 --infer_h 384
CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/motocross-jump/ --vnum motocross-jump --infer_w 672 --infer_h 384
Under a resolution of $896\times384$, inference with DPT-Large and our stabilizer takes about 20 GB and 5 GB of GPU memory, respectively (measured on an RTX A6000). If the GPU memory or inference latency is too large for your application, you can: (1) run the DPT/MiDaS initial depth and our NVDS stabilizer separately; (2) reduce the inference resolution (e.g., $384\times384$); (3) remove the OPW evaluation if it is not needed, since GMFlow inference also adds some computational cost; or (4) remove the bidirectional (backward and mixing) inference if it is not needed. Forward inference alone already produces satisfactory results, while bidirectional inference further improves consistency.
After running the inference code, the result folder `--base_dir` will be organized as follows:
demo_outputs/dpt_init/000423/
├── result.txt
├── initial/
│   ├── color/
│   └── gray/
├── 1/
│   ├── color/
│   └── gray/
├── 2/
│   ├── color/
│   └── gray/
└── mix/
    ├── color/
    └── gray/
`result.txt` contains the OPW evaluations of the initial depth (depth predictor, `initial/`), the NVDS forward predictions (`1/`), the backward predictions (`2/`), and the final bidirectional results (`mix/`). The `color` folders contain depth visualizations, and the `gray` folders contain depth results in uint16 format (0-65535).
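If you want to post-process the results, the uint16 maps in `gray` can be loaded and rescaled back to $[0, 1]$; note that the predictions are relative (up to scale and shift), as with MiDaS and DPT. The snippet below is illustrative only, and the file name is hypothetical:

```python
import cv2
import numpy as np

# Load a 16-bit depth map exactly as stored (0-65535); the path/file name is hypothetical.
gray = cv2.imread("demo_outputs/dpt_init/000423/mix/gray/frame_000000.png",
                  cv2.IMREAD_UNCHANGED).astype(np.float32)
depth = gray / 65535.0  # relative depth rescaled to [0, 1]

# Re-colorize for visualization, similar to the maps in the color folders.
vis = cv2.applyColorMap((depth * 255).astype(np.uint8), cv2.COLORMAP_INFERNO)
cv2.imwrite("frame_000000_vis.png", vis)
```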
Video Comparisons.
After getting the results, video comparisons can be generated and saved in `demo_outputs_videos/`:
python pic2v.py --vnum 000423 --infer_w 896 --infer_h 384
python pic2v.py --vnum market_6 --infer_w 896 --infer_h 384
python pic2v.py --vnum motocross-jump --infer_w 672 --infer_h 384
We show 8 video comparisons in demo_outputs_videos/
. The first row is RGB video, the second row is initial depth (DPT and MiDaS), and the third row is NVDS results with DPT and MiDaS as depth predictors. To ensure the correctness of your running results, you can compare the results you obtained with demo_outputs_videos
and demo_outputs
(png results). We show png results of the 8 videos by LINK. Besides, you are also encouraged to modify our code to stabilize your own depth predictors and discuss the results with us. We hope our work can serve as a solid baseline for future works in video depth estimation and other relevant tasks.
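If you want to build a similar side-by-side comparison for your own depth predictor, a minimal sketch (with hypothetical paths, independent of `pic2v.py`) could stack the RGB frame, the initial depth, and the NVDS result and write them out with OpenCV:

```python
import glob
import cv2
import numpy as np

# Hypothetical paths: RGB frames, initial depth visualizations, and NVDS results.
rgb_paths  = sorted(glob.glob("demo_videos/000423/*.png"))
init_paths = sorted(glob.glob("demo_outputs/dpt_init/000423/initial/color/*.png"))
nvds_paths = sorted(glob.glob("demo_outputs/dpt_init/000423/mix/color/*.png"))

writer = None
for rgb_p, init_p, nvds_p in zip(rgb_paths, init_paths, nvds_paths):
    rows = [cv2.imread(p) for p in (rgb_p, init_p, nvds_p)]
    h, w = rows[0].shape[:2]
    rows = [cv2.resize(r, (w, h)) for r in rows]   # align all rows to the RGB resolution
    frame = np.concatenate(rows, axis=0)           # RGB / initial depth / NVDS, top to bottom
    if writer is None:
        writer = cv2.VideoWriter("comparison.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                                 24, (frame.shape[1], frame.shape[0]))
    writer.write(frame)
if writer is not None:
    writer.release()
```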
Preparing 654 testing sequences.
Download the 654 testing sequences from LINK and put them in the `./test_nyu_data` folder. The `./test_nyu_data` folder should contain only the 654 folders of the testing sequences. Each sequence folder is organized as follows:
test_nyu_data/1/
├── rgb/
│   └── 000000.png 000001.png 000002.png 000003.png
└── gt/
    └── 000003.png
We follow the commonly applied Eigen split with 654 images for testing. In our case, we locate each image (`000003.png`) in its video and use its previous three frames (`000000.png`, `000001.png`, and `000002.png`) as reference frames.
Preparing NVDS checkpoint finetuned on NYUDV2.
Download `NVDS_Stabilizer_NYUDV2_Finetuned.pth` and put it in the `NVDS_checkpoints/` folder.
Evaluations with MiDaS and DPT as different depth predictors.
Run `test_NYU_depth_metrics.py` with the depth predictor specified by `--initial_type dpt` or `--initial_type midas`:
CUDA_VISIBLE_DEVICES=0 python test_NYU_depth_metrics.py --initial_type dpt
CUDA_VISIBLE_DEVICES=1 python test_NYU_depth_metrics.py --initial_type midas
`test_NYU_depth_metrics.py` contains three parts: (1) inference of the depth predictor, producing the initial MiDaS or DPT results; (2) inference of NVDS based on the initial results; (3) metric evaluation of the depth predictor and of NVDS. All inference is conducted at a resolution of $384\times384$, as in MiDaS and DPT. For simplicity, we only adopt NVDS forward prediction in this code. By running it, you can reproduce results similar to those in our paper:
Methods | $\delta_1$ | $Rel$ | Methods | $\delta_1$ | $Rel$
---|---|---|---|---|---
MiDaS | 0.910 | 0.095 | DPT | 0.928 | 0.084
NVDS (MiDaS) | 0.941 | 0.076 | NVDS (DPT) | 0.950 | 0.072
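For reference, $\delta_1$ is the fraction of pixels where $\max(d / \hat{d}, \hat{d} / d) < 1.25$, and $Rel$ is the mean absolute relative error $\frac{1}{N}\sum |d - \hat{d}| / \hat{d}$, with $d$ the prediction and $\hat{d}$ the ground truth. A minimal sketch of both metrics (assuming the prediction has already been aligned to the ground-truth scale, which the evaluation script handles) is:

```python
import numpy as np

def depth_metrics(pred, gt, valid):
    """delta_1 and absolute relative error over valid ground-truth pixels."""
    pred, gt = pred[valid], gt[valid]
    delta1 = (np.maximum(pred / gt, gt / pred) < 1.25).mean()
    rel = (np.abs(pred - gt) / gt).mean()
    return delta1, rel
```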
After running the evaluation code, `test_nyu_data` will be organized as follows:
test_nyu_data/1/
├── rgb/
│   └── 000000.png 000001.png 000002.png 000003.png
├── gt/
│   └── 000003.png
├── initial_midas/
│   └── 000000.png 000001.png 000002.png 000003.png
├── initial_dpt/
│   └── 000000.png 000001.png 000002.png 000003.png
├── NVDS_midas/
│   └── 000003.png
└── NVDS_dpt/
    └── 000003.png
We evaluate the depth metrics of all methods only on the 654 images of the Eigen split, i.e., `000003.png` of each sequence. `000000.png`, `000001.png`, and `000002.png` are produced by the depth predictors and serve as inputs to the stabilization network.
Applying for the VDW test set.
Here we take `/xxx/vdw_test` as an example. The VDW test set contains 90 videos with 12,622 frames. Each video (e.g., `/xxx/vdw_test/000008/`) is organized as follows: the `left` and `right` folders contain the RGB frames of the left and right views, the `left_gt`/`right_gt` folders contain the disparity annotations, and the `left_mask`/`right_mask` folders contain the valid masks.
/xxx/vdw_test/000008/
├── left/
│   └── frame_000000.png frame_000001.png frame_000002.png ...
├── left_gt/
│   └── frame_000000.png frame_000001.png frame_000002.png ...
├── left_mask/
│   └── frame_000000.png frame_000001.png frame_000002.png ...
├── right/
│   └── frame_000000.png frame_000001.png frame_000002.png ...
├── right_gt/
│   └── frame_000000.png frame_000001.png frame_000002.png ...
└── right_mask/
    └── frame_000000.png frame_000001.png frame_000002.png ...
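Before launching the evaluation scripts, you can run a quick structural check (illustrative only, assuming the layout above) to confirm that every video has matching frame, ground-truth, and mask counts:

```python
import os

VDW_ROOT = "/xxx/vdw_test"  # replace with your VDW test set path

for vnum in sorted(os.listdir(VDW_ROOT)):
    vdir = os.path.join(VDW_ROOT, vnum)
    if not os.path.isdir(vdir):
        continue
    counts = {sub: len(os.listdir(os.path.join(vdir, sub)))
              for sub in ("left", "left_gt", "left_mask")}
    assert len(set(counts.values())) == 1, f"{vnum}: mismatched folders {counts}"
    print(vnum, counts["left"], "frames")
```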
Inference and evaluations for each test video.
For each test video, the evaluation consists of two steps: (1) inference; and (2) depth metric evaluation. We provide `write_sh.py` to generate the evaluation scripts (for MiDaS and DPT). You should modify some details in `write_sh.py` (e.g., the GPU number, the VDW test set path, and the directories for saving NVDS results with MiDaS/DPT) to generate `test_VDW_NVDS_Midas.sh` and `test_VDW_NVDS_DPT.sh`. We provide the two example scripts with `/xxx/` as placeholders for those directories.
To be specific, (1) the inference step is the same as in the Demo & Inference part above, using `infer_NVDS_dpt_bi.py` and `infer_NVDS_midas_bi.py`. In this step, the temporal metric OPW is automatically evaluated and saved in `result.txt`. (2) The depth metric evaluation uses `vdw_test_metric.py` to calculate $\delta_1$ and $Rel$ for each video. Taking `./vdw_test/000008/` as an example, `--gt_dir` specifies the path to `vdw_test`, `--result_dir` specifies your directory for saving the results, and `--vnum` is the video number.
python vdw_test_metric.py --gt_dir /xxx/vdw_test/ --result_dir /xxx/NVDS_VDW_Test/Midas/ --vnum 000008
python vdw_test_metric.py --gt_dir /xxx/vdw_test/ --result_dir /xxx/NVDS_VDW_Test/DPT/ --vnum 000008
After generating `test_VDW_NVDS_Midas.sh` and `test_VDW_NVDS_DPT.sh`, you can run inference and evaluation for all videos with:
bash test_VDW_NVDS_Midas.sh
bash test_VDW_NVDS_DPT.sh
Average metrics calculations for all 90 videos.
When the scripts have finished for all videos, the `NVDS_VDW_Test` folder will contain the results of the 90 test videos with MiDaS/DPT as depth predictors (`/xxx/NVDS_VDW_Test/Midas/` and `/xxx/NVDS_VDW_Test/DPT/`). For each video, there is an `accuracy.txt` that stores the depth metrics. The last step is to calculate the average temporal and depth metrics over all 90 videos: simply run `cal_mean_vdw_metric.py` for the final results.
python cal_mean_vdw_metric.py --test_dir /xxx/NVDS_VDW_Test/Midas/
python cal_mean_vdw_metric.py --test_dir /xxx/NVDS_VDW_Test/DPT/
Finally, you should obtain the same results as in our paper. This also serves as an example of how to conduct evaluations on the VDW test set.
Methods | $\delta_1$ | $Rel$ | $OPW$ | Methods | $\delta_1$ | $Rel$ | $OPW$
---|---|---|---|---|---|---|---
MiDaS | 0.651 | 0.288 | 0.676 | DPT | 0.730 | 0.215 | 0.470
NVDS-Forward (MiDaS) | 0.700 | 0.240 | 0.207 | NVDS-Forward (DPT) | 0.741 | 0.208 | 0.165
NVDS-Backward (MiDaS) | 0.699 | 0.240 | 0.218 | NVDS-Backward (DPT) | 0.741 | 0.208 | 0.174
NVDS-Final (MiDaS) | 0.700 | 0.240 | 0.180 | NVDS-Final (DPT) | 0.742 | 0.208 | 0.147
We thank the authors for releasing PyTorch, MiDaS, DPT, GMFlow, SegFormer, VSS-CFFM, Mask2Former, PySceneDetect, and FFmpeg. Thanks for their solid contributions and cheers to the community.
@InProceedings{NVDS,
author = {Wang, Yiran and Shi, Min and Li, Jiaqi and Huang, Zihao and Cao, Zhiguo and Zhang, Jianming and Xian, Ke and Lin, Guosheng},
title = {Neural Video Depth Stabilizer},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {9466-9476}}
@ARTICLE{NVDSPLUS,
author={Wang, Yiran and Shi, Min and Li, Jiaqi and Hong, Chaoyi and Huang, Zihao and Peng, Juewen and Cao, Zhiguo and Zhang, Jianming and Xian, Ke and Lin, Guosheng},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={NVDS$^{\mathbf{+}}$: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation},
year={2024},
pages={1-18},
doi={10.1109/TPAMI.2024.3476387}}