:rotating_light: This repository contains download links to our dataset, code snippets, and trained deep stereo models of our work "NeRF-Supervised Deep Stereo", CVPR 2023
by Fabio Tosi1, Alessio Tonioni2, Daniele De Gregorio3 and Matteo Poggi1
University of Bologna1, Google Inc.2, Eyecan.ai3
We introduce a pioneering pipeline that leverages NeRF to train deep stereo networks without the requirement of ground-truth depth or stereo cameras. By capturing images with a single low-cost handheld camera, we generate thousands of stereo pairs for training through our NS paradigm. This approach results in state-of-the-art zero-shot generalization, surpassing both self-supervised and supervised methods.
Contributions:
Introducing a novel paradigm for collecting and generating stereo training data using neural rendering and a collection of user-captured image sequences. Our methodology revolutionizes stereo network training by leveraging readily available user-captured images, eliminating the need for synthetic datasets, ground-truth depth, or even real stereo pairs!
A NeRF-Supervised (NS) training protocol that combines rendered image triplets and depth maps to address occlusions and enhance fine details.
State-of-the-art, zero-shot generalization results on challenging stereo datasets, without exploiting any ground truth or real stereo pairs.
:fountain_pen: If you find this code useful in your research, please cite:
@inproceedings{Tosi_2023_CVPR,
author = {Tosi, Fabio and Tonioni, Alessio and De Gregorio, Daniele and Poggi, Matteo},
title = {NeRF-Supervised Deep Stereo},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {855-866}
}
We collect a total of 270 high-resolution (8 Mpx) scenes in both indoor and outdoor environments using standard camera-equipped smartphones. For each scene, we focus on one or more specific objects and acquire 100 images from different viewpoints, ensuring that the scenery is completely static. The acquisition protocol involves a set of either front-facing or 360° views.
Examples of scenes in our dataset. Here we report individual examples derived from 30 different scenes that comprise our dataset.
After downloading the dataset from the provided link, you will find two folders:
raw_data_v1: This folder contains one zip file for each of the 270 scenes. Each zip file holds the raw high-resolution RGB images captured with our smartphones, which were later used to generate stereo pairs using NeRF, together with the camera poses obtained through COLMAP.
stereo_dataset_v1: This folder contains one zip file for each of the 270 scenes. Each zip file holds the rendered stereo pairs generated by NeRF, along with the corresponding disparity maps and ambient occlusion (AO) maps. The disparity maps and AO maps are saved as 16-bit images. To obtain the actual disparity values, divide the disparity map values by 64. Similarly, divide the AO map values by 65536. Please note that, due to space constraints, we provide only the 0.5 Mpx stereo images, which were used to train the released stereo models.
⚠️ Disparity Map Alignment: All the disparity maps provided in the dataset are aligned with the 'center' image of each triplet.
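The scaling described above can be applied with a few lines of NumPy. The following is a minimal sketch, not code from this repository: the function names `decode_disparity` and `decode_ao` are hypothetical, and it assumes the 16-bit images have already been loaded unchanged into `uint16` arrays (for example with OpenCV's `cv2.imread(path, cv2.IMREAD_UNCHANGED)`).

```python
import numpy as np

DISPARITY_SCALE = 64.0   # stored value = disparity * 64, per the dataset description
AO_SCALE = 65536.0       # stored value = AO * 65536, per the dataset description


def decode_disparity(raw: np.ndarray) -> np.ndarray:
    """Convert a 16-bit disparity image to float disparities in pixels."""
    return raw.astype(np.float32) / DISPARITY_SCALE


def decode_ao(raw: np.ndarray) -> np.ndarray:
    """Convert a 16-bit ambient occlusion image to floats in [0, 1)."""
    return raw.astype(np.float32) / AO_SCALE


# Synthetic stand-ins for images loaded from disk as uint16 arrays.
raw_disp = np.array([[0, 64, 6400]], dtype=np.uint16)
raw_ao = np.array([[0, 32768, 65535]], dtype=np.uint16)

disparity = decode_disparity(raw_disp)   # 0.0, 1.0, 100.0
ao = decode_ao(raw_ao)                   # 0.0, 0.5, just below 1.0
```

Converting to floating point before dividing matters here: integer division on the raw `uint16` values would discard the sub-pixel precision that the factor of 64 is meant to preserve.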
Please refer to the dataset documentation for more detailed instructions on using the dataset effectively.
Here, you can download the weights of RAFT-Stereo and PSMNet architectures. These models were trained from scratch on rendered triplets of our real-world dataset using our NeRF-Supervised training loss.
To use these weights, please follow these steps:
Create a folder named weights in the project directory.
Download the weights of the model you want to use and place them in the weights folder.
The Test section provides scripts to evaluate disparity estimation models on datasets like KITTI, Middlebury, and ETH3D. It helps assess the accuracy of the models and saves predicted disparity maps.
The Demo section allows you to quickly generate a disparity map for a pair of stereo images.
Please refer to each section for detailed instructions on setup and execution.
Dependencies: Ensure that you have installed all the necessary dependencies. The list of dependencies can be found in the ./code_snippets/requirements.txt file.
Clone the repositories:
For RAFT-Stereo:
git clone https://github.com/princeton-vl/RAFT-Stereo
Then copy the contents of its core folder.
(Please note that we have modified the RAFT-Stereo implementation by changing one line of code in the raft_stereo.py file. Previously, line 136 of raft_stereo.py read as follows:)
flow_predictions.append(flow_up)
We have made the following change:
flow_predictions.append(-flow_up)
⚠️ Warning: during the evaluation phase, please ensure that you change the iters parameter in the original RAFT-Stereo code from the default value of 12 to 32, as indicated in the original RAFT-Stereo paper.
For PSMNet:
git clone https://github.com/JiaRenChang/PSMNet
Then copy the contents of its models folder.
Paste files: Paste the copied contents into the ./models/raft-stereo or ./models/psmnet folder in your project directory.
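The sign change described for RAFT-Stereo above can be illustrated with a toy example. This is only a sketch of the convention, assuming rectified images in which a point at column x in the left image appears at column x - d in the right image, with d the positive disparity. One plausible reading of the change is that it turns a negative left-to-right horizontal flow into a positive disparity. The variable names are illustrative and nothing here is taken from the RAFT-Stereo code.

```python
# A point seen at column 120 in the left image and at column 100 in the right image.
x_left = 120.0
x_right = 100.0

# Horizontal flow from the left image to the right image is negative...
flow = x_right - x_left

# ...whereas disparity is conventionally positive, so the two differ only by a sign.
disparity = x_left - x_right

assert disparity == -flow
```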
This code snippet allows you to evaluate the disparity maps on various datasets, including KITTI (2012 and 2015), Middlebury (Training, Additional, 2021), and ETH3D. By executing the provided script, you can assess the accuracy of disparity estimation models on these datasets.
To run the test.py script with the correct arguments, follow the instructions below:
Run the test: Navigate to the directory containing the test.py script.
Execute the command: Run the following command, replacing the placeholders with the actual values for your dataset and model:
python test.py --datapath <path_to_dataset> --dataset <dataset_type> --version <dataset_version> --model <model_name> --loadmodel <path_to_pretrained_model> --maxdisp <max_disparity> --outdir <output_directory> --occ
Replace every placeholder in angle brackets with the actual value for your setup.
The available arguments are:
--datapath: Path to the dataset.
--dataset: Dataset type (e.g., middlebury, kitti).
--version: Dataset version.
--model: Model to use. Options: raft-stereo, psmnet.
--outdir: Output directory where the disparity maps are saved.
--loadmodel: Path to the pretrained model file.
--occ: Include occluded regions in the evaluation.
--maxdisp: Maximum disparity value (default 256).
For more details, please refer to the test.sh script in the code_snippets folder.
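To make the evaluation more concrete, here is a minimal sketch of a bad-pixel metric of the kind reported in the qualitative results (percentage of pixels with error > 2). It is not the implementation used by test.py: the function name `bad_pixel_rate` is hypothetical, and treating the choice between all pixels and non-occluded pixels as a boolean mask is an assumption about what the --occ flag controls.

```python
import numpy as np


def bad_pixel_rate(pred, gt, valid, threshold=2.0):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold."""
    valid = valid.astype(bool)
    if not valid.any():
        return float("nan")   # nothing to evaluate
    error = np.abs(pred[valid] - gt[valid])
    return 100.0 * float(np.mean(error > threshold))


gt = np.array([[10.0, 20.0, 30.0, 40.0]])
pred = np.array([[10.5, 23.0, 30.0, 45.0]])

all_pixels = np.ones_like(gt, dtype=bool)              # occluded pixels included
non_occluded = np.array([[True, True, True, False]])   # last pixel treated as occluded

rate_all = bad_pixel_rate(pred, gt, all_pixels)        # 2 of 4 pixels exceed the threshold
rate_noc = bad_pixel_rate(pred, gt, non_occluded)      # 1 of 3 pixels exceeds the threshold
```

In this toy case the occluded pixel carries a large error, so including it raises the score from about 33% to 50%, which is why results with and without occluded regions should not be compared directly.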
You can use the demo.py script to estimate a disparity map from a single stereo pair. The script saves the predicted disparity at the specified output path. Follow the instructions below to run the demo:
Run the demo: Navigate to the directory containing the demo.py script.
Execute the command: Run the following command, replacing the placeholders with the actual values for your images and model:
python demo.py --left <path_to_left_image> --right <path_to_right_image> --output <path_to_output_disparity> --model <model_name> --loadmodel <path_to_pretrained_model> --maxdisp <max_disparity>
--left: Path to the left image.
--right: Path to the right image.
--output: Path to save the predicted disparity map.
--model: Select the model. Options: psmnet or raft-stereo.
--loadmodel: Path to the pretrained model file.
--maxdisp: Maximum disparity value (default 256).
Make sure to replace every placeholder in angle brackets with the actual value for your images and model.
Example: You can try the trained deep stereo models using the sample stereo pair from the Middlebury dataset available in the images folder. You can use the provided im0.png and im1.png images as follows:
python demo.py --left images/im0.png --right images/im1.png --output images/disparity_map.png --model raft-stereo --loadmodel ./weights/raftstereo-NS.tar
This command will estimate a disparity map using the selected deep stereo model (raft-stereo) on the provided stereo pair (im0.png and im1.png). The predicted disparity will be saved as disparity_map.png in the images folder.
If you haven't downloaded the pretrained models yet, you can find the download links in the Pretrained Models section above.
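If you want to run the demo on several stereo pairs, the command can be assembled programmatically. The sketch below is a hypothetical helper (`build_demo_command` is not part of this repository) that uses only the flags documented above. It assumes it is launched from the directory containing demo.py.

```python
import sys


def build_demo_command(left, right, output, model, weights, maxdisp=None):
    """Assemble the demo.py command line from the documented flags."""
    if model not in ("raft-stereo", "psmnet"):
        raise ValueError(f"unsupported model: {model}")
    cmd = [
        sys.executable, "demo.py",
        "--left", left,
        "--right", right,
        "--output", output,
        "--model", model,
        "--loadmodel", weights,
    ]
    if maxdisp is not None:   # otherwise demo.py falls back to its own default
        cmd += ["--maxdisp", str(maxdisp)]
    return cmd


cmd = build_demo_command(
    left="images/im0.png",
    right="images/im1.png",
    output="images/disparity_map.png",
    model="raft-stereo",
    weights="./weights/raftstereo-NS.tar",
)
print(" ".join(cmd[1:]))

# To actually run the demo:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when image paths contain spaces.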
To train a NeRF model starting from a scene captured with a single camera, you can utilize various NeRF implementations available. One such implementation that we have used in our experiments is Instant-NGP. Instant-NGP offers high accuracy and fast training times, making it suitable for training multiple NeRF models and rendering thousands of images quickly.
Please refer to the Instant-NGP repository and follow their instructions for training NeRF models. While we used Instant-NGP in our experiments, you are free to choose any other NeRF implementation that suits your needs.
In addition, we provide a code snippet named generate_stereo_pair_matrix.py in the code_snippets folder. This code is used to generate stereo pairs from a transform.json file, which is typically used in Instant-NGP. You can use this code to facilitate the creation of stereo pairs for your NeRF training. Feel free to customize and adapt it according to your specific requirements.
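The idea of deriving virtual stereo cameras from a single captured pose can be sketched as follows. This is not the code in generate_stereo_pair_matrix.py. It is a minimal illustration that assumes each frame provides a 4x4 camera-to-world matrix whose first column is the camera's x (right) axis, as in NeRF-style transforms files. The function names are hypothetical and the baseline is expressed in the scene's own units.

```python
import numpy as np


def shift_along_camera_x(c2w: np.ndarray, offset: float) -> np.ndarray:
    """Copy a 4x4 camera-to-world matrix, translated along the camera's own x axis."""
    shifted = c2w.copy()
    shifted[:3, 3] += offset * c2w[:3, 0]   # first column = camera x axis in world coordinates
    return shifted


def make_triplet(c2w: np.ndarray, baseline: float):
    """Left, center and right poses that share the same orientation."""
    return (
        shift_along_camera_x(c2w, -baseline),
        c2w.copy(),
        shift_along_camera_x(c2w, +baseline),
    )


# Example pose: camera rotated 90 degrees about the world y axis, placed at (1, 2, 3).
center = np.array([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [-1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
])
left, _, right = make_triplet(center, baseline=0.5)
```

Because only the translation changes and the rotation is shared, the three rendered views are horizontally aligned, so no rectification step is required before stereo matching.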
In this section, we present illustrative examples that demonstrate the effectiveness of our proposal.
Arbitrary Baseline. Here, we show the remarkable capability of NeRF to effortlessly produce stereo pairs with arbitrary baseline configurations, employing them on a diverse array of scenes captured from our newly curated collection of images.
Examples of Rendered Images and Depth from NeRF. We show examples on a scene of our dataset. In each case, the leftmost and rightmost columns show the rendered left and right images in a triplet, respectively. These images were obtained using small, medium, and large baselines, as indicated by the red, green, and blue lines. The center column, from top to bottom, shows the center image in the triplet, its corresponding rendered disparity map, and ambient occlusion map. Here, we adopt the Instant-NGP framework to render images.
Effect of Training Losses. From left to right: reference image, followed by disparity maps computed by the RAFT-Stereo network trained using the popular binocular photometric loss between two images of a rectified stereo pair, the triplet photometric loss between three horizontally aligned images, the proxy-supervised loss from Aleotti et al., ECCV 2020 and, finally, our proposed NeRF-Supervised loss. Please zoom in to better perceive fine details.
Qualitative Comparison on Midd-A H (top) and Midd-21 (bottom). From left to right: left images and disparity maps by RAFT-Stereo models, respectively trained with MfS or NS. Under each disparity map, the percentage of pixels with error > 2.
For questions, please send an email to fabio.tosi5@unibo.it or m.poggi@unibo.it
(*) This is not an officially supported Google product.
We would like to extend our sincere appreciation to the authors of the following projects for making their code available, which we have utilized in our work:
We would like to thank the authors of PSMNet, RAFT-Stereo and CFNet for providing their code, which has been instrumental in our stereo matching experiments.
We express our gratitude to the authors of Instant-NGP for releasing their code, enabling us to train accurate NeRF models efficiently.
We would like to thank the authors of Stereo-from-Mono for their code, which has been valuable in our evaluation of disparity maps.
We deeply appreciate the authors of the competing research papers for their helpful responses and for providing model weights, which greatly aided accurate comparisons.