Paper | Video | Project Page
This is the official implementation for the 3DV 2024 paper:
S4C: Self-Supervised Semantic Scene Completion with Neural Fields
Adrian Hayler¹, Felix Wimbauer¹, Dominik Muhle¹, Christian Rupprecht³ and Daniel Cremers¹
¹Technical University of Munich, ²Munich Center for Machine Learning, ³University of Oxford
If you find our work useful, please consider citing our paper:
@article{hayler2023s4c,
title={S4C: Self-Supervised Semantic Scene Completion with Neural Fields},
author={Hayler, Adrian and Wimbauer, Felix and Muhle, Dominik and Rupprecht, Christian and Cremers, Daniel},
journal={arXiv preprint arXiv:2310.07522},
year={2023}
}
Our proposed method can reconstruct a scene from a single image and relies only on videos and pseudo segmentation ground truth generated by an off-the-shelf image segmentation network during training. Unlike existing methods, which use discrete voxel grids, we represent scenes as implicit semantic fields. This formulation allows querying any point within the camera frustum for occupancy and semantic class. Our architecture is trained through rendering-based self-supervised losses. Nonetheless, our method achieves performance close to fully supervised state-of-the-art methods. Additionally, it demonstrates strong generalization capabilities and can synthesize accurate segmentation maps for far-away viewpoints.
a) From an input image $\textbf{I}_\textbf{I}$, an encoder-decoder network predicts a pixel-aligned feature map $\textbf{F}$ describing a semantic field in the frustum of the image. The feature $f_{\textbf{u}_i}$ of pixel $\textbf{u}_i$ encodes the semantic and occupancy distribution along the ray cast from the optical center through the pixel. b) The semantic field allows rendering novel views and their corresponding semantic segmentation via volumetric rendering. A 3D point $\textbf{x}_i$ is projected into the input image, and thus into $\textbf{F}$, to sample $f_{\textbf{u}_i}$. Combined with the positional encoding of $\textbf{x}_i$, two MLPs decode the density $\sigma_i$ and the semantic label $l_i$ of the point, respectively. The color $c_i$ for novel view synthesis is obtained from other images via color sampling. c) To achieve the best results, we require the training views to cover as much of the scene's surface as possible. Therefore, we sample side views from random future timesteps that observe areas of the scene occluded in the input frame.
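The following is a minimal, self-contained PyTorch sketch of this decoding and rendering step. It is not the repository's actual implementation; the module names, feature dimension (64), number of encoding frequencies, and the 19-class output are illustrative assumptions, meant only to make the density/semantics MLPs and the volumetric rendering of semantics concrete.

```python
# Minimal sketch of the semantic-field decoding and rendering described above.
# Not the repository's modules; all names, sizes, and class counts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def positional_encoding(x, num_freqs=6):
    """Sin/cos positional encoding of 3D points (..., 3) -> (..., 6 * num_freqs)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x.unsqueeze(-1) * freqs                      # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 3, 2 * num_freqs)
    return enc.flatten(-2)


class SemanticFieldHeads(nn.Module):
    """Two small MLPs: one decodes the density sigma_i, the other semantic logits l_i."""

    def __init__(self, feat_dim=64, pe_dim=36, num_classes=19, hidden=64):
        super().__init__()
        in_dim = feat_dim + pe_dim
        self.density_mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.semantic_mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, feats, points):
        h = torch.cat([feats, positional_encoding(points)], dim=-1)
        sigma = F.softplus(self.density_mlp(h))   # non-negative density per sample
        logits = self.semantic_mlp(h)             # per-class logits per sample
        return sigma, logits


def render_semantics(sigma, logits, deltas):
    """Alpha-composite per-sample class probabilities along each ray."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)                          # (B, R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[..., :-1]
    weights = alpha * trans                                                        # (B, R, S)
    return (weights.unsqueeze(-1) * logits.softmax(dim=-1)).sum(dim=-2)            # (B, R, C)


# Toy usage: 1 image, 128 rays, 32 samples per ray.
heads = SemanticFieldHeads()
feats = torch.randn(1, 128, 32, 64)      # pixel-aligned features sampled along each ray
points = torch.rand(1, 128, 32, 3)       # 3D sample positions x_i
deltas = torch.full((1, 128, 32), 0.1)   # distances between consecutive samples
sigma, logits = heads(feats, points)
semantics = render_semantics(sigma, logits, deltas)   # (1, 128, 19) per-ray class probabilities
```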
We use Conda to manage our Python environment:
conda env create -f environment.yml
Then, activate the conda environment:
conda activate s4c
All non-standard data (like precomputed poses and datasplits) comes with this repository and can be found in the `datasets/` folder.
In addition, please adjust `data_path` and `data_segmentation_path` in `configs/data/kitti_360.yaml`.
We explain how to obtain these datasets in the sections KITTI-360 and Pseudo-Ground-Truth Segmentation masks.
For `data_path`, the folder you link to should have the following structure:
calibration
data_2d_raw
data_2d_semantics
data_3d_bboxes
data_3d_raw
data_poses
For `data_segmentation_path`, the folder you link to should have the following structure:
2013_05_28_drive_0000_sync 2013_05_28_drive_0004_sync 2013_05_28_drive_0007_sync
2013_05_28_drive_0002_sync 2013_05_28_drive_0005_sync 2013_05_28_drive_0009_sync
2013_05_28_drive_0003_sync 2013_05_28_drive_0006_sync 2013_05_28_drive_0010_sync
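If you want to double-check both paths before training, a small hypothetical helper like the one below (not part of the repository) verifies that the folders listed above exist; adjust the two paths to the values you set in `configs/data/kitti_360.yaml`.

```python
# Hypothetical sanity check for the folder layouts listed above.
from pathlib import Path

DATA_PATH = Path("/path/to/KITTI-360")                  # -> data_path
DATA_SEGMENTATION_PATH = Path("/path/to/segmentation")  # -> data_segmentation_path

KITTI_360_SUBDIRS = [
    "calibration", "data_2d_raw", "data_2d_semantics",
    "data_3d_bboxes", "data_3d_raw", "data_poses",
]
SEGMENTATION_SEQUENCES = [
    "2013_05_28_drive_0000_sync", "2013_05_28_drive_0002_sync", "2013_05_28_drive_0003_sync",
    "2013_05_28_drive_0004_sync", "2013_05_28_drive_0005_sync", "2013_05_28_drive_0006_sync",
    "2013_05_28_drive_0007_sync", "2013_05_28_drive_0009_sync", "2013_05_28_drive_0010_sync",
]


def check(base: Path, expected: list) -> None:
    """Print any expected subfolder that is missing under the given base directory."""
    missing = [name for name in expected if not (base / name).is_dir()]
    print(f"{base}: missing {missing}" if missing else f"{base}: all expected folders found")


check(DATA_PATH, KITTI_360_SUBDIRS)
check(DATA_SEGMENTATION_PATH, SEGMENTATION_SEQUENCES)
```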
To download KITTI-360, go to https://www.cvlibs.net/datasets/kitti-360/index.php and create an account. We require the perspective images, fisheye images, raw velodyne scans, calibrations, and vehicle poses.
You can download the pseudo-ground-truth segmentation masks from here.
Alternatively, you can generate them yourself. For this, we use the Panoptic Deeplab model zoo (CVPR 2020).
First, create and activate a new conda environment following the instructions laid out here. You can find the `requirements.txt` file under `datasets/panoptic-deeplab/requirements.txt`.
You also need to download the R101-os32 cityscapes baseline model.
Afterwards, you can run:
python <path-to-script>/preprocess_kitti_360_segmentation.py \
--cfg datasets/panoptic-deeplab/configs/panoptic_deeplab_R101_os32_cityscapes.yaml \
--output-dir <path-to-output-directory> \
--checkpoint <path-to-downloaded-model>/panoptic_deeplab_R101_os32_cityscapes.pth
The training configuration for the model reported in the paper can be found in the `configs` folder.
Generally, all trainings are run on a single Nvidia A40 GPU with 48GB of memory.
For faster convergence and slightly better results, we use the pretrained model from BehindTheScenes
as a backbone from which we start our training. To download the backbone, please run:
./download_backbone.sh
KITTI-360
python train.py -cn exp_kitti_360
You can download our pretrained model from here.
We provide a script to run our pretrained models with custom data. The script can be found under `scripts/images/gen_img_custom.py` and takes the following flags:

- `--img <path>` / `-i <path>`: Path to the input image. The image will be resized to match the model's default resolution.
- `--plot` / `-p`: Plot outputs instead of saving them.
- `--model <path>` / `-m <path>`: Path to the model you want to use.

`media/example/` contains two example images. Note that we use the default projection matrices for the respective datasets to compute the density profiles (bird's-eye views). Therefore, if your custom data comes from a camera with different intrinsics, the output profiles might be skewed.
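For background, the sketch below shows how a pinhole intrinsics matrix is typically normalized by the image size, which is the kind of quantity such a projection depends on. Whether and how `gen_img_custom.py` can be pointed at a custom matrix is not covered here; the example numbers are made up (KITTI-360-like), so treat this purely as an illustration.

```python
# Illustrative only: normalizing pinhole intrinsics by the image size.
# How (or whether) gen_img_custom.py accepts a custom matrix is not shown here.
import numpy as np


def normalized_intrinsics(fx, fy, cx, cy, width, height):
    """Return the 3x3 intrinsics with focal lengths and principal point divided by image size."""
    return np.array([
        [fx / width, 0.0,         cx / width],
        [0.0,        fy / height, cy / height],
        [0.0,        0.0,         1.0],
    ])


# Example with hypothetical intrinsics for a 1408x376 perspective image.
K_norm = normalized_intrinsics(fx=552.55, fy=552.55, cx=682.05, cy=238.77, width=1408, height=376)
print(K_norm)
```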
# Plot outputs
python scripts/images/gen_img_custom.py --img media/example/0000.png --model /out/kitti_360/<model-name> --plot
# Save outputs to disk
python scripts/images/gen_img_custom.py --img media/example/0000.png --model /out/kitti_360/<model-name>
We provide not only a way to evaluate our method (S4C) on the SSCBench KITTI-360 dataset, but also a way to easily evaluate and compare other methods. For this, you only need the predictions on the test set (sequence 09) saved as `<frame_id>.npy` files in a folder.
In addition, we provide the predictions for LMSCNet, SSCNet, and MonoScene.
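If you want to benchmark your own method this way, each test frame's prediction just needs to be dumped as a single NumPy array. The sketch below is a hypothetical example; the voxel grid shape (256, 256, 32), the integer label convention, and the zero-padded file names are assumptions that must match the SSCBench KITTI-360 ground truth you evaluate against.

```python
# Hypothetical example of dumping per-frame predictions as <frame_id>.npy files.
# Grid shape, dtype, and zero-padding of the frame id are assumptions.
import numpy as np
from pathlib import Path

output_dir = Path("outputs/my_method")
output_dir.mkdir(parents=True, exist_ok=True)


def save_prediction(frame_id: int, voxel_labels: np.ndarray) -> None:
    """Store one semantic voxel grid, e.g. shape (256, 256, 32) of integer class IDs."""
    np.save(output_dir / f"{frame_id:010d}.npy", voxel_labels)


# Dummy prediction for illustration.
save_prediction(42, np.zeros((256, 256, 32), dtype=np.uint8))
```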
To evaluate our model on the SSCBench KITTI-360 dataset, we need additional data:
We require the SSCBench KITTI-360 dataset, which can be downloaded from here. The folder structure you will have to link to looks like:
calibration data_2d_raw preprocess
We also need the preprocessed (voxelized) ground truth that belongs to the KITTI-360 SSCBench data. The preprocessed data for KITTI-360 in the GitHub repo was incorrectly generated (see here). Thus, we provide our validated preprocessed ground truth for download here.
The folder structure you will have to link to looks like:
2013_05_28_drive_0000_sync 2013_05_28_drive_0002_sync 2013_05_28_drive_0004_sync 2013_05_28_drive_0006_sync 2013_05_28_drive_0009_sync
2013_05_28_drive_0001_sync 2013_05_28_drive_0003_sync 2013_05_28_drive_0005_sync 2013_05_28_drive_0007_sync 2013_05_28_drive_0010_sync
You can now run the evaluation script found at `scripts/benchmarks/sscbench/evaluate_model_sscbench.py` by running:
python evaluate_model_sscbench.py \
-ssc <path-to-kitti_360-sscbench-dataset> \
-vgt <path-to-preprocessed-voxel-ground-truth> \
-cp <path-to-model-checkpoint> \
-f
You can download the predictions for LMSCNet, SSCNet and MonoScene here.
You can now run the evaluation script found at `scripts/benchmarks/sscbench/evaluate_saved_outputs.py` by running:
python evaluate_saved_outputs.py \
  -t <path-to-preprocessed-voxel-ground-truth>/2013_05_28_drive_0009_sync \
  -o <path-to-saved-outputs>
Note that both `-o` and `-t` should point to folders that are filled with files of the form `<frame_id>.npy`.
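For intuition, the core of such an evaluation is a per-class intersection-over-union accumulated over the saved `<frame_id>.npy` files. The simplified sketch below ignores the masking of invalid/unobserved voxels that a faithful SSC evaluation has to perform, and the ignore label 255 is an assumption.

```python
# Simplified per-class IoU over folders of <frame_id>.npy files.
# Masking of invalid/unobserved voxels is omitted; the 255 "ignore" value is an assumption.
import numpy as np
from pathlib import Path


def per_class_iou(gt_dir: str, pred_dir: str, num_classes: int = 19, ignore_label: int = 255):
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for gt_file in sorted(Path(gt_dir).glob("*.npy")):
        gt = np.load(gt_file).astype(np.int64).ravel()
        pred = np.load(Path(pred_dir) / gt_file.name).astype(np.int64).ravel()
        valid = gt != ignore_label
        gt, pred = gt[valid], pred[valid]
        for c in range(num_classes):
            inter[c] += np.sum((gt == c) & (pred == c))
            union[c] += np.sum((gt == c) | (pred == c))
    iou = inter / np.maximum(union, 1)
    return iou, iou.mean()


# Example usage (paths are placeholders):
# iou, miou = per_class_iou("<voxel-gt>/2013_05_28_drive_0009_sync", "<saved-outputs>")
```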
Coming soon.
This work was supported by the ERC Advanced Grant SIMULACRON, the GNI project AI4Twinning and the Munich Center for Machine Learning. C. R. is supported by VisualAI EP/T028572/1 and ERC-UNION-CoG-101001212.
This repository is based on BehindTheScenes. We evaluate our models on the novel SSCBench KITTI-360 benchmark. We generate our pseudo 2D segmentation ground truth using the Panoptic Deeplab model zoo.
Here you can find the answers to commonly asked questions.
Q: What semantic labels are you using?
A: To generate the Pseudo-Ground-Truth Segmentation masks, we use models from the Panoptic Deeplab model zoo that predict the Cityscapes labels. To evaluate our method on the SSCBench KITTI-360 benchmark, we have mapped the semantic classes from Cityscapes and KITTI-360 (SSCBench) to a unified label standard. Most classes have a one-to-one correspondence. The detailed mapping can be found (and edited) in `label_maps.yaml`.
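As an illustration of how such a mapping can be applied to a segmentation array, here is a hypothetical sketch; the key name `cityscapes_to_unified` and the dict-of-integers format are assumptions, so check the actual structure of `label_maps.yaml` before reusing it.

```python
# Hypothetical application of a class-ID mapping loaded from label_maps.yaml.
# The key "cityscapes_to_unified" and the dict-of-ints format are assumptions.
import numpy as np
import yaml

with open("label_maps.yaml") as f:
    label_maps = yaml.safe_load(f)

cs_to_unified = label_maps["cityscapes_to_unified"]   # e.g. {7: 1, 8: 2, ...}


def remap(segmentation: np.ndarray, mapping: dict) -> np.ndarray:
    """Remap every class ID in a segmentation array via a lookup table."""
    lut = np.arange(max(int(segmentation.max()) + 1, max(mapping) + 1), dtype=np.int64)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[segmentation]


# Example: remap a dummy 2D pseudo-label map.
print(remap(np.array([[7, 8], [8, 7]]), {7: 1, 8: 2}))
```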