
🎞️ [NeurIPS'24] MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views
https://donydchen.github.io/mvsplat360/
MIT License

MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Yuedong Chen  ·  Chuanxia Zheng  ·  Haofei Xu  ·  Bohan Zhuang
Andrea Vedaldi  ·  Tat-Jen Cham  ·  Jianfei Cai

NeurIPS 2024

Paper | Project Page | Pretrained Models

https://github.com/user-attachments/assets/4cfa6654-5bb5-4f72-a264-6941bcf00bed

Installation

To get started, create a conda virtual environment using Python 3.10+ and install the requirements:

conda create -n mvsplat360 python=3.10
conda activate mvsplat360
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 xformers==0.0.25.post1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
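Since the wheels above are pinned to specific versions against the CUDA 11.8 index, it can be worth confirming that everything resolved before launching a long job. The following is an optional sketch; the helper name is ours, not part of this repository:

```python
# Optional environment sanity check: report the installed versions of the
# pinned packages without importing them (importing torch can be slow).
from importlib.metadata import version, PackageNotFoundError

def check_packages(names):
    """Return {package_name: installed version string, or None if missing}."""
    out = {}
    for name in names:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = None
    return out

if __name__ == "__main__":
    for pkg, ver in check_packages(["torch", "torchvision", "xformers"]).items():
        print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```

If any package reports `NOT INSTALLED`, re-run the `pip install` commands above inside the activated `mvsplat360` environment.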

Acquiring Datasets

This project mainly uses DL3DV and RealEstate10K datasets.

The dataset structure follows that of our previous work, MVSplat. You may refer to the script convert_dl3dv.py for converting the DL3DV-10K dataset to the torch chunks used in this project.

You might also want to check out DepthSplat's DATASETS.md, which provides detailed instructions on pre-processing DL3DV and RealEstate10K for use here (both projects share the same code base, inherited from pixelSplat).

A pre-processed tiny subset of DL3DV (containing 5 scenes) is provided here for quick reference. To use it, simply download and unzip it to datasets/dl3dv_tiny.

Running the Code

Evaluation

To render novel views, run the following:

# dl3dv; requires at least 22G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 \
wandb.name=dl3dv_480P_ctx5_tgt56 \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt

To evaluate quantitative performance, refer to compute_dl3dv_metrics.py.
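As a rough illustration of the kind of metric such a script reports, here is a standalone PSNR sketch. This is our own helper for reference only, not the repository's implementation (which may also compute other metrics):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val].

    Higher is better; identical images give infinity.
    """
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```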

To render videos from a pre-trained model, run the following:

# dl3dv; requires at least 38G VRAM
python -m src.main +experiment=dl3dv_mvsplat360_video \
wandb.name=dl3dv_480P_ctx5_tgt56_video \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt 

Training

# train mvsplat360; requires at least 80G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 dataset.roots=[datasets/dl3dv]

Camera Conventions

The camera intrinsic matrices are normalized: the first row is divided by the image width, and the second row by the image height. More details are provided in this comment.

The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, +Z pointing into the screen, i.e. the camera's viewing direction).
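As a concrete illustration of the intrinsics convention above, the normalization can be sketched as follows (the helper name is ours, not an API of this repository):

```python
import numpy as np

def normalize_intrinsics(K, width, height):
    """Normalize a 3x3 pinhole intrinsic matrix as described above:
    divide the first row by the image width and the second row by the
    image height, so fx, cx, fy, cy become resolution-independent."""
    K = np.asarray(K, dtype=np.float64).copy()
    K[0, :] /= width   # fx, skew, cx  ->  in units of image width
    K[1, :] /= height  # fy, cy        ->  in units of image height
    return K           # third row [0, 0, 1] is left unchanged
```

For example, with a 640x480 image and the principal point at the image center, the normalized cx and cy both become 0.5.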

BibTeX

@inproceedings{chen2024mvsplat360,
    title     = {MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views},
    author    = {Chen, Yuedong and Zheng, Chuanxia and Xu, Haofei and Zhuang, Bohan and Vedaldi, Andrea and Cham, Tat-Jen and Cai, Jianfei},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year      = {2024},
}

Acknowledgements

The project is based on MVSplat, pixelSplat, UniMatch and generative-models. Many thanks to these projects for their excellent contributions!