
🎞️ [NeurIPS'24] MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views
https://donydchen.github.io/mvsplat360/
MIT License

MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Yuedong Chen  ·  Chuanxia Zheng  ·  Haofei Xu  ·  Bohan Zhuang
Andrea Vedaldi  ·  Tat-Jen Cham  ·  Jianfei Cai

NeurIPS 2024

Paper | Project Page | Pretrained Models

https://github.com/user-attachments/assets/4cfa6654-5bb5-4f72-a264-6941bcf00bed

Installation

To get started, create a conda virtual environment using Python 3.10+ and install the requirements:

conda create -n mvsplat360 python=3.10
conda activate mvsplat360
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 xformers==0.0.25.post1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
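Since the wheels above are pinned to specific versions against the CUDA 11.8 index, it can be worth confirming that everything resolved before launching a long job. The following is an optional sketch; the helper name is ours, not part of this repository:

```python
# Optional environment sanity check: report the installed versions of the
# pinned packages without importing them (importing torch can be slow).
from importlib.metadata import version, PackageNotFoundError

def check_packages(names):
    """Return {package_name: installed version string, or None if missing}."""
    out = {}
    for name in names:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = None
    return out

if __name__ == "__main__":
    for pkg, ver in check_packages(["torch", "torchvision", "xformers"]).items():
        print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```

If any package reports `NOT INSTALLED`, re-run the `pip install` commands above inside the activated `mvsplat360` environment.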

Acquiring Datasets

This project mainly uses DL3DV and RealEstate10K datasets.

The dataset structure follows that of our previous work, MVSplat. You may refer to the script convert_dl3dv.py for converting the DL3DV-10K dataset to the torch chunks used in this project.

You might also want to check out DepthSplat's DATASETS.md, which provides detailed instructions on pre-processing DL3DV and RealEstate10K for use here (both projects share the same code base, inherited from pixelSplat).

A pre-processed tiny subset of DL3DV (containing 5 scenes) is provided here for quick reference. To use it, simply download and unzip it to datasets/dl3dv_tiny.

Running the Code

Evaluation

To render novel views, run the following:

# dl3dv; requires at least 22G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 \
wandb.name=dl3dv_480P_ctx5_tgt56 \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt

To evaluate quantitative performance, refer to compute_dl3dv_metrics.py.
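As a rough illustration of the kind of metric such a script reports, here is a standalone PSNR sketch. This is our own helper for reference only, not the repository's implementation (which may also compute other metrics):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val].

    Higher is better; identical images give infinity.
    """
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```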

To render videos from a pre-trained model, run the following:

# dl3dv; requires at least 38G VRAM
python -m src.main +experiment=dl3dv_mvsplat360_video \
wandb.name=dl3dv_480P_ctx5_tgt56_video \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt 

Training

# train mvsplat360; requires at least 80G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 dataset.roots=[datasets/dl3dv]

Camera Conventions

The camera intrinsic matrices are normalized: the first row is divided by the image width, and the second row by the image height. More details are provided in this comment.

The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, +Z pointing into the screen, i.e. the camera's viewing direction).
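As a concrete illustration of the intrinsics convention above, the normalization can be sketched as follows (the helper name is ours, not an API of this repository):

```python
import numpy as np

def normalize_intrinsics(K, width, height):
    """Normalize a 3x3 pinhole intrinsic matrix as described above:
    divide the first row by the image width and the second row by the
    image height, so fx, cx, fy, cy become resolution-independent."""
    K = np.asarray(K, dtype=np.float64).copy()
    K[0, :] /= width   # fx, skew, cx  ->  in units of image width
    K[1, :] /= height  # fy, cy        ->  in units of image height
    return K           # third row [0, 0, 1] is left unchanged
```

For example, with a 640x480 image and the principal point at the image center, the normalized cx and cy both become 0.5.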

BibTeX

@inproceedings{chen2024mvsplat360,
    title     = {MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views},
    author    = {Chen, Yuedong and Zheng, Chuanxia and Xu, Haofei and Zhuang, Bohan and Vedaldi, Andrea and Cham, Tat-Jen and Cai, Jianfei},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year      = {2024},
}

Acknowledgements

The project is based on MVSplat, pixelSplat, UniMatch and generative-models. Many thanks to these projects for their excellent contributions!