Yuedong Chen · Chuanxia Zheng · Haofei Xu · Bohan Zhuang · Andrea Vedaldi · Tat-Jen Cham · Jianfei Cai
https://github.com/user-attachments/assets/4cfa6654-5bb5-4f72-a264-6941bcf00bed
To get started, create a conda virtual environment using Python 3.10+ and install the requirements:
conda create -n mvsplat360 python=3.10
conda activate mvsplat360
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 xformers==0.0.25.post1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
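After installing, it can be useful to confirm that the pinned packages actually landed in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` is hypothetical and not part of this repository, and the pins are copied from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
EXPECTED = {
    "torch": "2.2.2",
    "torchvision": "0.17.2",
    "torchaudio": "2.2.2",
    "xformers": "0.0.25.post1",
}


def check_pins(expected):
    """Return {package: problem} for every package that is missing or mismatched."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Wheels from the PyTorch index may carry a local suffix such as "+cu118".
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_pins(EXPECTED)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    if not issues:
        print("All pinned packages match.")
```

An empty result means every pinned package is present at the expected version; this does not check that CUDA itself is usable.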
This project mainly uses the DL3DV and RealEstate10K datasets.
The dataset structure aligns with our previous work, MVSplat. You may refer to the script convert_dl3dv.py for converting the DL3DV-10K dataset to the torch chunks used in this project.
You might also want to check out DepthSplat's DATASETS.md, which provides detailed instructions on pre-processing DL3DV and RealEstate10K for use here (both projects build on the pixelSplat code base).
A pre-processed tiny subset of DL3DV (containing 5 scenes) is provided here for quick reference. To use it, simply download it and unzip it to datasets/dl3dv_tiny.
To render novel views, get the pre-trained model dl3dv_480p.ckpt, save it to the checkpoints/ directory, and run the following:
# dl3dv; requires at least 22G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 \
wandb.name=dl3dv_480P_ctx5_tgt56 \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt
The rendered novel views are stored under outputs/test/{wandb.name}.
To evaluate the quantitative performance, refer to compute_dl3dv_metrics.py.
To render videos from a pre-trained model, run the following:
# dl3dv; requires at least 38G VRAM
python -m src.main +experiment=dl3dv_mvsplat360_video \
wandb.name=dl3dv_480P_ctx5_tgt56_video \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt
For training, place the pre-trained weights at checkpoints/re10k.ckpt and checkpoints/svd.safetensors, then run the following:
# train mvsplat360; requires at least 80G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 dataset.roots=[datasets/dl3dv]
To initialize training from the released model, append checkpointing.load=checkpoints/dl3dv_480p.ckpt and checkpointing.resume=false to the above command.
The camera intrinsic matrices are normalized (the first row is divided by the image width, and the second row is divided by the image height). More details are at this comment.
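The intrinsics normalization described above can be sketched in a few lines. This is an illustrative example only: the function name `normalize_intrinsics` and the sample camera values are hypothetical, and plain nested lists stand in for whatever array type your pipeline uses.

```python
def normalize_intrinsics(K, width, height):
    """Return a copy of the 3x3 pixel-space matrix K with row 0 divided by
    the image width and row 1 divided by the image height."""
    return [
        [v / width for v in K[0]],
        [v / height for v in K[1]],
        list(K[2]),
    ]


# Hypothetical pinhole camera for a 960x540 image (values chosen for illustration).
K_pixels = [
    [480.0, 0.0, 480.0],
    [0.0, 480.0, 270.0],
    [0.0, 0.0, 1.0],
]
K_norm = normalize_intrinsics(K_pixels, width=960, height=540)
```

With this normalization, a principal point at the image center maps to (0.5, 0.5) regardless of the image resolution.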
The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, +Z camera looks into the screen).
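If your poses come in a different convention, they need converting before use. The sketch below covers two common cases under stated assumptions: OpenGL-style camera-to-world matrices (+Y up, camera looking along -Z), and world-to-camera matrices that must be inverted. Both helper names are hypothetical and not part of this repository; matrices are 4x4 nested lists.

```python
def opengl_to_opencv_c2w(c2w):
    """Negate the camera's Y and Z axes (columns 1 and 2) of a 4x4
    camera-to-world matrix: OpenGL (+Y up, +Z backward) -> OpenCV (+Y down, +Z forward)."""
    return [[row[0], -row[1], -row[2], row[3]] for row in c2w]


def invert_rigid(w2c):
    """Invert a 4x4 rigid transform [R | t], returning [R^T | -R^T t]."""
    R = [row[:3] for row in w2c[:3]]
    t = [row[3] for row in w2c[:3]]
    # Transpose of the rotation block.
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    new_t = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# Hypothetical world-to-camera pose: 90 degree rotation about Z plus a translation.
w2c = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
c2w = invert_rigid(w2c)
```

The camera position (the last column of a camera-to-world matrix) is unchanged by the OpenGL-to-OpenCV flip; only the orientation of the camera axes changes.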
@inproceedings{chen2024mvsplat360,
title = {MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views},
author = {Chen, Yuedong and Zheng, Chuanxia and Xu, Haofei and Zhuang, Bohan and Vedaldi, Andrea and Cham, Tat-Jen and Cai, Jianfei},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024},
}
The project is based on MVSplat, pixelSplat, UniMatch and generative-models. Many thanks to these projects for their excellent contributions!