
# Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views (ECCV 2024)

Project Page | arXiv Paper | Video

Cascade-Zero123 progressively extracts 3D information from a single image via self-prompted nearby views. View-consistent images can be generated by constructing the 3D structure in a cascade manner.

Cascade-Zero123 can be divided into two parts. The left part is Base-0123, which takes a set of R and T values as input and generates the corresponding multi-view images. These output images are concatenated with the input condition image and its corresponding camera pose, forming a self-prompted input, denoted as a set of c(x_c, ∆R, ∆T), for the right part, Refiner-0123.
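
Schematically, the two stages can be read as follows (the functional notation below, apart from c(x_c, ∆R, ∆T), is our own shorthand rather than the paper's):

$$
\{\hat{x}_i\} = \text{Base-0123}\big(x_{\text{in}}, \{(R_i, T_i)\}\big), \qquad
x_{\text{target}} = \text{Refiner-0123}\big(c(x_{\text{in}}, \Delta R_0, \Delta T_0),\ \{c(\hat{x}_i, \Delta R_i, \Delta T_i)\}\big)
$$

where $x_{\text{in}}$ is the input condition image, $(R_i, T_i)$ are the self-prompted nearby poses fed to Base-0123, $\hat{x}_i$ are the views it generates, and ∆R, ∆T denote poses relative to the target view of Refiner-0123.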

## 🦾 Updates

## Requirements

PyTorch 2.0 is recommended for faster training and inference.

```commandline
conda env create -f environment.yml
```

or

```commandline
conda create -n cascade-zero123 python=3.9
conda activate cascade-zero123
pip install -r requirements.txt
```
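
Either way, a quick sanity check that the environment really picked up PyTorch 2.x and can see a GPU (plain PyTorch calls, nothing project-specific):

```commandline
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```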

Install xformers to enable memory-efficient attention:

```commandline
conda install xformers -c xformers
# or install from source
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```
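
xformers ships a small diagnostic module that reports the build and the attention kernels it found, which is an easy way to confirm the install:

```commandline
python -m xformers.info
```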

## Data Preparation
Download Zero123's Objaverse Renderings data:
```commandline
wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz
```
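
The archive then needs to be unpacked; the destination below is only an example and should match the `--train_data_dir` passed to training later (assuming the tarball extracts into a `views_release/` folder):

```commandline
mkdir -p /data/zero123
tar -xzf views_release.tar.gz -C /data/zero123
```
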
Configure Accelerate by running:

```commandline
accelerate config
```
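
If you prefer to skip the interactive questionnaire, the standard Accelerate CLI can also write a default config and show the environment it detected, which is handy for a quick single-node setup:

```commandline
accelerate config default   # write a default config file without the interactive prompts
accelerate env              # print the environment / config Accelerate will use
```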

## Training

Launch training:

Following the original Zero123, fp32 precision, gradient checkpointing, and EMA are turned on:

```commandline
accelerate launch train_cascade0123.py \
  --train_data_dir /data/zero123/views_release \
  --pretrained_model_name_or_path lambdalabs/sd-image-variations-diffusers \
  --train_batch_size 192 \
  --dataloader_num_workers 16 \
  --output_dir logs \
  --use_ema \
  --gradient_checkpointing \
  --mixed_precision no
```

bf16/fp16 mixed precision is also supported, for example:

```commandline
accelerate launch train_cascade0123.py \
  --train_data_dir /data/zero123/views_release \
  --pretrained_model_name_or_path lambdalabs/sd-image-variations-diffusers \
  --train_batch_size 192 \
  --dataloader_num_workers 16 \
  --output_dir logs \
  --use_ema \
  --gradient_checkpointing \
  --mixed_precision bf16
```
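
The distributed setup normally comes from `accelerate config`, but `accelerate launch` also accepts its own override flags if you want to pin the GPU count for a single run; a sketch with an illustrative process count:

```commandline
# Override the saved Accelerate config for this run (8 GPUs here is illustrative)
accelerate launch --multi_gpu --num_processes 8 train_cascade0123.py \
  --train_data_dir /data/zero123/views_release \
  --pretrained_model_name_or_path lambdalabs/sd-image-variations-diffusers \
  --train_batch_size 192 \
  --dataloader_num_workers 16 \
  --output_dir logs \
  --use_ema \
  --gradient_checkpointing \
  --mixed_precision bf16
```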

For monitoring training progress, we recommend wandb for its simplicity and powerful features.

```commandline
wandb login
```
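
If you want runs grouped under a dedicated wandb project, the standard `WANDB_PROJECT` environment variable works; whether training metrics are actually reported to wandb depends on how `train_cascade0123.py` sets up its logger, so check its `--help` output first.

```commandline
export WANDB_PROJECT=cascade-zero123   # project name is only an example
```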

## Acknowledgement

This repository is based on the original Zero-1-to-3 and its diffusers implementation zero123-hf. Thanks for their awesome work.

## Citation

If you find this repository or the paper helpful in your research, please consider citing it and giving us a ⭐:

```bibtex
@article{Cascadezero123,
  author  = {Yabo Chen and Jiemin Fang and Yuyang Huang and Taoran Yi and Xiaopeng Zhang and Lingxi Xie and Xinggang Wang and Wenrui Dai and Hongkai Xiong and Qi Tian},
  title   = {Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views},
  year    = {2023},
  journal = {arXiv preprint arXiv:2312.04424}
}
```
## Coming Soon