HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance
conda create -n hifa python=3.9
pip install -r requirements.txt
(Assuming you are using torch 2.0 + cu117, install torch-scatter:)
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
Make sure torch is installed with CUDA support, i.e. torch.cuda.is_available() returns True.
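A quick sanity check (a minimal snippet, nothing project-specific):

import torch
print(torch.__version__)          # e.g. 2.0.0+cu117
print(torch.cuda.is_available())  # should print True before building the extensions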
By default, we use load to build the extensions at runtime. We also provide a setup.py to build each extension ahead of time:
# install all extension modules
bash scripts/install_ext.sh
# if you want to install manually, here is an example:
pip install ./raymarching # install to python path (you still need the raymarching/ folder, since this only installs the built extension.)
This is more convenient, since you won't have to rebuild the extensions at runtime.
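For reference, the runtime build path roughly amounts to JIT-compiling the CUDA sources with torch.utils.cpp_extension.load on first import; the sketch below is only an approximation, and the actual module name, source list, and compiler flags used in raymarching/ may differ:

from torch.utils.cpp_extension import load

# Compiles the C++/CUDA sources the first time the module is imported;
# later runs reuse the cached build directory.
_backend = load(
    name="_raymarching",  # hypothetical extension name
    sources=[
        "raymarching/src/raymarching.cu",  # assumed source layout
        "raymarching/src/bindings.cpp",
    ],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)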
CUDA_VISIBLE_DEVICES=0 python main.py --text "a baby bunny sitting on top of a stack of pancakes" --workspace trials_throne_sanity --dir_text --albedo --phi_range 0 120
For both image-to-3D and image-guided 3D generation below, you need to generate predicted views following the instructions in SyncDreamer: first remove the background, then generate 16 views. Copy the output 0.png into this project's folder and specify the file with --image_path. We provide some example images under raw_input and gt_images.
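If you need to strip the background yourself before running SyncDreamer, one common option is the rembg package (an assumption here, not a dependency of this repo); a minimal sketch with hypothetical paths:

from PIL import Image
from rembg import remove  # pip install rembg

img = Image.open("raw_input/teapot.png")  # hypothetical input image
out = remove(img)                         # RGBA output with the background removed
out.save("raw_input/no_bg/teapot/0.png")  # hypothetical output path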
After you get the predicted views from SyncDreamer, run image-to-3D generation:
CUDA_VISIBLE_DEVICES=0 python main.py --text "A toy grabber with dinosaur head" --learned_embeds_path "gt_images/dinosaur/learned_embeds.bin" --image_path "gt_images/dinosaur/0.png" --workspace "trials_dinosaur(textprompt)_imgto3d" --dir_text --albedo --gt_image_rate 0.5 --h 256 --w 256
For image-guided 3D generation:
CUDA_VISIBLE_DEVICES=0 python main.py --text "A toy grabber with dinosaur head" --learned_embeds_path "gt_images/dinosaur/learned_embeds.bin" --image_path "gt_images/dinosaur/0.png" --workspace "trials_dinosaur(textprompt)_imgguided" --dir_text --albedo --gt_image_rate 0.5 --h 256 --w 256 --anneal_gt 0.7
To use textual inversion, first learn the token embedding:
python textual-inversion/textual_inversion.py --output_dir="gt_images/teapot" --train_data_dir="raw_input/no_bg/teapot" --initializer_token="teapot" --placeholder_token="_teapot_placeholder_" --pretrained_model_name_or_path="SG161222/Realistic_Vision_V5.1_noVAE" --learnable_property="object" --resolution=256 --train_batch_size=1 --gradient_accumulation_steps=4 --max_train_steps=5000 --learning_rate=5.0e-4 --scale_lr --lr_scheduler="constant" --lr_warmup_steps=0 --use_augmentations
python main.py --text "a DSLR photo of <token>" --learned_embeds_path "gt_images/teapot/learned_embeds.bin" --image_path "gt_images/teapot/0.png" --workspace "trials_teapot_gtrate=0.5_v9" --dir_text --albedo --gt_image_rate 0.5 --h 256 --w 256
However, we do not find that textual inversion brings a visible benefit.
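For reference, learned_embeds.bin is typically a dict mapping the placeholder token to its learned embedding tensor (the usual diffusers textual-inversion output format); a minimal way to inspect it, assuming the teapot example above:

import torch

embeds = torch.load("gt_images/teapot/learned_embeds.bin", map_location="cpu")
for token, vec in embeds.items():
    print(token, vec.shape)  # placeholder token and its embedding shape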
Some notable additions compared to the paper:
@misc{zhu2023hifa,
      title={HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance},
      author={Junzhe Zhu and Peiye Zhuang},
      year={2023},
      eprint={2305.18766},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}