This work presents PixWizard, a versatile image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form user instructions. [Paper]
| Resolution | PixWizard Parameters | Text Encoder | VAE Encoder | Prediction | Download URL |
|---|---|---|---|---|---|
| 512-768-1024 | 2B | Gemma-2B and CLIP-L-336 | SD-XL | Rectified Flow | 🤗 Hugging Face |
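
If you prefer to fetch the weights programmatically instead of through the web page, the following is a minimal sketch using `huggingface_hub`. The repo id and local directory below are assumptions for illustration; replace them with the repository actually linked in the table above.

```python
# Minimal download sketch; the repo id is a placeholder, not confirmed by this README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="AFeng-x/PixWizard",          # assumption: substitute the real Hub repo id
    local_dir="./checkpoints/PixWizard",  # assumption: any local path works
)
print(f"Checkpoint files downloaded to: {local_dir}")
```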
```bash
git clone https://github.com/AFeng-x/PixWizard.git
cd PixWizard
```
Before installation, ensure that you have a working `nvcc` (the NVIDIA CUDA compiler):
```bash
# The command should work and show the installed CUDA version (12.1 in our case).
nvcc --version
```
On some outdated distros (e.g., CentOS 7), you may also want to check that a sufficiently recent version of `gcc` is available:
```bash
# The command should work and show a version of at least 6.0.
# If not, consult distro-specific tutorials to obtain a newer version or build it manually.
gcc --version
```
```bash
# Create a new conda environment named 'PixWizard'
conda create -n PixWizard -y
# Activate the 'PixWizard' environment
conda activate PixWizard
# Install Python and PyTorch
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install required packages from 'requirements.txt'
pip install -r requirements.txt
# Install Flash-Attention
pip install flash-attn --no-build-isolation
```
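
After installation, a quick sanity check can confirm that PyTorch sees the GPU and that Flash-Attention imports cleanly. This is a minimal sketch; the exact versions printed on your machine may differ.

```python
# Environment sanity check: verify CUDA is visible to PyTorch
# and that flash-attn was built against a compatible toolchain.
import torch
import flash_attn

print("PyTorch version:", torch.__version__)          # expected 2.1.0 with the setup above
print("CUDA available:", torch.cuda.is_available())   # should print True on a GPU machine
print("Flash-Attention version:", flash_attn.__version__)
```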
To run inference, run the following command:

```bash
bash exps/inference_pixwizard.sh
```
For training:

1. Prepare data
2. Run training
If you find our project useful for your research and applications, please kindly cite using this BibTeX:
```bibtex
@article{lin2024pixwizard,
  title={PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions},
  author={Lin, Weifeng and Wei, Xinyu and Zhang, Renrui and Zhuo, Le and Zhao, Shitian and Huang, Siyuan and Xie, Junlin and Qiao, Yu and Gao, Peng and Li, Hongsheng},
  journal={arXiv preprint arXiv:2409.15278},
  year={2024}
}
```