CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization


Ruoyu Zhao¹, Mingrui Zhu¹, Shiyin Dong¹, Nannan Wang¹, Xinbo Gao²
¹Xidian University, ²Chongqing University of Posts and Telecommunications

Abstract:
We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing methods that emphasize word embedding learning or parameter fine-tuning, which can cause concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To quantify the results more accurately and with less bias, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalized concepts more faithfully and enables more robust editing.
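
The claim that concatenated Key/Value embeddings act as a residual on the original attention output can be checked numerically. The following is a minimal NumPy sketch of that general identity, not the repository's implementation: it uses a single attention head with no causal mask, and the names `delta_k`, `delta_v`, and `gate` are illustrative assumptions that do not come from the code base.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_tokens, n_extra, dim = 5, 2, 8
q = rng.normal(size=(n_tokens, dim))
k = rng.normal(size=(n_tokens, dim))
v = rng.normal(size=(n_tokens, dim))
# Hypothetical learnable embeddings appended along the token axis.
delta_k = rng.normal(size=(n_extra, dim))
delta_v = rng.normal(size=(n_extra, dim))

# Attention with the extra embeddings concatenated to Keys and Values.
out_cat = attention(q, np.concatenate([k, delta_k]), np.concatenate([v, delta_v]))

# The same output written as the original attention plus a gated residual.
scale = np.sqrt(dim)
mass_orig = np.exp(q @ k.T / scale).sum(axis=-1, keepdims=True)
mass_new = np.exp(q @ delta_k.T / scale).sum(axis=-1, keepdims=True)
gate = mass_new / (mass_orig + mass_new)  # softmax mass on the new slots
base = attention(q, k, v)
residual = gate * (attention(q, delta_k, delta_v) - base)
out_res = base + residual

print(np.allclose(out_cat, out_res))
```

Here `gate` is the share of softmax mass each query assigns to the appended slots, so the original output `base` is left intact and only shifted by `residual`.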

Description

This is the official repository for the paper "CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization".

Updates

01/01/2024 Code released! 🐣🐣🐣

Getting Started 🧨🧨🧨

Preparation

Pre-trained model: This implementation is based on Stable Diffusion v1.5 (SD v1.5). Place the checkpoint at "./models/sd/~.ckpt".
Environment: Please refer to environment.yaml.

Test 🚀

Please run

sh run.sh

or open the notebook ./prompt-to-prompt_sd_attention_map.ipynb

Train 🔥

Initialization: Set the base class (trigger) word used for initialization in "ldm/data/personalized.py".

Run:

python main.py --base configs/stable-diffusion/v1-finetune.yaml \
           -t \
           --actual_resume models/sd/v1-5.ckpt \
           -n cat \
           --gpus 0, \
           --data_root your/dataset/root
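
When training several concepts, it can be convenient to assemble the command above from a script. The sketch below is only an illustration: `build_train_command` is a hypothetical helper that is not part of this repository, and it uses only the flags that appear in the command above, with the same default values.

```python
import shlex

def build_train_command(run_name, data_root,
                        ckpt="models/sd/v1-5.ckpt", gpus="0,"):
    """Assemble the training invocation shown above as an argument list.

    Hypothetical helper; every flag is copied from the README command.
    """
    return [
        "python", "main.py",
        "--base", "configs/stable-diffusion/v1-finetune.yaml",
        "-t",
        "--actual_resume", ckpt,
        "-n", run_name,
        "--gpus", gpus,  # the trailing comma is kept exactly as in the README
        "--data_root", data_root,
    ]

cmd = build_train_command("cat", "your/dataset/root")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch one run per concept.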

Citation

If you use this code or ideas from our paper, please cite our paper:

@misc{zhao2023catversion,
  title={CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization},
  author={Ruoyu Zhao and Mingrui Zhu and Shiyin Dong and Nannan Wang and Xinbo Gao},
  year={2023},
  eprint={2311.14631},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgments

This code borrows from Textual Inversion and Transformers, and includes some Colab code snippets from prompt-to-prompt. Thanks to these open-source contributions! 👼

