
ConsiStory: Training-Free Consistent Text-to-Image Generation [SIGGRAPH 2024]

[arXiv] [Project Website] [Consistory NVIDIA NIM]

ConsiStory: Training-Free Consistent Text-to-Image Generation
Yoad Tewel1,2, Omri Kaduri3, Rinon Gal1,2, Yoni Kasten1, Lior Wolf2, Gal Chechik1, Yuval Atzmon1
1NVIDIA, 2Tel Aviv University, 3Independent

Abstract:
Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

Description

This repo contains the official code for our ConsiStory paper.

Setup

To set up our environment, please run:

conda env create --file environment.yml

Usage

Run from Command Line

This command-line interface (CLI) generates a batch of images of a consistent subject, with a different setting for each image prompt. The CLI offers two run modes: batch and cached.
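The CLI invocations shown in the examples below can also be built programmatically, which is convenient when scripting many runs. The sketch below is illustrative: the flag names mirror the example commands in this README, but the helper function itself is hypothetical, not part of the repo's API.

```python
# Sketch: building consistory_CLI.py invocations as argv lists.
# Flag names are taken from the example commands in this README;
# the build_command helper is illustrative, not repo code.

def build_command(subject, concept_tokens, settings, out_dir="out", run_type=None):
    """Return the argv list for a consistory_CLI.py call."""
    cmd = ["python", "consistory_CLI.py"]
    if run_type is not None:  # batch mode is the default, so --run_type is optional
        cmd += ["--run_type", run_type]
    cmd += ["--subject", subject]
    cmd += ["--concept_token", *concept_tokens]
    cmd += ["--settings", *settings]
    cmd += ["--out_dir", out_dir]
    return cmd

batch = build_command("a cute dog", ["dog"],
                      ["sitting in the beach", "in the circus"])
cached = build_command("a cute dog", ["dog"],
                       ["sitting in the beach", "in the circus", "swimming in the sea"],
                       run_type="cached")
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell-quoting issues with multi-word settings.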

Basic Usage

To generate images, run the following commands with the desired parameters:

python consistory_CLI.py --subject "a cute dog" --concept_token "dog" --settings "sitting in the beach" "in the circus" --out_dir "out"

Parameters

- `--subject`: a description of the subject, e.g. "a cute dog".
- `--concept_token`: the word (or words) in the prompt that identify the subject; pass multiple tokens for multi-subject generation.
- `--settings`: one or more setting prompts, one per generated image.
- `--out_dir`: directory where the generated images are saved.
- `--run_type`: run mode, either "batch" (default) or "cached".

Example commands

  1. Batch generation:

    python consistory_CLI.py --subject "a cute dog" --concept_token "dog" --settings "sitting in the beach" "standing in the snow" "playing in the park" --out_dir "out"

    This command generates a batch of images of "a cute dog" in three settings, "sitting in the beach", "standing in the snow", and "playing in the park", and saves the output in the `out` directory. Note that the first two settings determine the subject's identity; changing any subsequent settings generates new images without altering the subject's appearance.

  2. Cached Anchor Generation:

    python consistory_CLI.py --run_type "cached" --subject "a cute dog" --concept_token "dog" --settings "sitting in the beach" "in the circus" "swimming in the sea" "standing on a boat" "in a pet food commercial" --out_dir "out"

    This command uses the cached mode. It generates images of "a cute dog" in the first two settings ("sitting in the beach", "in the circus") as anchor images, caching them for efficiency. Additional images are then generated in the subsequent settings, "swimming in the sea", "standing on a boat", and "in a pet food commercial".

  3. Multiple Consistent Subjects:

    python consistory_CLI.py --subject "a cute dog" --concept_token "dog" "hat" --settings "wearing a hat" "standing in the snow" "wearing a hat, sitting in the park" --out_dir "out"

    This command generates a batch of images with multiple consistent subjects, including both the dog and the hat.
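As the examples above note, the first two settings act as anchors that fix the subject's identity, while any remaining settings are rendered afterwards. A minimal sketch of that split (the helper name is hypothetical, not repo code):

```python
# Sketch: splitting settings into anchor prompts and extra prompts,
# mirroring the cached-mode behavior described above. The README states
# the first two settings define the subject's identity (anchor images);
# split_anchor_settings is illustrative, not part of the repo.

def split_anchor_settings(settings, n_anchors=2):
    """Split settings into (anchor prompts, extra prompts)."""
    if len(settings) < n_anchors:
        raise ValueError(f"need at least {n_anchors} settings to form anchors")
    return settings[:n_anchors], settings[n_anchors:]

anchors, extras = split_anchor_settings(
    ["sitting in the beach", "in the circus",
     "swimming in the sea", "standing on a boat"])
# anchors: the first two prompts (cached as anchor images)
# extras:  generated afterwards without altering the subject's appearance
```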

Run from Jupyter Notebook

See consistory_notebook.ipynb for example usage.

Tips and Tricks

Citation

If you make use of our work, please cite our paper:

@article{tewel2024training,
  title={Training-free consistent text-to-image generation},
  author={Tewel, Yoad and Kaduri, Omri and Gal, Rinon and Kasten, Yoni and Wolf, Lior and Chechik, Gal and Atzmon, Yuval},
  journal={ACM Transactions on Graphics (TOG)},
  volume={43},
  number={4},
  pages={1--18},
  year={2024},
  publisher={ACM New York, NY, USA}
}