
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation #52

Open · Aidenzich opened this issue 4 months ago

Aidenzich commented 4 months ago
| Symbol | Description | Dimensions |
|---|---|---|
| $B$ | Batch size | Number of images in a batch |
| $N$ | Number of tokens (features) per image | Number of tokens per image |
| $C$ | Channel number | Number of channels per token |
| $\mathbf{I} \in \mathbb{R}^{B \times N \times C}$ | Batch of image features | Tensor of shape $(B, N, C)$ |
| $S_i$ | Tokens randomly sampled from the other images in the batch | Subset of tokens from the batch |
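A quick shape check of these symbols (a minimal sketch; the concrete values of `B`, `N`, and `C` are illustrative, not taken from the paper):

```python
import torch

# Illustrative sizes only; the real N and C depend on the diffusion backbone.
B, N, C = 4, 1024, 320

I = torch.randn(B, N, C)                  # batch of image features, shape (B, N, C)

# S_0: tokens randomly sampled from the *other* images in the batch (for i = 0)
others = I[1:].reshape(-1, C)             # tokens of images 1..B-1, shape ((B-1)*N, C)
S_0 = others[torch.randperm(others.shape[0])[:N // 2]]

print(I.shape, S_0.shape)                 # torch.Size([4, 1024, 320]) torch.Size([512, 320])
```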

An interesting paper that provides an intuitive understanding of how adjusting the model architecture can alter the physical meaning of the output.

Aidenzich commented 4 months ago

Motivation and Problem Statement from the Paper

The paper addresses the challenges and limitations of existing diffusion-based generative models, specifically focusing on generating consistent content across a series of images or videos. Here's a detailed look at the motivation and the problems the paper aims to solve:

Motivation

  1. Self-Attention and Consistency:

    • Self-attention is critical for modeling the structure of generated visual content. The main motivation is to use reference images to guide self-attention calculations, significantly improving consistency between generated images without requiring model training or fine-tuning. This idea leads to the proposal of Consistent Self-Attention, which can be inserted into diffusion models to replace the original self-attention in a zero-shot manner.
  2. Limitations of Existing Methods:

    • Existing methods like IP-Adapter and identity preservation techniques (InstantID, PhotoMaker) have their limitations. IP-Adapter reduces text controllability due to strong guidance from reference images. Identity preservation methods focus on maintaining identity but often fail to ensure attire and scenario consistency. This motivates the need for a method that can maintain both identity and attire consistency while maximizing text controllability.
  3. Lightweight and Zero-Shot Solutions:

    • The paper seeks to develop a lightweight solution with minimal data and computational cost, ideally operating in a zero-shot manner. This approach contrasts with traditional methods requiring extensive computational resources and data for training temporal modules.

Problems to Solve

  1. Subject Consistency in Generated Images and Videos:

    • The primary challenge is generating images and videos with consistent characters in terms of both identity and attire. This consistency is crucial for storytelling applications where the same character appears across multiple scenes or frames.
  2. Maintaining Text Controllability:

    • Ensuring that generated content adheres closely to text prompts while maintaining visual consistency is another significant problem. The method must allow for high text controllability without compromising the visual coherence of the generated content.
  3. Efficient Generation:

    • Developing a method that can efficiently generate long image sequences or videos with consistent subjects, capable of handling large movements and transitions smoothly, is essential. This involves predicting transitions in semantic spaces rather than just image latent spaces to achieve more stable results.

Proposed Solution

The paper proposes the following methods to address these challenges:

  1. Consistent Self-Attention:

    • A training-free, pluggable attention module designed to maintain character consistency across a sequence of images. It incorporates sampled reference tokens from other images in the batch, ensuring no extra training is required.
  2. Semantic Motion Predictor:

    • A novel motion prediction module that predicts transitions between two images in the semantic space, generating stable long-range video frames and handling large character movements better than existing methods (a rough sketch of the idea follows after this list).
  3. StoryDiffusion Framework:

    • Combining Consistent Self-Attention and Semantic Motion Predictor to generate long image sequences or videos based on text prompts, ensuring high consistency and smooth transitions.
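As a thought experiment, here is a rough sketch of how a Semantic Motion Predictor could be wired up: encode the start and end frames into a semantic space, then let a small transformer predict the embeddings of the in-between frames. The class name, dimensions, and use of learnable per-frame queries are my assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SemanticMotionPredictorSketch(nn.Module):
    """Hypothetical sketch: predict the semantic embeddings of `num_frames`
    in-between frames from the embeddings of a start and an end frame."""

    def __init__(self, dim=768, num_frames=16, depth=4, heads=8):
        super().__init__()
        # one learnable query per intermediate frame to be predicted
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, start_emb, end_emb):
        # start_emb, end_emb: (B, dim) semantic embeddings of the two key frames
        B = start_emb.shape[0]
        queries = self.frame_queries.unsqueeze(0).expand(B, -1, -1)            # (B, L, dim)
        # let the per-frame queries attend to both endpoint embeddings
        seq = torch.cat([start_emb.unsqueeze(1), queries, end_emb.unsqueeze(1)], dim=1)
        out = self.temporal_transformer(seq)                                   # (B, L + 2, dim)
        return out[:, 1:-1, :]   # predicted embeddings for the L in-between frames
```

In the actual framework these predicted embeddings would still have to condition a video diffusion decoder that renders the frames; that part is not sketched here.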
Aidenzich commented 4 months ago

Source of Sample Tokens

The sample tokens in the Consistent Self-Attention mechanism are taken from other images within the same batch, not generated anew. Here’s a detailed explanation based on the paper:

  1. Sampling from Batch:

    • The Consistent Self-Attention mechanism samples tokens $S_i$ from other image features in the batch. This is done to ensure subject consistency across images within the batch.
    • The formula used for sampling tokens is: $S_i = \text{RandSample}(I_1, I_2, \ldots, I_{i-1}, I_{i+1}, \ldots, I_{B-1}, I_B)$
    • where $\text{RandSample}$ denotes the random sampling function and $I$ represents the image features in the batch.
  2. Process Explanation:

    • For each image feature $I_i$ in the batch, tokens are randomly sampled from the features of other images in the batch.
    • After sampling, these tokens $S_i$ are paired with the image feature $I_i$ to form a new set of tokens $P_i$.
    • Linear projections are then performed on $P_i$ to generate new key $K_{P_i}$ and value $V_{P_i}$ matrices for the Consistent Self-Attention.
    • The original query $Q_i$ remains unchanged, and the self-attention is computed as $O_i = \text{Attention}(Q_i, K_{P_i}, V_{P_i})$ (see the minimal sketch after this list).
  3. Maintaining Consistency:

    • This method facilitates interactions among the features of different images, promoting consistency of characters, faces, and attires during the generation process.
    • The process is training-free and can efficiently generate subject-consistent images without additional training.
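To make the steps above concrete, here is a minimal single-head sketch of the sampling-plus-attention computation. The `to_q`/`to_k`/`to_v` projection names are assumptions, standing in for the existing linear projections of the self-attention layer being replaced; multi-head splitting and other implementation details are omitted:

```python
import torch
import torch.nn as nn

def consistent_self_attention(I, to_q, to_k, to_v, sample_rate=0.5):
    """Single-head sketch over a batch of image features I of shape (B, N, C).
    to_q / to_k / to_v are the frozen linear projections of the original
    self-attention layer, reused as-is, so nothing new is trained."""
    B, N, C = I.shape
    num_sampled = int(N * sample_rate)
    outputs = []

    for i in range(B):
        # RandSample: pool the tokens of every *other* image, keep a random subset S_i
        others = torch.cat([I[j] for j in range(B) if j != i], dim=0)     # ((B-1)*N, C)
        S_i = others[torch.randperm(others.shape[0])[:num_sampled]]       # (num_sampled, C)

        # P_i pairs the image's own tokens with the sampled reference tokens
        P_i = torch.cat([I[i], S_i], dim=0)                               # (N + num_sampled, C)

        # queries from the original tokens only; keys/values from the paired set
        Q_i, K_Pi, V_Pi = to_q(I[i]), to_k(P_i), to_v(P_i)

        attn = (Q_i @ K_Pi.transpose(-2, -1)) / (C ** 0.5)                # (N, N + num_sampled)
        O_i = attn.softmax(dim=-1) @ V_Pi                                 # (N, C)
        outputs.append(O_i)

    return torch.stack(outputs, dim=0)                                    # (B, N, C)


# Usage with stand-in projections (in a real diffusion UNet these come from the layer itself)
B, N, C = 4, 1024, 320
to_q, to_k, to_v = (nn.Linear(C, C, bias=False) for _ in range(3))
O = consistent_self_attention(torch.randn(B, N, C), to_q, to_k, to_v)
print(O.shape)   # torch.Size([4, 1024, 320])
```

Because the queries come only from image $i$ itself while the keys and values also see tokens from the rest of the batch, each image can attend to the shared subject in the other images, which is what pushes the batch toward a consistent character without any fine-tuning.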