Comfy-SVDTools

A collection of techniques that extend the functionality of Stable Video Diffusion in ComfyUI. Most of these were investigated for the purpose of extending context length, though they may be useful for other purposes as well. In some cases you can generate videos up to four times the model's original trained context length, though this will require some experimentation.

I've divided the functionality into two nodes: SVDToolsPatcher and SVDToolsPatcherExperimental. Techniques in SVDToolsPatcherExperimental are marked as 'Experimental' below and may change or be removed in the future. Techniques in SVDToolsPatcher tend to give good results and probably won't change.

Examples

Baseline (48 frames, with SVD - originally trained for 12 frames):

https://github.com/brianfitzgerald/svd_extender/assets/2797445/18ca3513-cf12-4598-84d3-00a3f5eda682

48 frames, with timestep scaled to 12 frames:

https://github.com/brianfitzgerald/svd_extender/assets/2797445/ebccc0a3-f071-40f7-9a10-62bf0118487c

48 frames, with timestep scaled to 12 frames, and attn_k_scale of 0.7:

https://github.com/brianfitzgerald/svd_extender/assets/2797445/284e5ef7-ea30-4e47-9094-b51082f31867

Techniques

Position Embedding Scaling

Similar to YaRN for language models, this technique scales the position embeddings in the SpatialVideoTransformer layers to match a set embedding length. For example, if position_embedding_frames is set to 12 but the batch size is 42, the model will generate a 42-frame video, with the position embeddings scaled so that those 42 positions span a 12-frame range. This allows the model to generate video with a longer context length than the position embeddings would normally allow.
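For intuition, here's a minimal sketch of the idea, assuming a standard sinusoidal embedding; the function name and signature are hypothetical, not the node's actual code:

```python
import math

import torch


def scaled_sinusoidal_embedding(
    num_frames: int, position_embedding_frames: int, dim: int
) -> torch.Tensor:
    # Rescale frame indices so that num_frames positions span the range
    # the model saw during training (position_embedding_frames positions).
    positions = torch.arange(num_frames, dtype=torch.float32)
    positions = positions * (position_embedding_frames / num_frames)

    # Standard transformer sinusoidal embedding over the rescaled indices.
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    )
    args = positions[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (num_frames, dim)
```

With num_frames=42 and position_embedding_frames=12, the 42 embeddings interpolate smoothly within the 12-frame range the model was trained on rather than extrapolating past it.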

Settings

Key Scaling

Scales only the keys in the temporal attention layers. Consistently reduces jittering at higher motion bucket IDs, especially with long context windows.
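A minimal sketch of where the scale is applied, assuming PyTorch's scaled_dot_product_attention; the function name and default are illustrative, not the patcher's actual code:

```python
import torch.nn.functional as F


def temporal_attention_with_key_scaling(q, k, v, attn_k_scale: float = 0.7):
    # Damping the key magnitudes flattens the attention distribution,
    # which in practice reduces frame-to-frame jitter over long windows.
    k = k * attn_k_scale
    return F.scaled_dot_product_attention(q, k, v)
```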

Settings

Attention Windowing (Experimental)

Following the FreeNoise paper, this technique uses a windowed attention mechanism so that each temporal layer computes attention only over a subset of the total latents at a time.
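One simple way to approximate windowed temporal attention is with a banded attention mask, as in the hypothetical sketch below, where q, k, and v are (batch, heads, frames, dim) tensors; FreeNoise itself also fuses overlapping windows, which this sketch omits:

```python
import torch
import torch.nn.functional as F


def windowed_temporal_attention(q, k, v, window_size: int = 16):
    frames = q.shape[2]
    # Boolean mask: frame i may only attend to frames j
    # with |i - j| < window_size // 2.
    idx = torch.arange(frames, device=q.device)
    mask = (idx[None, :] - idx[:, None]).abs() < window_size // 2  # (frames, frames)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```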

Settings

Temporal Attention Scale (Experimental)

An implementation of Jonathan Fischoff's technique for scaling the attention in each temporal layer. This scales the self-attention values by sqrt(scale / dim_head).
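Reading that as the softmax scale (an assumption on my part), a minimal sketch using the scale keyword of PyTorch's scaled_dot_product_attention (PyTorch 2.1+) might look like this; note that scale=1.0 recovers the standard 1 / sqrt(dim_head):

```python
import math

import torch.nn.functional as F


def temporal_attention_with_scale(q, k, v, scale: float = 1.0):
    # Replace the default softmax scale 1/sqrt(dim_head) with
    # sqrt(scale / dim_head); scale=1.0 reduces to standard attention.
    dim_head = q.shape[-1]
    return F.scaled_dot_product_attention(
        q, k, v, scale=math.sqrt(scale / dim_head)
    )
```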

Settings

How to Use

Simply download or git clone this repository into ComfyUI/custom_nodes. An example pipeline is provided in the resources folder of this repo.

Limitations

Up Next

Techniques I'm currently implementing or plan to implement in the future: