A collection of techniques that extend the functionality of Stable Video Diffusion in ComfyUI. Most of these were investigated for the purpose of extending context length, though they may be useful for other purposes as well. In some cases you can generate videos up to four times the model's original trained context length, though this will require some experimentation.
I've divided the functionality into two nodes: `SVDToolsPatcher` and `SVDToolsPatcherExperimental`. Techniques in `SVDToolsPatcherExperimental` are marked as 'Experimental' below and may change or be removed in the future; techniques in `SVDToolsPatcher` tend to give good results and probably won't change.
## Examples

Baseline (48 frames, with SVD, originally trained for 12 frames):
https://github.com/brianfitzgerald/svd_extender/assets/2797445/18ca3513-cf12-4598-84d3-00a3f5eda682
48 frames, with timestep scaled to 12 frames:
https://github.com/brianfitzgerald/svd_extender/assets/2797445/ebccc0a3-f071-40f7-9a10-62bf0118487c
48 frames, with timestep scaled to 12 frames and `temporal_attn_k_scale` of 0.7:
https://github.com/brianfitzgerald/svd_extender/assets/2797445/284e5ef7-ea30-4e47-9094-b51082f31867
## Timestep Scaling

Similar to YaRN for language models, this technique scales the position embeddings in the `SpatialVideoTransformer` layers to match a set embedding length. For example, if `position_embedding_frames` is set to 12 but the batch size is 42, the model will generate 42 frames of video, while the position embeddings are scaled to span only 12 frames. This allows the model to generate video with a longer context length than the position embeddings would normally allow (see the sketch after the parameter list).

- `scale_timestep_embedding`: Enable / disable position embedding scaling.
- `position_embedding_frames`: The number of frames to scale the position embeddings to. The model will be conditioned as if it were generating video with this many frames, but will actually generate video with the number of frames in the batch.
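To make the mechanism concrete, here is a minimal, hypothetical sketch of the idea, not the node's actual code: frame indices are compressed into the trained range before being embedded. `sinusoidal_embedding` below is a stand-in I've written for ComfyUI's internal timestep-embedding helper.

```python
import math

import torch

def scaled_frame_positions(num_frames: int, position_embedding_frames: int) -> torch.Tensor:
    # Compress frame indices 0..num_frames-1 into the range the model was
    # trained on, so e.g. 42 frames reuse the 12-frame embedding span.
    return torch.arange(num_frames, dtype=torch.float32) * (
        position_embedding_frames / num_frames
    )

def sinusoidal_embedding(positions: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal embedding, evaluated at (possibly fractional)
    # positions; a stand-in for ComfyUI's timestep-embedding helper.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = positions[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# A 42-frame batch conditioned as if it were only 12 frames long:
emb = sinusoidal_embedding(scaled_frame_positions(42, 12), dim=256)
```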
## Temporal Attention Key Scaling

Scales the keys only for temporal attention (sketched below). Consistently leads to less jittering at higher motion bucket IDs, especially with long context windows.

- `temporal_attn_k_scale`: Higher leads to more movement, lower leads to less movement. A value of 1.0 is the same as the default attention scaling.
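In sketch form (hypothetical code, not the node's actual patch), the change amounts to multiplying the keys before the usual scaled dot-product attention in the temporal layers:

```python
import torch
import torch.nn.functional as F

def temporal_attention_with_k_scale(q, k, v, k_scale: float = 1.0):
    # Scale only the keys; k_scale=1.0 reproduces stock attention, while
    # values below 1.0 flatten the attention logits, reducing jitter.
    return F.scaled_dot_product_attention(q, k * k_scale, v)

# (batch, heads, frames, dim_head) tensors for a 48-frame temporal pass:
q = torch.randn(1, 8, 48, 64)
k = torch.randn(1, 8, 48, 64)
v = torch.randn(1, 8, 48, 64)
out = temporal_attention_with_k_scale(q, k, v, k_scale=0.7)
```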
## Windowed Attention

Following the FreeNoise paper, this technique uses a windowed attention mechanism to compute cross-attention in each temporal layer over only a subset of the total latents (see the sketch after this list).

- `attn_window_size`: The size of the window to use for attention. This is the number of latents to attend to in each layer.
- `attn_window_stride`: The stride of the window. This is the number of latents to skip between windows, i.e. a stride of 6 with a window size of 12 will attend to latents 0-11, 6-17, 12-23, etc.
- `shuffle_windowed_noise`: Shuffles the initial batch of latents. This is a technique mentioned in the FreeNoise paper, and can sometimes help with inter-batch stability.
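A rough, hypothetical sketch of the windowing follows; the `attn` callable stands in for a temporal attention layer (the real node operates inside `SpatialVideoTransformer`), and overlapping windows are processed independently with their outputs averaged where they overlap:

```python
import torch

def attention_windows(num_frames: int, size: int, stride: int):
    # Frame-index windows, e.g. size=12, stride=6 over 24 frames
    # -> [0..11], [6..17], [12..23].
    return [
        list(range(start, start + size))
        for start in range(0, num_frames - size + 1, stride)
    ]

def windowed_temporal_attention(x: torch.Tensor, attn, size: int, stride: int):
    # x: (batch, frames, tokens, channels). Run attention per window and
    # average overlapping outputs, in the spirit of FreeNoise. Assumes the
    # windows cover every frame at least once.
    out = torch.zeros_like(x)
    counts = torch.zeros(x.shape[1], device=x.device)
    for idx in attention_windows(x.shape[1], size, stride):
        out[:, idx] += attn(x[:, idx])
        counts[idx] += 1
    return out / counts[None, :, None, None]
```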
## Temporal Attention Scaling

An implementation of Jonathan Fischoff's technique for scaling the attention in each temporal layer. This scales the self-attention values by `sqrt(scale / dim_head)` (sketched below).

- `temporal_attn_scale`: Higher leads to more movement, lower leads to less movement. A value of 1.0 is the same as the default attention scaling.
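As a hypothetical sketch (requiring PyTorch 2.1+ for the `scale` argument), this replaces the default `1 / sqrt(dim_head)` softmax scale with `sqrt(scale / dim_head)`, so `scale=1.0` is exactly the stock behavior:

```python
import math

import torch
import torch.nn.functional as F

def scaled_temporal_attention(q, k, v, scale: float = 1.0):
    # sqrt(scale / dim_head) equals the default 1/sqrt(dim_head) when
    # scale == 1.0; larger values sharpen attention across frames.
    dim_head = q.shape[-1]
    return F.scaled_dot_product_attention(q, k, v, scale=math.sqrt(scale / dim_head))
```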
## Installation

Simply download or `git clone` this repository into `ComfyUI/custom_nodes`. An example pipeline is provided in the `resources` folder in this repo.
## Notes

- `xformers` must be installed; this is temporary, until the `scale` parameter is added to the self-attention nodes in ComfyUI.
- `SVDToolsPatcher` nodes override the Comfy `comfy.sample.sample` function in order to unpatch the `forward` method of `SpatialVideoTransformer` (see the sketch below). This may cause issues with other custom sampler nodes. This is done because there's no way to patch the `forward` method of `SpatialVideoTransformer` using `ModelPatcher`; if that is added to Comfy in the future, this override will be removed.
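In outline, the override looks something like the following sketch (hypothetical code, assuming `SpatialVideoTransformer` is importable from `comfy.ldm.modules.attention` as in current ComfyUI, and only runnable inside a ComfyUI environment):

```python
import comfy.sample
from comfy.ldm.modules.attention import SpatialVideoTransformer

_original_sample = comfy.sample.sample
_original_forward = SpatialVideoTransformer.forward

def sample_with_unpatch(*args, **kwargs):
    try:
        return _original_sample(*args, **kwargs)
    finally:
        # Restore the stock forward so later samplers see an unpatched model.
        SpatialVideoTransformer.forward = _original_forward

comfy.sample.sample = sample_with_unpatch
```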
## Roadmap

Techniques I'm either currently working on implementing or plan to implement in the future: