A collection of techniques that extend the functionality of Stable Video Diffusion in ComfyUI. Most of these were investigated for the purpose of extending context length, though they may be useful for other purposes as well. In some cases you can generate videos up to four times the model's original trained context length, though this will require some experimentation.
I've divided the functionality into two nodes: `SVDToolsPatcher` and `SVDToolsPatcherExperimental`. Techniques in `SVDToolsPatcherExperimental` are marked as 'Experimental' below and may change or be removed in the future; techniques in `SVDToolsPatcher` tend to give good results and probably won't change.
## Examples

Baseline (48 frames, with SVD, originally trained for 12 frames):
https://github.com/brianfitzgerald/svd_extender/assets/2797445/18ca3513-cf12-4598-84d3-00a3f5eda682
48 frames, with timestep scaled to 12 frames:
https://github.com/brianfitzgerald/svd_extender/assets/2797445/ebccc0a3-f071-40f7-9a10-62bf0118487c
48 frames, with timestep scaled to 12 frames and `temporal_attn_k_scale` of 0.7:
https://github.com/brianfitzgerald/svd_extender/assets/2797445/284e5ef7-ea30-4e47-9094-b51082f31867
## Timestep Scaling

Similar to YaRN for language models, this technique scales the position embeddings in the `SpatialVideoTransformer` layers to match a set embedding length. For example, if `position_embedding_frames` is set to 12 but the batch size is 42, the model will generate 42 frames of video, while the position embeddings are scaled to span only 12 frames. This allows the model to generate video with a longer context length than the position embeddings would normally allow (see the sketch after the parameter list).

- `scale_timestep_embedding`: Enable / disable position embedding scaling.
- `position_embedding_frames`: The number of frames to scale the position embeddings to. The model will be conditioned as if it were generating video with this many frames, but will actually generate video with the number of frames in the batch.
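To make the mechanism concrete, here is a minimal, hypothetical sketch of the idea, not the node's actual code: frame indices are compressed into the trained range before being embedded. `sinusoidal_embedding` below is a stand-in I've written for ComfyUI's internal timestep-embedding helper.

```python
import math

import torch

def scaled_frame_positions(num_frames: int, position_embedding_frames: int) -> torch.Tensor:
    # Compress frame indices 0..num_frames-1 into the range the model was
    # trained on, so e.g. 42 frames reuse the 12-frame embedding span.
    return torch.arange(num_frames, dtype=torch.float32) * (
        position_embedding_frames / num_frames
    )

def sinusoidal_embedding(positions: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal embedding, evaluated at (possibly fractional)
    # positions; a stand-in for ComfyUI's timestep-embedding helper.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = positions[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# A 42-frame batch conditioned as if it were only 12 frames long:
emb = sinusoidal_embedding(scaled_frame_positions(42, 12), dim=256)
```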
## Temporal Attention Key Scaling

Scales the keys only for temporal attention (sketched below). Consistently leads to less jittering at higher motion bucket IDs, especially with long context windows.

- `temporal_attn_k_scale`: Higher leads to more movement, lower leads to less movement. A value of 1.0 is the same as the default attention scaling.
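In sketch form (hypothetical code, not the node's actual patch), the change amounts to multiplying the keys before the usual scaled dot-product attention in the temporal layers:

```python
import torch
import torch.nn.functional as F

def temporal_attention_with_k_scale(q, k, v, k_scale: float = 1.0):
    # Scale only the keys; k_scale=1.0 reproduces stock attention, while
    # values below 1.0 flatten the attention logits, reducing jitter.
    return F.scaled_dot_product_attention(q, k * k_scale, v)

# (batch, heads, frames, dim_head) tensors for a 48-frame temporal pass:
q = torch.randn(1, 8, 48, 64)
k = torch.randn(1, 8, 48, 64)
v = torch.randn(1, 8, 48, 64)
out = temporal_attention_with_k_scale(q, k, v, k_scale=0.7)
```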
## Windowed Attention

Following the FreeNoise paper, this technique uses a windowed attention mechanism to compute cross-attention in each temporal layer over only a subset of the total latents (see the sketch after this list).

- `attn_window_size`: The size of the window to use for attention. This is the number of latents to attend to in each layer.
- `attn_window_stride`: The stride of the window. This is the number of latents to skip between windows, i.e. a stride of 6 with a window size of 12 will attend to latents 0-11, 6-17, 12-23, etc.
- `shuffle_windowed_noise`: Shuffles the initial batch of latents. This is a technique mentioned in the FreeNoise paper, and can sometimes help with inter-batch stability.
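A rough, hypothetical sketch of the windowing follows; the `attn` callable stands in for a temporal attention layer (the real node operates inside `SpatialVideoTransformer`), and overlapping windows are processed independently with their outputs averaged where they overlap:

```python
import torch

def attention_windows(num_frames: int, size: int, stride: int):
    # Frame-index windows, e.g. size=12, stride=6 over 24 frames
    # -> [0..11], [6..17], [12..23].
    return [
        list(range(start, start + size))
        for start in range(0, num_frames - size + 1, stride)
    ]

def windowed_temporal_attention(x: torch.Tensor, attn, size: int, stride: int):
    # x: (batch, frames, tokens, channels). Run attention per window and
    # average overlapping outputs, in the spirit of FreeNoise. Assumes the
    # windows cover every frame at least once.
    out = torch.zeros_like(x)
    counts = torch.zeros(x.shape[1], device=x.device)
    for idx in attention_windows(x.shape[1], size, stride):
        out[:, idx] += attn(x[:, idx])
        counts[idx] += 1
    return out / counts[None, :, None, None]
```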
## Temporal Attention Scaling

An implementation of Jonathan Fischoff's technique for scaling the attention in each temporal layer. This scales the self-attention values by `sqrt(scale / dim_head)` (sketched below).

- `temporal_attn_scale`: Higher leads to more movement, lower leads to less movement. A value of 1.0 is the same as the default attention scaling.
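As a hypothetical sketch (requiring PyTorch 2.1+ for the `scale` argument), this replaces the default `1 / sqrt(dim_head)` softmax scale with `sqrt(scale / dim_head)`, so `scale=1.0` is exactly the stock behavior:

```python
import math

import torch
import torch.nn.functional as F

def scaled_temporal_attention(q, k, v, scale: float = 1.0):
    # sqrt(scale / dim_head) equals the default 1/sqrt(dim_head) when
    # scale == 1.0; larger values sharpen attention across frames.
    dim_head = q.shape[-1]
    return F.scaled_dot_product_attention(q, k, v, scale=math.sqrt(scale / dim_head))
```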
## Installation

Simply download or `git clone` this repository into `ComfyUI/custom_nodes`. An example pipeline is provided in the `resources` folder in this repo.
## Notes

- `xformers` must be installed; this is temporary, until the `scale` parameter is added to the self-attention nodes in ComfyUI.
- `SVDToolsPatcher` nodes override the Comfy `comfy.sample.sample` function in order to unpatch the `forward` method of `SpatialVideoTransformer` (see the sketch below). This may cause issues with other custom sampler nodes. This is done because there's no way to patch the `forward` method of `SpatialVideoTransformer` using `ModelPatcher`; if that is added to Comfy in the future, this override will be removed.
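In outline, the override looks something like the following sketch (hypothetical code, assuming `SpatialVideoTransformer` is importable from `comfy.ldm.modules.attention` as in current ComfyUI, and only runnable inside a ComfyUI environment):

```python
import comfy.sample
from comfy.ldm.modules.attention import SpatialVideoTransformer

_original_sample = comfy.sample.sample
_original_forward = SpatialVideoTransformer.forward

def sample_with_unpatch(*args, **kwargs):
    try:
        return _original_sample(*args, **kwargs)
    finally:
        # Restore the stock forward so later samplers see an unpatched model.
        SpatialVideoTransformer.forward = _original_forward

comfy.sample.sample = sample_with_unpatch
```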
## Roadmap

Techniques I'm either currently working on implementing or plan to implement in the future: