How might EasyAnimate slice a 1080p video? More specifically, what is the frame interval at which the slicing happens? Extrapolating from the memory requirements given for resolutions below 1080p, my rough estimate is:
144 frames at 1920x1080: 64-80 GB?
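To make the question concrete, this is roughly what I imagine by slicing (a minimal sketch; the `(B, C, T, H, W)` layout, the `vae.encode()` call, and `slice_len` are my assumptions, not EasyAnimate's actual interface):

```python
import torch

def encode_in_temporal_slices(vae, frames, slice_len=8):
    """Encode a long video in temporal chunks so peak memory depends on
    slice_len rather than the full frame count.

    frames: (B, C, T, H, W) pixel tensor; slice_len is only a guess at
    the frame interval EasyAnimate actually uses.
    """
    latents = []
    for start in range(0, frames.shape[2], slice_len):
        chunk = frames[:, :, start:start + slice_len]
        with torch.no_grad():
            latents.append(vae.encode(chunk))  # assumed encode() signature
    # Concatenate the per-slice latents back along the temporal axis.
    return torch.cat(latents, dim=2)
```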
Is it possible to further lower the model's memory usage? What is the bottleneck here: the VAE or the DiT? Can we quantize them?
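For example, something as simple as the following would already reduce memory, assuming the pipeline exposes its DiT and VAE as plain modules; whether quality holds up after quantization is exactly what I am asking:

```python
import torch
import torch.nn as nn

def to_half(module: nn.Module) -> nn.Module:
    """Cast weights to fp16 -- roughly halves GPU memory for that module."""
    return module.to(dtype=torch.float16)

def quantize_linear_int8(module: nn.Module) -> nn.Module:
    """Weight-only int8 dynamic quantization of Linear layers (CPU inference only)."""
    return torch.ao.quantization.quantize_dynamic(
        module, {nn.Linear}, dtype=torch.qint8
    )
```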
Is it possible to run the model on multiple GPUs? Have you implemented something like `device_map` from accelerate to do model parallelism?
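For reference, with accelerate I mean something along these lines (a sketch assuming the DiT is a plain `nn.Module`; I understand EasyAnimate does not ship this):

```python
import torch.nn as nn
from accelerate import dispatch_model, infer_auto_device_map

def shard_across_gpus(transformer: nn.Module) -> nn.Module:
    """Split a large module (e.g. the DiT) across GPUs by memory budget.

    The 24GiB budgets are placeholders for whatever cards are available.
    """
    device_map = infer_auto_device_map(
        transformer, max_memory={0: "24GiB", 1: "24GiB"}
    )
    return dispatch_model(transformer, device_map=device_map)
```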
The Open-Sora-Plan v1.1 technical report says they had to reduce the number of 3D convolutions to handle longer videos during DiT training, which meant they also had to train the encoder (but not the decoder). Why doesn't EasyAnimate need to unfreeze the encoder, and how can it still train normally?
CV-VAE uses SD 2.1's VAE, which has a z=4 latent, and they report losing fine details; they plan to train on SD3's VAE, which has z=16, to solve this. Does EasyAnimate suffer from the same problem? How does it solve this?
For video captioning, what about using dense captions? For example, the ShareCaptioner model does a very good job on dense video captioning. AdaLN is only viable for a fixed set of classes, but since you are conditioning via cross-attention, shouldn't dense captions help in this case?
Also, since the VAE slices the video frames to encode them, is it possible to do frame interpolation? Image-to-video works; is middle/end-frame extension also possible? Or even connecting different videos?
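What I have in mind is the usual mask-style frame conditioning, roughly like this (purely illustrative; the shapes and helper are my assumptions, not anything EasyAnimate implements):

```python
import torch

def build_frame_condition(latents, known):
    """Build (mask, conditioned_latents) for keeping selected frames fixed.

    latents: (B, C, T, H, W) latent video.
    known: dict {frame_index: latent_frame of shape (B, C, H, W)}, e.g.
    {0: start, T - 1: end} for start/end-frame extension or interpolation.
    """
    mask = torch.zeros_like(latents[:, :1])  # 1 where a frame is given
    cond = torch.zeros_like(latents)
    for t, frame in known.items():
        mask[:, :, t] = 1.0
        cond[:, :, t] = frame
    return mask, cond
```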
For context, I want to train the full architecture, or parts of it, on animation data.
First, we recommend reading our paper on arXiv, especially the 'Slice VAE' section.
Our current, higher priority is to improve the quality of the generated videos: consistency, action continuity, prompt control, etc. Inference efficiency will be worked on afterwards.
We haven't tried model parallelism ourselves, but it is definitely feasible.
We first trained our own Slice VAE.
We have tried increasing the channel dimension, but the number of parameters grows significantly; this might be a way to improve performance.
An accurate and comprehensive dense captioner is very important; we experimented with many and eventually trained our own model.
This task should be feasible, but we haven't trained it yet.