THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

VRAM requirement + Generation Speed for CogVideoX 1.5? #471

Open cocktailpeanut opened 1 week ago

cocktailpeanut commented 1 week ago

Feature request

I think it would be helpful to have some info on what to expect when running inference with 1.5. Especially for I2V, since it's supposed to generate at any resolution, it would be nice to have stats on how much VRAM is required and how long generation takes at each resolution.

Motivation

I'd like to run this on a local PC and am trying to understand how feasible that is compared to the previous optimized version (which ran very well on low-end machines).

Your contribution

Not applicable

zRzRzRzRzRzRzR commented 1 week ago

The peak is in the VAE part, not the transformer. The transformer part usually consumes about 34 GB of video memory, while the VAE peak can reach 68 GB (at 1360 × 720 resolution).

cchance27 commented 1 week ago

I'd imagine something similar to the Mochi spatial tiled VAE should be possible, no?

zRzRzRzRzRzRzR commented 1 week ago

Yes, I will add it to the diffusers version next week. It will use tiling VAE / slicing VAE and model CPU offload.

johnwick123f commented 1 week ago

Is CogVideoX 1.5 a different architecture than 1.0? If not, can't it be converted directly to diffusers?