cocktailpeanut opened 1 week ago
The peak is in the VAE, not the transformer. The transformer usually consumes 34 GB of video memory, while the VAE peak can reach 68 GB (at 1360 × 720).
I'd imagine something similar to the Mochi spatial tiled VAE should be possible, no?
Yes, I will add it to the diffusers version next week; it will use tiled VAE / sliced VAE decoding and model CPU offload.
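For reference, here is a minimal sketch of what those memory savers look like with the existing diffusers CogVideoX pipeline. The `THUDM/CogVideoX1.5-5B-I2V` model ID is an assumption (the 1.5 weights are not in diffusers yet); the same calls work today with the 1.0 checkpoints such as `THUDM/CogVideoX-5b-I2V`:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Hypothetical model ID -- the 1.5 checkpoints have not landed in diffusers yet.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)

# Keep submodules on CPU and move each to GPU only while it runs.
pipe.enable_model_cpu_offload()
# Decode latents in spatial tiles and in slices along the batch dimension,
# which is what caps the VAE peak mentioned above.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

image = load_image("input.jpg")
video = pipe(image=image, prompt="a panda playing guitar", num_frames=49).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

With tiling enabled, the VAE decodes overlapping spatial tiles and blends them, so peak memory scales with the tile size rather than the full 1360 × 720 frame.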
Is CogVideoX 1.5 a different architecture than 1.0? If not, can't it be converted to diffusers directly?
Feature request
It would be helpful to have some info on what to expect when running inference with 1.5, especially with I2V. Since it's supposed to generate at any resolution, it would be nice to have stats on how much VRAM is required and how long generation takes at each resolution.
Motivation
I'd like to run this on a local PC and am trying to understand how feasible that is compared to the previous optimized version (which ran very well on low-end machines).
Your contribution
Not applicable