VAE adaptability to quantization is amazing!

Not an issue, just wanted to let you guys know that your VAE takes very well to quantization. After encoding, I quantize each latent channel to one of 16 values (4-bit), or even 8 values (3-bit) at the cost of a bit more visual loss, then pack the 4 channels into a single 16-bit value. That makes the video shape [time, height, width] instead of [channels (4), time, height, width], so 4x fewer elements. The reconstruction is nearly the same despite that.
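For anyone who wants to try this, here is a rough sketch of the quantize-and-pack step in NumPy. It is not my exact code: the uniform per-channel min/max scaling and the function names `quantize_and_pack` / `unpack_and_dequantize` are assumptions made for illustration, and the latent is assumed to be a float array of shape [4, time, height, width] already produced by the VAE encoder.

```python
import numpy as np

def quantize_and_pack(latent, bits=4):
    """Quantize each channel to 2**bits levels and pack all channels into one uint16.

    latent: float array [channels, time, height, width] (channels * bits must be <= 16).
    Returns the packed uint16 array [time, height, width] plus the per-channel
    (lo, hi) ranges needed to undo the scaling. Uniform min/max scaling is an assumption.
    """
    channels = latent.shape[0]
    assert channels * bits <= 16, "packed codes must fit in 16 bits"
    levels = (1 << bits) - 1
    lo = latent.min(axis=(1, 2, 3), keepdims=True)
    hi = latent.max(axis=(1, 2, 3), keepdims=True)
    scale = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat channels
    codes = np.rint((latent - lo) / scale * levels).astype(np.uint16)
    packed = np.zeros(latent.shape[1:], dtype=np.uint16)
    for c in range(channels):
        packed |= codes[c] << np.uint16(c * bits)  # channel c occupies bits [c*bits, (c+1)*bits)
    return packed, lo, hi

def unpack_and_dequantize(packed, lo, hi, bits=4):
    """Inverse of quantize_and_pack: returns a float array [channels, time, height, width]."""
    channels = lo.shape[0]
    levels = (1 << bits) - 1
    mask = np.uint16(levels)
    codes = np.stack([(packed >> np.uint16(c * bits)) & mask for c in range(channels)])
    return codes.astype(np.float32) / levels * (hi - lo) + lo
```

The unpacked array can then be fed to the VAE decoder as usual. Passing `bits=3` gives the 8-level variant, which uses only 12 of the 16 bits.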

This allows for doing things like training LLMs on the compressed representations (an 8fps, 3s, 128x128 video fits into 2K tokens) for video understanding and generation.

https://github.com/user-attachments/assets/e2655584-0b18-48fe-895a-263967409463

https://github.com/user-attachments/assets/b1b2146e-811f-40d4-adf0-5cc511fbf7fb

https://github.com/user-attachments/assets/b4c94d84-feaf-447f-aaa9-9ecc2aeefe63

PKU-YuanGroup / Open-Sora-Plan
