PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model). We hope the open-source community will contribute to this project.
MIT License
11.22k stars, 999 forks

VAE adaptability to quantization is amazing! #337

Open NilanEkanayake opened 1 month ago

NilanEkanayake commented 1 month ago

Not an issue, just wanted to let you guys know that your VAE takes very well to quantization. Post-encoding, I quantized each channel to one of 16 values (4-bit), or even 8 values (3-bit) at the cost of a bit more visual loss, then merged the four channel codes into a single 16-bit value. That makes the video shape [time, height, width] instead of [channels (4), time, height, width], so 4x smaller. The result is nearly the same despite that.
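For anyone who wants to try this, here is a minimal sketch of the quantize-and-pack step described above. It is not the code used for the attached videos: it assumes uniform per-channel min/max quantization and NumPy arrays, and the names `quantize_and_pack` and `unpack_and_dequantize` are hypothetical.

```python
import numpy as np


def quantize_and_pack(latent, bits=4):
    """Quantize a [C, T, H, W] latent per channel and pack it into [T, H, W] uint16.

    Assumption: uniform quantization between each channel's min and max.
    """
    channels = latent.shape[0]
    assert channels * bits <= 16, "all channel codes must fit in 16 bits"
    levels = 2 ** bits

    # Per-channel range, shaped for broadcasting against [C, T, H, W].
    flat = latent.reshape(channels, -1)
    lo = flat.min(axis=1).reshape(channels, 1, 1, 1)
    hi = flat.max(axis=1).reshape(channels, 1, 1, 1)
    scale = np.where(hi > lo, (hi - lo) / (levels - 1), 1.0)

    # Map each value to an integer code in [0, levels - 1].
    codes = np.clip(np.rint((latent - lo) / scale), 0, levels - 1).astype(np.uint16)

    # Place each channel's code in its own bit field of a single 16-bit value.
    packed = np.zeros(latent.shape[1:], dtype=np.uint16)
    for ch in range(channels):
        packed |= codes[ch] << np.uint16(bits * ch)
    return packed, lo, scale


def unpack_and_dequantize(packed, lo, scale, bits=4, channels=4):
    """Invert quantize_and_pack, returning an approximate [C, T, H, W] latent."""
    mask = np.uint16(2 ** bits - 1)
    codes = np.stack(
        [(packed >> np.uint16(bits * ch)) & mask for ch in range(channels)]
    )
    return codes.astype(np.float32) * scale + lo
```

The packed array can be fed to `unpack_and_dequantize` and then to the VAE decoder to check the visual loss. With `bits=3` the four codes use only 12 of the 16 bits.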

This allows for doing things like training LLMs on the compressed representations (can fit 8fps, 3s, 128x128 videos into 2K tokens) for video understanding and generation.

https://github.com/user-attachments/assets/e2655584-0b18-48fe-895a-263967409463

https://github.com/user-attachments/assets/b1b2146e-811f-40d4-adf0-5cc511fbf7fb

https://github.com/user-attachments/assets/b4c94d84-feaf-447f-aaa9-9ecc2aeefe63

LinB203 commented 1 month ago

Great, but I think an 8×8 compression ratio still leaves too much redundancy for comprehension tasks. Maybe we need a 16×16 compression ratio.
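To put rough numbers on the two ratios, here is a small sketch of the token count per latent frame. It assumes the ratio is spatial, that each latent position becomes one token (as with the packed 16-bit values above), and that "2K tokens" means 2048; the helper names are hypothetical.

```python
def latent_tokens_per_frame(height, width, spatial_factor):
    """Tokens per latent frame if each latent position is one token (assumption)."""
    return (height // spatial_factor) * (width // spatial_factor)


def max_latent_frames(token_budget, tokens_per_frame):
    """How many whole latent frames fit in a given token budget."""
    return token_budget // tokens_per_frame


for factor in (8, 16):
    per_frame = latent_tokens_per_frame(128, 128, factor)
    frames = max_latent_frames(2048, per_frame)
    print(f"{factor}x{factor}: {per_frame} tokens/frame, {frames} latent frames in 2048 tokens")
# 8x8: 256 tokens/frame, 8 latent frames in 2048 tokens
# 16x16: 64 tokens/frame, 32 latent frames in 2048 tokens
```

Under these assumptions, going from 8×8 to 16×16 cuts the tokens per latent frame by 4x for a 128x128 video.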