HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training
BSD 2-Clause "Simplified" License

Video Generation via Tokens #5

Open ClashLuke opened 2 years ago

ClashLuke commented 2 years ago

If we tokenise frames of a video with a VQGAN, we can autoregressively predict the next token using our current language model. More specifically, using our current context of 2 million tokens, we could fit 2048 frames (~34 minutes at 1 FPS) with current state-of-the-art image quantisation models. This issue is about implementing such a model end-to-end and having a working demo.
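To make the idea concrete, here is a minimal, hypothetical JAX sketch of the decoding side: frames are already encoded into discrete VQGAN codes, the context is the flattened token stream, and the language model predicts the next frame's tokens one at a time. The names (`TOKENS_PER_FRAME`, `VOCAB_SIZE`, `model_apply`) and the stand-in model are illustrative assumptions, not the repo's actual API.

```python
# Hypothetical sketch: autoregressive next-token prediction over VQGAN-encoded frames.
# model_apply is a stand-in so the sketch runs; the real model would be the repo's
# long-context language model.
import jax
import jax.numpy as jnp

TOKENS_PER_FRAME = 1024  # sberbank VQGAN: 1024 tokens per 256x256 frame
VOCAB_SIZE = 8192        # illustrative codebook size

def model_apply(params, tokens):
    """Return next-token logits given the token context (toy stand-in model)."""
    emb = params["embed"][tokens]        # (seq, dim) embedding lookup
    ctx = emb.mean(axis=0)               # crude context summary
    return ctx @ params["unembed"]       # (vocab,) logits

def generate_tokens(params, context_tokens, num_tokens):
    """Greedily decode num_tokens new tokens (one frame = TOKENS_PER_FRAME)."""
    tokens = context_tokens
    for _ in range(num_tokens):
        logits = model_apply(params, tokens)
        next_token = jnp.argmax(logits)
        tokens = jnp.concatenate([tokens, next_token[None]])
    return tokens[-num_tokens:]

key = jax.random.PRNGKey(0)
dim = 64
params = {
    "embed": jax.random.normal(key, (VOCAB_SIZE, dim)),
    "unembed": jax.random.normal(key, (dim, VOCAB_SIZE)),
}
# Two already-encoded frames as context; decode a handful of tokens of the next frame.
context = jax.random.randint(key, (2 * TOKENS_PER_FRAME,), 0, VOCAB_SIZE)
next_frame_start = generate_tokens(params, context, num_tokens=16)
```

In the actual pipeline the decoded `TOKENS_PER_FRAME` codes would be fed back through the VQGAN decoder to produce the next frame's pixels.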

ClashLuke commented 2 years ago

I did the calculation above with Sberbank's VQGAN, which is impressive at reconstruction but needs 1024 tokens to encode one 256x256 image. Example reconstruction from a video in the dataset: [image]. Even taking it out of distribution doesn't cause the image quality to suffer dramatically: [image].

However, one of the main problems with this approach is that we encode roughly 20 frames per second. So, at 1 FPS, we would need one 3090-year to encode our 23-year dataset. Luckily, there are faster models such as RQ-VAE. According to their paper, RQ-GAN is 7.5x as fast as a comparable VQ-GAN: [figure]. Using this model would allow us to encode our dataset in ~45 3090-days ($450 on vast.ai), so we should explore it and see whether the image quality is comparable.

Additionally, it uses only 256 tokens to encode an image instead of 1024, meaning we could fit four times as many frames into a context. So, instead of using 34 minutes of video as one example, we could go up to 2h16min.
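A quick back-of-the-envelope check of these numbers, assuming "2 million tokens" means a 2^21-token context and the ~20 frames/s encoding throughput quoted above (all inputs here are assumptions, not measurements from this repo):

```python
# Rough sanity check of the context-budget and encoding-cost estimates above.
CONTEXT_TOKENS = 2 ** 21           # assumed "2 million token" context
SECONDS_PER_YEAR = 365 * 24 * 3600

# Context budget per example at 1 FPS
for name, tokens_per_frame in [("VQ-GAN", 1024), ("RQ-VAE", 256)]:
    frames = CONTEXT_TOKENS // tokens_per_frame
    print(f"{name}: {frames} frames ≈ {frames / 60:.1f} minutes of video")
# VQ-GAN: 2048 frames ≈ 34.1 minutes; RQ-VAE: 8192 frames ≈ 136.5 minutes (~2h16min)

# Encoding cost for a 23-year dataset at 1 FPS
dataset_frames = 23 * SECONDS_PER_YEAR
vqgan_days = dataset_frames / 20 / 86400   # ~20 frames/s on one 3090
rqvae_days = vqgan_days / 7.5              # RQ-GAN reported as ~7.5x faster
print(f"VQ-GAN: {vqgan_days:.0f} 3090-days, RQ-VAE: {rqvae_days:.0f} 3090-days")
# ~420 3090-days (roughly one 3090-year) vs. ~56 3090-days with the 7.5x speedup,
# the same ballpark as the ~45-day estimate above.
```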

Multimodal AI Art recommended we look at this model. He runs a blog and newsletter covering (multimodal) generative AI art at a higher level than what we're doing here. Reading through that accumulated material is likely worthwhile as a literature review.