HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training
BSD 2-Clause "Simplified" License

Video Generation via Tokens #5

Open ClashLuke opened 2 years ago

ClashLuke commented 2 years ago

If we tokenise frames of a video with a VQGAN, we can autoregressively predict the next token using our current language model. More specifically, using our current context of 2 million tokens, we could fit 2048 frames (~34 minutes at 1 FPS) with current state-of-the-art image quantisation models. This issue is about implementing such a model end-to-end and having a working demo.
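To make the idea concrete, here is a minimal, hypothetical JAX sketch of the decoding side: frames are already encoded into discrete VQGAN codes, the context is the flattened token stream, and the language model predicts the next frame's tokens one at a time. The names (`TOKENS_PER_FRAME`, `VOCAB_SIZE`, `model_apply`) and the stand-in model are illustrative assumptions, not the repo's actual API.

```python
# Hypothetical sketch: autoregressive next-token prediction over VQGAN-encoded frames.
# model_apply is a stand-in so the sketch runs; the real model would be the repo's
# long-context language model.
import jax
import jax.numpy as jnp

TOKENS_PER_FRAME = 1024  # sberbank VQGAN: 1024 tokens per 256x256 frame
VOCAB_SIZE = 8192        # illustrative codebook size

def model_apply(params, tokens):
    """Return next-token logits given the token context (toy stand-in model)."""
    emb = params["embed"][tokens]        # (seq, dim) embedding lookup
    ctx = emb.mean(axis=0)               # crude context summary
    return ctx @ params["unembed"]       # (vocab,) logits

def generate_tokens(params, context_tokens, num_tokens):
    """Greedily decode num_tokens new tokens (one frame = TOKENS_PER_FRAME)."""
    tokens = context_tokens
    for _ in range(num_tokens):
        logits = model_apply(params, tokens)
        next_token = jnp.argmax(logits)
        tokens = jnp.concatenate([tokens, next_token[None]])
    return tokens[-num_tokens:]

key = jax.random.PRNGKey(0)
dim = 64
params = {
    "embed": jax.random.normal(key, (VOCAB_SIZE, dim)),
    "unembed": jax.random.normal(key, (dim, VOCAB_SIZE)),
}
# Two already-encoded frames as context; decode a handful of tokens of the next frame.
context = jax.random.randint(key, (2 * TOKENS_PER_FRAME,), 0, VOCAB_SIZE)
next_frame_start = generate_tokens(params, context, num_tokens=16)
```

In the actual pipeline the decoded `TOKENS_PER_FRAME` codes would be fed back through the VQGAN decoder to produce the next frame's pixels.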

ClashLuke commented 2 years ago

I did the calculation above with Sberbank's VQGAN, which is impressive at reconstruction but needs 1024 tokens to encode one 256x256 image. Example reconstruction from a video in the dataset: [image]. Even taking it out of distribution doesn't cause the image quality to suffer dramatically: [image].

However, one of the main problems with this approach is that we encode roughly 20 frames per second. So, at 1 FPS, we would need one 3090-year to encode our 23-year dataset. Luckily, there are faster models such as RQ-VAE. According to their paper, RQ-GAN is 7.5x as fast as a comparable VQ-GAN: [figure]. Using this model would allow us to encode our dataset in ~45 3090-days ($450 on vast.ai), so we should explore it and see whether the image quality is comparable.

Additionally, it uses only 256 tokens to encode an image instead of 1024, meaning we could fit four times as many frames into a context. So, instead of using 34 minutes of video as one example, we could go up to 2h16min.
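A quick back-of-the-envelope check of these numbers, assuming "2 million tokens" means a 2^21-token context and the ~20 frames/s encoding throughput quoted above (all inputs here are assumptions, not measurements from this repo):

```python
# Rough sanity check of the context-budget and encoding-cost estimates above.
CONTEXT_TOKENS = 2 ** 21           # assumed "2 million token" context
SECONDS_PER_YEAR = 365 * 24 * 3600

# Context budget per example at 1 FPS
for name, tokens_per_frame in [("VQ-GAN", 1024), ("RQ-VAE", 256)]:
    frames = CONTEXT_TOKENS // tokens_per_frame
    print(f"{name}: {frames} frames ≈ {frames / 60:.1f} minutes of video")
# VQ-GAN: 2048 frames ≈ 34.1 minutes; RQ-VAE: 8192 frames ≈ 136.5 minutes (~2h16min)

# Encoding cost for a 23-year dataset at 1 FPS
dataset_frames = 23 * SECONDS_PER_YEAR
vqgan_days = dataset_frames / 20 / 86400   # ~20 frames/s on one 3090
rqvae_days = vqgan_days / 7.5              # RQ-GAN reported as ~7.5x faster
print(f"VQ-GAN: {vqgan_days:.0f} 3090-days, RQ-VAE: {rqvae_days:.0f} 3090-days")
# ~420 3090-days (roughly one 3090-year) vs. ~56 3090-days with the 7.5x speedup,
# the same ballpark as the ~45-day estimate above.
```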

Multimodal AI Art recommended we look at this model. He runs a blog and newsletter covering (multimodal) generative AI art at a higher level than what we're doing here. Reading through that accumulated material is likely worthwhile as a literature review.