bytedance / 1d-tokenizer

This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Apache License 2.0

Experiments with video tokenization. #37

Open NilanEkanayake opened 3 days ago

NilanEkanayake commented 3 days ago

I made some changes to the model (3D convs) and trained the small one with 128 tokens on 128p, 16-frame videos pre-compressed with CogVideoX's VAE, using MSE loss. It turned out better than I expected considering how fast the training was on consumer hardware (a couple of hours).
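The 3D-conv change described above could look roughly like this: a minimal, hypothetical stand-in for a patch embedding that ingests a latent video instead of an image. The channel count, patch size, and latent shape here are illustrative assumptions, not the poster's actual config.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: swap a 2D patch embedding for a 3D conv so the
# encoder takes (B, C, T, H, W) VAE latents. All sizes are assumptions.
class Latent3DPatchEmbed(nn.Module):
    def __init__(self, in_ch=16, dim=512, patch=(2, 2, 2)):
        super().__init__()
        # Non-overlapping 3D patches -> one token per spatio-temporal patch
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, C, T, H, W) latents
        x = self.proj(x)                     # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim)

embed = Latent3DPatchEmbed()
latents = torch.randn(1, 16, 4, 16, 16)  # e.g. a CogVideoX-style latent clip
tokens = embed(latents)
print(tokens.shape)  # torch.Size([1, 128, 512])
```

With these (assumed) sizes, a clip conveniently becomes 128 tokens before the rest of the encoder runs.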

There's a lot of potential here, and I think I can improve the performance a lot further.

[Reconstruction sample images attached]

tanzheen commented 3 days ago

@NilanEkanayake I would be extremely interested in this because I am currently trying to tokenise sign language videos to feed into an LLM for translation tasks!

NilanEkanayake commented 3 days ago

> @NilanEkanayake I would be extremely interested in this because I am currently trying to tokenise sign language videos to input into an LLM here for translation tasks!

It compresses fixed-length videos, so I'm not sure how well it would work for that. You'd have to string multiple tokenized clips together depending on the length of the input.
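Stringing tokenized clips together could be sketched like this. The `tokenize` function is a hypothetical stand-in for the real model; the 16-frame chunk size and 128-token output match the setup described earlier, but the padding strategy is an assumption.

```python
# Hypothetical sketch: split a variable-length video into fixed-length
# chunks, tokenize each, and concatenate the token sequences.
CHUNK = 16

def tokenize(clip):
    # Stand-in: the real tokenizer maps a 16-frame clip to 128 discrete tokens
    return [hash((len(clip), i)) % 4096 for i in range(128)]

def tokenize_video(frames):
    tokens = []
    for start in range(0, len(frames), CHUNK):
        chunk = frames[start:start + CHUNK]
        if len(chunk) < CHUNK:                       # pad the tail chunk
            chunk = chunk + [chunk[-1]] * (CHUNK - len(chunk))
        tokens.extend(tokenize(chunk))
    return tokens

video = list(range(40))   # 40 frames -> 3 chunks of 16 (last one padded)
seq = tokenize_video(video)
print(len(seq))           # 384 tokens (3 * 128)
```

Token count grows linearly with video length, which is the main cost of this approach for an LLM's context window.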

You might have better luck training a custom model from scratch, where the model takes in the videos and produces a translation, instead of using an LLM with a video tokenizer on top.

Have you tried using pose estimation methods to feed keypoints to the LLM instead? That would bypass the tokenizer quality issues and be a lot more flexible.
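The pose route could be as simple as quantizing per-frame keypoints (e.g. from an off-the-shelf pose estimator) into a small discrete vocabulary for the LLM. This is a hypothetical sketch; the bin count and token scheme are assumptions.

```python
# Hypothetical sketch: turn (x, y) pose keypoints in [0, 1] image
# coordinates into discrete tokens, one token per joint per frame.
BINS = 32  # quantization bins per coordinate (assumed)

def keypoints_to_tokens(keypoints):
    """keypoints: list of (x, y) pairs in [0, 1] image coordinates."""
    tokens = []
    for x, y in keypoints:
        xb = min(int(x * BINS), BINS - 1)
        yb = min(int(y * BINS), BINS - 1)
        tokens.append(xb * BINS + yb)  # flatten the 2D bin into one token id
    return tokens

frame = [(0.50, 0.25), (0.10, 0.90)]  # two example joints
print(keypoints_to_tokens(frame))     # [520, 124]
```

The vocabulary stays tiny (32 * 32 = 1024 ids here), and the representation is resolution- and appearance-invariant, which is why it can be more flexible than learned video tokens for sign language.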