NilanEkanayake opened this issue 1 month ago
@NilanEkanayake I would be extremely interested in this because I am currently trying to tokenise sign language videos to input into an LLM here for translation tasks!
It compresses fixed-length videos, so I'm not sure how well it would work for that. You'd have to chunk the input and string multiple tokenized clips together depending on its length, something like the sketch below.
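Untested sketch of what I mean; `tokenize_clip` is a stand-in for whatever the tokenizer's actual encode call is, and the 16-frame clip length is an assumption:

```python
import torch

CLIP_LEN = 16  # frames per clip the tokenizer expects (assumed)

def tokenize_long_video(frames: torch.Tensor, tokenize_clip) -> torch.Tensor:
    """frames: [T, C, H, W]; returns the concatenated token ids of all clips."""
    # Pad the tail by repeating the last frame so the final chunk is full-length.
    pad = (-frames.shape[0]) % CLIP_LEN
    if pad:
        frames = torch.cat([frames, frames[-1:].repeat(pad, 1, 1, 1)], dim=0)
    # Tokenize each fixed-length clip independently, then concatenate.
    return torch.cat([tokenize_clip(c) for c in frames.split(CLIP_LEN, dim=0)])
```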
You might have better luck training a custom model from scratch that takes in the videos and produces a translation directly, instead of stacking an LLM on top of a video tokenizer.
Have you tried feeding pose-estimation output to the LLM instead? That would sidestep the tokenizer's quality limits and be a lot more flexible; see the rough sketch below.
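Extracting keypoints per frame is only a few lines with MediaPipe Holistic, to give one concrete option (untested sketch; the downstream step of serializing the keypoints into LLM tokens is left out):

```python
import cv2
import mediapipe as mp

def extract_keypoints(video_path: str):
    """Return per-frame (x, y, z) landmarks for body pose and both hands."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    per_frame = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input.
        res = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        kps = []
        for lms in (res.pose_landmarks, res.left_hand_landmarks,
                    res.right_hand_landmarks):
            if lms:  # a body part can be missing in a given frame
                kps.extend((lm.x, lm.y, lm.z) for lm in lms.landmark)
        per_frame.append(kps)
    cap.release()
    holistic.close()
    return per_frame
```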
I made some changes to the model (3D convs) and trained the small one with 128 tokens on 128p, 16-frame videos pre-compressed with CogVideoX's VAE, training with MSE loss. It turned out better than I expected, considering how fast the training was on consumer hardware (a couple of hours).
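The pre-compression step would look something like this with the CogVideoX VAE from diffusers (untested sketch; the checkpoint name, dtype, and input layout are assumptions):

```python
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda").eval()

@torch.no_grad()
def precompress(video: torch.Tensor) -> torch.Tensor:
    """video: [B, C, F, H, W] pixel values scaled to [-1, 1] -> VAE latents."""
    latents = vae.encode(video.to("cuda", torch.float16)).latent_dist.sample()
    # Scale latents the same way the diffusion pipelines do.
    return latents * vae.config.scaling_factor
```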
There's a lot of potential here, and I think I can push the performance quite a bit further.