FoundationVision / OmniTokenizer

OmniTokenizer: one model and one weight for image-video joint tokenization.
https://www.wangjunke.info/OmniTokenizer/
MIT License
234 stars 5 forks source link

Will the tokenize image tokens able to do understanding? #4

Closed lucasjinreal closed 3 months ago

lucasjinreal commented 3 months ago

Will the tokenize image tokens able to do understanding?

wdrink commented 3 months ago

That's a great question. Although our tokenizer is trained on the "understanding datasets", e.g., UCF-101 and Kinetics-400, the training process doesn't involve any semantic learning or alignment. Therefore, it's unclear whether using our tokenizer for video understanding will lead to satisfactory results, but I do think it's worth a try.