Closed lucasjinreal closed 5 months ago
That's a great question. Although our tokenizer is trained on the "understanding datasets", e.g., UCF-101 and Kinetics-400, the training process doesn't involve any semantic learning or alignment. Therefore, it's unclear whether using our tokenizer for video understanding will lead to satisfactory results, but I do think it's worth a try.
Will the tokenize image tokens able to do understanding?