dvlab-research / LLaMA-VID

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Apache License 2.0

Are all video-based checkpoints trained with 2 tokens? #82

Open haodi19 opened 5 months ago

haodi19 commented 5 months ago

Hello, thank you for your great work. I noticed that in the released checkpoints, all checkpoints trained on video data have the compress type set to 'mean' (or 'mean_concat', though I couldn't find the corresponding logic in the code). Are all video-based checkpoints, regardless of whether the training data consists of short or long videos, trained with 2 tokens per frame?
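For context, the paper's "2 tokens per frame" means each frame is represented by a context token plus a content token; with a 'mean' compress type the content token is presumably obtained by mean-pooling the frame's patch features. A minimal sketch of that pooling step (the function name, shapes, and NumPy usage here are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

def mean_compress(frame_features: np.ndarray) -> np.ndarray:
    """Collapse one frame's patch features (N, D) into a single
    content token (1, D) by mean pooling -- a hypothetical sketch
    of what a 'mean' compress type could do; the real logic lives
    in the LLaMA-VID model code."""
    return frame_features.mean(axis=0, keepdims=True)

# Hypothetical example: 256 patch tokens of dimension 4096 per frame.
frame = np.random.randn(256, 4096)
content_token = mean_compress(frame)
# Paired with a separately generated context token, this gives the
# paper's 2 tokens per frame.
print(content_token.shape)  # (1, 4096)
```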