Prepare the data as follows.
TextVid: Download TextVid alone at TextVid only, or TextVid together with preprocessed features at TextVid and features.
NExT-QA, STAR and TVQA: The preprocessed features are available here.
EgoSchema: Download raw videos from EgoSchema. We provide preprocessed features here.
MVBench: Download raw videos from Hugging Face.
MSRVTT: Download raw videos from MSRVTT.
./data
|─ nextqa
| |─ train.csv
| |─ val.csv
| └─ clipvitl14.pth
|─ star
| :
|─ tvqa
| :
└─ egos
  :
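The .pth extension suggests the preprocessed features can be loaded with torch.load. Below is a minimal sketch of how one might inspect such a file; the assumption that it maps video ids to CLIP ViT-L/14 feature tensors is ours, not something guaranteed by the repo.

import torch

# Load the preprocessed CLIP ViT-L/14 features for NExT-QA.
# Assumption: the file is a dict keyed by video id, with tensor values.
features = torch.load("./data/nextqa/clipvitl14.pth", map_location="cpu")
print(type(features), len(features))

# Peek at one entry to check the feature shape (e.g. [num_frames, feature_dim]).
first_key = next(iter(features))
print(first_key, features[first_key].shape)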
Prepare the models as follows.
LLMs: Download the pretrained Llama models from Llama2 and Llama3 and arrange them under ./pretrained as shown below.
TOPA Checkpoints: Download our pretrained models and place them under ./vqa_checkpoint.
./pretrained
|─ llama2
| |─ 7B
| | |─ consolidated.00.pth
| | └─ params.json
| |─ 13B
| | :
| └─ tokenizer.model
└─ llama3
  |─ 8B
  | |─ consolidated.00.pth
  | └─ params.json
  └─ tokenizer.model
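A small sketch to sanity-check that the Llama weights are laid out as expected before training. The file names follow Meta's original release format; the ./pretrained path is the one shown in the tree above.

import json
from pathlib import Path

# Verify the expected layout of the Llama 2 7B weights under ./pretrained.
llama_dir = Path("./pretrained/llama2")
for name in ["7B/consolidated.00.pth", "7B/params.json", "tokenizer.model"]:
    path = llama_dir / name
    print(path, "found" if path.exists() else "MISSING")

# params.json stores the model hyper-parameters (dim, n_layers, n_heads, ...).
with open(llama_dir / "7B/params.json") as f:
    print(json.load(f))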
./vqa_checkpoint
└─ checkpoint_pretrain
  |─ llama2_7b
  |─ llama2_13b
  └─ llama3_8b
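The TOPA checkpoints are consumed by the evaluation scripts below. If you want to inspect one directly, something like the following should work, assuming each directory holds standard torch.save files; the file naming inside the directories is not specified here and is our assumption.

import torch
from pathlib import Path

# Directory of one pretrained TOPA model (path from the tree above).
ckpt_dir = Path("./vqa_checkpoint/checkpoint_pretrain/llama2_7b")

# List whatever .pth files are inside, then load the first one for inspection.
ckpt_files = sorted(ckpt_dir.glob("*.pth"))
print(ckpt_files)
if ckpt_files:
    ckpt = torch.load(ckpt_files[0], map_location="cpu")
    print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))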
Pre-training:
./scripts/pretrain/llama2_7b.sh
Zero-shot evaluation:
./scripts/eval/zeroshot_eval_egos.sh
./scripts/eval/zeroshot_eval_nextqa.sh
./scripts/eval/zeroshot_eval_star.sh
./scripts/eval/zeroshot_eval_tvqa.sh
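If you prefer launching the evaluations from Python (e.g. inside a job driver), a trivial loop like the one below works; it simply shells out to the scripts listed above and stops on the first failure.

import subprocess

# Run zero-shot evaluation on all four benchmarks in sequence.
EVAL_SCRIPTS = [
    "./scripts/eval/zeroshot_eval_egos.sh",
    "./scripts/eval/zeroshot_eval_nextqa.sh",
    "./scripts/eval/zeroshot_eval_star.sh",
    "./scripts/eval/zeroshot_eval_tvqa.sh",
]
for script in EVAL_SCRIPTS:
    subprocess.run(["bash", script], check=True)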
This repo is built upon Flipped-VQA and benefits from LLaMA-Adapter, DeCap, MVBench, Llama2 and Llama3.
@article{li2024topa,
title={TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment},
author={Li, Wei and Fan, Hehe and Wong, Yongkang and Kankanhalli, Mohan and Yang, Yi},
journal={arXiv preprint arXiv:2405.13911},
year={2024}
}