Code and models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
https://arxiv.org/abs/2304.08345

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Building Environment

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
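
After installing, you can sanity-check that the CUDA build of PyTorch is available (a quick verification, not part of the original instructions):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"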

Download Checkpoints

| Model   | Pretrained Ckpt | Finetuned Ckpt on MSRVTT-Retrieval | Finetuned Ckpt on MSRVTT-Caption |
|---------|-----------------|------------------------------------|----------------------------------|
| VALOR-B | VALOR-base      | VALOR_base_msr_ret.pt              | VALOR_base_msr_cap.pt            |
| VALOR-L | VALOR-large     | VALOR_large_msr_ret.pt             | VALOR_large_msr_cap.pt           |

Put the downloaded VALOR-base and VALOR-large checkpoints under the output directory (VALOR/output/VALOR-base and VALOR/output/VALOR-large).
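
For example (a minimal sketch; it assumes the downloaded pretrained checkpoints are named VALOR-base and VALOR-large as in the table above):

mkdir -p output
mv VALOR-base output/    # adjust if the download is an archive or a single .pt file
mv VALOR-large output/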

Prepare Datasets

VALOR is pretrained and tested on multiple vision-language, audio-language, and audiovisual-language datasets, e.g.

PRETRAIN: VALOR-1M, WebVid-2.5M, CC-3M (VALOR-base)
TEST: VALOR-32K, MSRVTT, MSVD, DiDeMo, LSMDC, ActivityNet, VATEX, AudioCaps, ClothoV1, TGIF-Frame, MSCOCO, VQAV2...

Here we take MSRVTT as an example to show the data processing procedure; other datasets are handled in a similar way.

The processed dataset path should be as follows:

    ├── datasets
    │   ├── msrvtt
    │   │   ├── raw_videos
    │   │   │    ├── video0.mp4
    │   │   │    └── video1.mp4
    │   │   ├── frames_fps4
    │   │   │    ├── video0
    │   │   │    │   ├── img_0001.jpg
    │   │   │    │   └── img_0002.jpg
    │   │   │    └── video1
    │   │   │        ├── img_0001.jpg
    │   │   │        └── img_0002.jpg
    │   │   ├── audio_22050hz
    │   │   │    ├── video1.wav
    │   │   │    └── video3.wav
    │   │   ├── standardsplit_train_id.json
    │   │   ├── standardsplit_test_id.json
    │   │   ├── 1KAsplit_train_id.json
    │   │   ├── 1KAsplit_test_id.json
    │   │   ├── txt_mapper.json
    │   │   ├── txt_mapper_1kAsplit_test.json    
    │   │   ├── txt_mapper_vqa.json    
    │   │   └── caption_annotation.json    

We provide processed json files for most finetuning datasets here, so you only need to download and extract the raw videos of each dataset.
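
If you need to build the frames_fps4 and audio_22050hz directories yourself, something along these lines should work with ffmpeg (a sketch: the 4 fps frame rate, 22050 Hz sample rate, mono channel, and img_%04d.jpg naming are inferred from the directory layout above, so check the repo's preprocessing scripts for the exact settings):

mkdir -p datasets/msrvtt/frames_fps4/video0 datasets/msrvtt/audio_22050hz
# extract frames at 4 fps
ffmpeg -i datasets/msrvtt/raw_videos/video0.mp4 -vf fps=4 datasets/msrvtt/frames_fps4/video0/img_%04d.jpg
# extract audio resampled to 22050 Hz (mono is an assumption)
ffmpeg -i datasets/msrvtt/raw_videos/video0.mp4 -vn -ac 1 -ar 22050 datasets/msrvtt/audio_22050hz/video0.wav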

Finetune Model

For example, the command for finetuning the retrieval model in scripts/finetune_ret.sh is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8   --master_port 32711 ./train.py \
--pretrain_dir $basedir \
--config ./config/fast-retrieval-msrvtt.json \
--output_dir $basedir'/ret-msrvtt-lr2e-5-bs64-epoch5'   \
--learning_rate 2e-5  \
--train_video_sample_num 4 \
--test_video_sample_num 8  \
--save_best true

Test Model

To test a model, just add the following two lines to the command above (pointing --checkpoint at a saved .pt file):

--zero_shot \
--checkpoint $checkpoint_save_path
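
Put together, a test invocation assembled from the snippets above might look like this (paths are the same placeholders as above):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 32711 ./train.py \
--pretrain_dir $basedir \
--config ./config/fast-retrieval-msrvtt.json \
--output_dir $basedir'/ret-msrvtt-lr2e-5-bs64-epoch5' \
--learning_rate 2e-5 \
--train_video_sample_num 4 \
--test_video_sample_num 8 \
--save_best true \
--zero_shot \
--checkpoint $checkpoint_save_path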

Pretrain Model

sh scripts/pretrain.sh

Inference

For the QA task

python inference.py --video_path $VIDEOPATH --task 'qa%tva' --model_dir $MODELDIR --question 'what is in the video'

For the captioning task

python inference.py --video_path $VIDEOPATH --task 'cap%tva' --model_dir $MODELDIR 

Customize

VALOR's framework is easy to extend to new tasks/datasets. All you need to do is:

  1. prepare the dataset as illustrated above
  2. write a config file (copy an existing config file and change its 'data_cfg'), as sketched below
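
For example, to add a new retrieval dataset you might start from an existing config (the new file and dataset names below are placeholders; see the files under ./config for the actual 'data_cfg' schema):

cp config/fast-retrieval-msrvtt.json config/fast-retrieval-mydataset.json
# edit the 'data_cfg' section of the new file so it points to
# datasets/mydataset, prepared as in the MSRVTT example above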

Citation

If you find this code useful for your research, please consider citing:

@article{chen2023valor,
  title={VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset},
  author={Chen, Sihan and He, Xingjian and Guo, Longteng and Zhu, Xinxin and Wang, Weining and Tang, Jinhui and Liu, Jing},
  journal={arXiv preprint arXiv:2304.08345},
  year={2023}
}

License

MIT