TXH-mercury / VAST

Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
https://arxiv.org/abs/2305.18500
MIT License

[NeurIPS 2023] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset


Building Environment

VAST is implemented based on PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The other required packages are listed in preinstall.sh.

conda create -n vast python=3.9
conda activate vast
sh preinstall.sh
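To quickly confirm that PyTorch sees the GPU before going further, a one-line check like this can help (illustrative only, not part of the repo):

python -c "import torch; print('torch', torch.__version__, '| cuda available:', torch.cuda.is_available())"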

Download the basic encoders' pretrained checkpoints

Make a directory named pretrained_weights under the main working directory.

1. Download the EVA-CLIP weight:

wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt

2. Download the BEATs weight from https://github.com/microsoft/unilm/tree/master/beats
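The BEATs checkpoint is distributed through the repository above rather than via a direct URL; after downloading it, place it where the directory tree below expects it (the source path here is just a placeholder):

mkdir -p pretrained_weights/beats
mv /path/to/BEATs_iter3_plus_AS2M.pt pretrained_weights/beats/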

3. Download the BERT weight:

from transformers import BertModel, BertTokenizer
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')

The processed pretrained_weights path should be as follows:

    ├── pretrained_weights
    │   ├── beats
    │   │   └── BEATs_iter3_plus_AS2M.pt
    │   ├── bert
    │   │   └── bert-base-uncased
    │   ├── clip
    │   │   └── EVA01_CLIP_g_14_psz14_s11B.pt
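Optionally, you can check that everything landed in the expected places (paths taken from the tree above; this snippet is only a convenience, not part of the repo):

for f in pretrained_weights/clip/EVA01_CLIP_g_14_psz14_s11B.pt \
         pretrained_weights/beats/BEATs_iter3_plus_AS2M.pt; do
  [ -f "$f" ] && echo "found $f" || echo "MISSING $f"
done
[ -d pretrained_weights/bert/bert-base-uncased ] && echo "found bert-base-uncased" || echo "MISSING bert-base-uncased"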

Download VAST models and captioners (for labeling your own data)

Make a directory named output under the main working directory.

1. Download the VAST model (optional, for finetuning):

[Google Drive Link] [Baidu Cloud Link]

2. Download the vision captioner (optional, for labeling images/videos):

[Google Drive Link] [Baidu Cloud Link]

3. Download the audio captioner (optional, for labeling audio):

[Google Drive Link] [Baidu Cloud Link]

The processed output path should be as follows:

    ├── output
    │   ├── vast
    │   │   ├── pretrain_vast
    │   │   ├── vision_captioner
    │   │   └── audio_captioner

Download VAST-27M annotations for pretraining

[Google Drive Link] [Baidu Cloud Link]

Raw videos can be downloaded from YouTube.
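Assuming the clip ids follow the pattern <youtube_id>_<clip_index> (as in the meta.json example later in this README), one way to fetch a raw video is with yt-dlp, an external tool not shipped with this repo (the output path below is just an example):

pip install yt-dlp
yt-dlp -o "raw_videos/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=09WssDay9FE"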

Download downstream datasets annotations for finetuning

Make a directory named datasets under the main working directory.

[Google Drive Link] [Baidu Cloud Link]

The processed datasets path should be as follows:

    ├── datasets
    │   ├── annotations
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd
    │   ├── srcdata
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd

The srcdata (images/videos/audio) must be collected by yourself.

Finetune Model

Pretrain Model

sh scripts/pretrain_vast.sh

Test your finetuned Model

For example, if the command for finetuning the retrieval model is as follows:

python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \

If you want to test the model, just add the following two lines to the command:

--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
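Put together, the full testing command is identical to the finetuning command above with the two testing options appended:

python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt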

Labeling your own data with VAST's captioners

You need to prepare:

1) a folder containing all videos/images or audio files;

2) a meta.json composed of [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...];

and then write the config file. (A sketch for generating meta.json follows the commands below.)

sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh
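A minimal sketch for generating such a meta.json from a folder of .mp4 files, assuming each file name (without the extension) is the video_id:

# build meta.json with one {"video_id": ...} entry per clip file
printf '[' > meta.json
first=1
for f in /path/to/videos/*.mp4; do
  id=$(basename "$f" .mp4)
  [ $first -eq 1 ] && first=0 || printf ',' >> meta.json
  printf '{"video_id": "%s"}' "$id" >> meta.json
done
printf ']\n' >> meta.json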

The following options can be passed on the command line to override the corresponding values in the config files (an example follows the list):

--train_vision_sample_num
--test_vision_sample_num
--train_audio_sample_num
--test_audio_sample_num
--train_task
--test_task
--learning_rate
--train_batch_size
--test_batch_size
--train_epoch
--train_steps
--checkpointing
--frozen_vision
--valid_freq
--beam_size
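For example, appending these options to the run.py command overrides the corresponding values in the JSON config (the values below are placeholders, not recommended settings):

python3 -m torch.distributed.launch --nnodes 1 --node_rank 0 --nproc_per_node 8 --master_port 9834 ./run.py \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--output_dir $output_dir/downstream/retrieval-msrvtt \
--test_vision_sample_num 8 \
--test_batch_size 64 \
--learning_rate 1e-5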

Citation

If you find this code useful for your research, please consider citing:

@article{chen2024vast,
  title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

License

This project is released under the MIT license.

Third-Party Licenses

For the full list of third-party licenses used in this project, please see the THIRD_PARTY_LICENSES.md file.