Aurora is a parameter-efficient transfer learning (PETL) method for large multimodal models. It uses mode approximation to further reduce the number of trainable parameters and to promote fusion across modalities.
1. Comparison with other PETL methods
2. Overall architecture
COCO2014: download the dataset from https://cocodataset.org/#download. On Linux, a command such as [wget -c http://images.cocodataset.org/annotations/annotations_trainval2014.zip] fetches the annotation archive directly.
Flickr30k: download the dataset from https://shannon.cs.illinois.edu/DenotationGraph/data/index.html, or alternatively from https://pan.baidu.com/s/1r0RVUwctJsI0iNuVXHQ6kA (password: hrf3).
MSRVTT: download the video dataset from https://www.mediafire.com/folder/h14iarbs62e7p/shared, and the corresponding annotation file from https://mega.nz/file/UnRnyb7A#es4XmqsLxl-B7MP0KAat9VibkH7J_qpKj9NcxLh8aHg.
DiDemo: download the dataset via the GitHub project https://github.com/jpthu17/EMCL.
VQAv2: download the COCO-based data from https://visualqa.org/download.html, and the additional Visual Genome (VG) data via the GitHub project https://github.com/jayleicn/ClipBERT.
VideoQA: the videos come from MSRVTT, and the annotation file can be downloaded via the GitHub project https://github.com/jayleicn/ClipBERT.
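For readers who prefer to script the COCO annotation download instead of calling wget, the following is a minimal sketch using only the Python standard library. The helper names `local_name` and `download_and_extract` and the target directory `data/coco` are illustrative assumptions, not part of this repository. Unlike `wget -c`, this sketch does not resume a partially downloaded file; it only skips the download when the archive is already present.

```python
import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import urlparse

# URL taken from the wget command in the dataset list above.
COCO_ANN_URL = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip"


def local_name(url: str) -> str:
    """Return the file name component of a URL."""
    return Path(urlparse(url).path).name


def download_and_extract(url: str, dest_dir: str) -> Path:
    """Download a zip archive into dest_dir and unpack it there.

    The download is skipped if a file with the same name already exists.
    Returns the path of the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / local_name(url)
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return archive


# Example call ("data/coco" is a hypothetical target directory):
# download_and_extract(COCO_ANN_URL, "data/coco")
```

The other datasets listed above are hosted on file-sharing services or inside GitHub projects, so they generally need to be fetched manually.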
Download the COCO and Flickr30k datasets, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
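If you would rather set the image root from a script than edit the file by hand, here is a minimal sketch. It assumes the key is spelled `image_root` and sits on its own top-level `key: value` line in the YAML file; `set_config_value` is a hypothetical helper, and the sample config text is made up for illustration. It works on plain text, so comments and other keys are left untouched, but it is not a YAML parser and the value must not contain a single quote.

```python
import re


def set_config_value(yaml_text: str, key: str, value: str) -> str:
    """Replace the value of the first top-level `key: ...` line.

    Raises KeyError if no such line exists.
    """
    pattern = re.compile(rf"^{re.escape(key)}\s*:.*$", flags=re.MULTILINE)
    new_line = f"{key}: '{value}'"
    # A function is used as the replacement so that backslashes in the
    # value are not interpreted as regex escapes.
    updated, count = pattern.subn(lambda _m: new_line, yaml_text, count=1)
    if count == 0:
        raise KeyError(f"{key!r} not found as a top-level key")
    return updated


# Hypothetical config content, for illustration only.
sample = "image_root: '/old/path/'\nbatch_size_train: 32\n"
print(set_config_value(sample, "image_root", "/data/coco/images/"))
```

To apply it to a real file, read the config with `pathlib.Path.read_text()`, pass the text through `set_config_value`, and write the result back with `write_text()`.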
To run parameter-efficient fine-tuning on MSCOCO or Flickr30k (pick one of coco or flickr in each pair of braces):
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_{coco, flickr}.yaml --output_dir output/{coco, flickr}
To evaluate on MSCOCO or Flickr30k (again picking one of coco or flickr):
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_{coco, flickr}.yaml --output_dir output/{coco, flickr} --evaluate
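The `{coco, flickr}` notation in the two commands above is a placeholder rather than shell syntax. As a sketch of how the concrete command for each dataset can be assembled, the hypothetical helpers below build the argument list from the script name, config path, and output directory shown above; only flags that appear in this README are used. The helper names are assumptions and not part of this repository.

```python
import sys
from typing import List


def build_command(script: str, config: str, output_dir: str,
                  nproc_per_node: int = 8, evaluate: bool = False) -> List[str]:
    """Assemble the torch.distributed.run invocation as an argument list."""
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        script,
        "--config", config,
        "--output_dir", output_dir,
    ]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


def retrieval_command(dataset: str, evaluate: bool = False) -> List[str]:
    """Fill in the {coco, flickr} placeholder for image-text retrieval."""
    if dataset not in ("coco", "flickr"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    return build_command(
        "train_retrieval.py",
        f"./configs/retrieval_{dataset}.yaml",
        f"output/{dataset}",
        evaluate=evaluate,
    )


# Print the evaluation command for COCO without running it.
print(" ".join(retrieval_command("coco", evaluate=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the job. Adjust `nproc_per_node` to the number of GPUs available on your machine.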
Download the VQAv2 and Visual Genome datasets, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
To run parameter-efficient fine-tuning on VQAv2:
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --config ./configs/vqa.yaml --output_dir $static_dir
To evaluate on VQAv2 (the result file must be uploaded to the official evaluation server at https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278):
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --config ./configs/vqa.yaml --output_dir $static_dir --evaluate
Download the MSRVTT and DiDemo datasets, and set 'video_root' and 'ann_root' in configs/retrieval_{dataset}.yaml accordingly.
To run parameter-efficient fine-tuning on MSRVTT:
python -m torch.distributed.run --nproc_per_node=8 train_video_retrieval.py --config ./configs/retrieval_msrvtt.yaml --output_dir $static_dir
To run parameter-efficient fine-tuning on DiDemo:
python -m torch.distributed.run --nproc_per_node=8 train_video_retrieval.py --config ./configs/retrieval_didemo.yaml --output_dir $static_dir
To run parameter-efficient fine-tuning on VideoQA:
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --config ./configs/videoqa.yaml --output_dir $static_dir
Our codebase is built on BLIP, timm, and transformers. We thank the authors for their well-organized code!
If you use this code in your research, please cite the following papers:
@article{wang2023mode,
title={Mode Approximation Makes Good Vision-Language Prompts},
author={Wang, Haixin and Yang, Xinlong and Chang, Jianlong and Jin, Dian and Sun, Jinan and Zhang, Shikun and Luo, Xiao and Tian, Qi},
journal={arXiv preprint arXiv:2305.08381},
year={2023}
}
@inproceedings{wang2023parameter,
title={Parameter-efficient Tuning of Large-scale Multimodal Foundation Model},
author={Wang, Haixin and Yang, Xinlong and Chang, Jianlong and Jin, Dian and Sun, Jinan and Zhang, Shikun and Luo, Xiao and Tian, Qi},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}