Aurora

[NeurIPS 2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Paper: https://arxiv.org/abs/2305.08381

Introduction

Aurora is a parameter-efficient tuning (PETL) method for large multimodal foundation models. It uses mode approximation to further reduce the number of trainable parameters and to promote the fusion of different modalities; a minimal illustrative sketch follows the figures below.

1. Comparison with other PETL methods (figure)

2. Overall architecture (figure)
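To make the idea concrete, here is a minimal, self-contained PyTorch sketch of a mode-approximation-style update: a frozen pretrained linear layer receives a low-rank additive increment built from globally shared factor matrices U and V plus a small per-layer coefficient vector, and only those factors are trainable. The class name, rank, and factor-sharing scheme are illustrative assumptions for this README, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class ModeApproxLinear(nn.Module):
    """Illustrative sketch (not Aurora's exact code): frozen linear layer
    plus a CP-style low-rank increment delta_W = U diag(lambda) V^T,
    where U and V can be shared globally and only a small coefficient
    vector lambda is specific to this layer."""

    def __init__(self, base: nn.Linear, shared_u: nn.Parameter,
                 shared_v: nn.Parameter, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # keep pretrained weights frozen
            p.requires_grad_(False)
        self.u = shared_u                         # (out_features, rank), shared across layers
        self.v = shared_v                         # (in_features, rank), shared across layers
        self.lam = nn.Parameter(torch.zeros(rank))  # per-layer trainable coefficients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank increment applied on top of the frozen projection.
        delta_w = (self.u * self.lam) @ self.v.t()          # (out_features, in_features)
        return self.base(x) + nn.functional.linear(x, delta_w)


# Toy usage: wrap a frozen 768-d projection; only the shared factors and lambda train.
rank, d_in, d_out = 8, 768, 768
shared_u = nn.Parameter(torch.randn(d_out, rank) * 0.02)
shared_v = nn.Parameter(torch.randn(d_in, rank) * 0.02)
layer = ModeApproxLinear(nn.Linear(d_in, d_out), shared_u, shared_v, rank)
out = layer(torch.randn(4, d_in))
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```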

Getting Started

Requirements

Datasets

1. Image-text Retrieval Task

COCO2014: download the dataset from https://cocodataset.org/#download. On Linux, you can fetch the annotation archive directly, e.g. wget -c http://images.cocodataset.org/annotations/annotations_trainval2014.zip (a Python download sketch also appears after this list).

Flickr30k: download the dataset from https://shannon.cs.illinois.edu/DenotationGraph/data/index.html, or from this Baidu link: https://pan.baidu.com/s/1r0RVUwctJsI0iNuVXHQ6kA (password: hrf3).
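If you prefer to script the download, the following Python sketch fetches and unpacks the COCO2014 annotation archive referenced above. The destination folder data/coco2014 is an arbitrary example, not a path required by this codebase.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the instructions above; the destination folder is only an example.
URL = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip"
dest = Path("data/coco2014")
dest.mkdir(parents=True, exist_ok=True)

archive = dest / "annotations_trainval2014.zip"
if not archive.exists():
    print(f"Downloading {URL} ...")
    urllib.request.urlretrieve(URL, archive)

with zipfile.ZipFile(archive) as zf:
    zf.extractall(dest)  # unpacks into data/coco2014/annotations/
print("Done:", sorted(p.name for p in (dest / "annotations").iterdir()))
```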

2. Video-text Retrieval Task

MSRVTT: download the video dataset from https://www.mediafire.com/folder/h14iarbs62e7p/shared, and the corresponding annotation file from https://mega.nz/file/UnRnyb7A#es4XmqsLxl-B7MP0KAat9VibkH7J_qpKj9NcxLh8aHg.

DiDemo: download the dataset via this GitHub project: https://github.com/jpthu17/EMCL.

3. Visual Question Answering Task

VQAv2: the COCO data can be downloaded from https://visualqa.org/download.html, and the additional VG dataset can be downloaded via this GitHub project: https://github.com/jayleicn/ClipBERT.

VideoQA: the video data comes from MSRVTT, and the annotation file can be downloaded via this GitHub project: https://github.com/jayleicn/ClipBERT.

Image-text Retrieval

Visual Question Answering

Video-text Retrieval and VideoQA

Acknowledgement

Our codebase is built on BLIP, timm, and transformers. We thank the authors for their nicely organized code!

How To Cite Aurora

If you use this code in your research, please cite the following papers:

@article{wang2023mode,
  title={Mode Approximation Makes Good Vision-Language Prompts},
  author={Wang, Haixin and Yang, Xinlong and Chang, Jianlong and Jin, Dian and Sun, Jinan and Zhang, Shikun and Luo, Xiao and Tian, Qi},
  journal={arXiv preprint arXiv:2305.08381},
  year={2023}
}

@inproceedings{wang2023parameter,
  title={Parameter-efficient Tuning of Large-scale Multimodal Foundation Model},
  author={Wang, Haixin and Yang, Xinlong and Chang, Jianlong and Jin, Dian and Sun, Jinan and Zhang, Shikun and Luo, Xiao and Tian, Qi},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}