WHB139426 / QA-Prompts

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [ECCV'24]
4 stars 0 forks source link

Q&A Prompts-ECCV'24

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge. This is the official implementation of the [Paper] accepted by ECCV'24.

Install

  1. Clone this repository and navigate to QA-Prompts folder

    git clone https://github.com/WHB139426/QA-Prompts.git
    cd QA-Prompts
    mkdir experiments
  2. Install Package

    conda create -n qaprompts python=3.9.16
    conda activate qaprompts
    pip install -r requirements.txt
    pip install numpy==1.26.4

Datasets

We prepare the annotations of [A-OKVQA] in ./annotations.

The images can be downloaded from [COCO2017], and you should organize the data as follows,

├── coco2017
│   └── train2017
│   └── val2017
│   └── test2017
├── QA-Prompts
│   └── annotations
│     └── aokvqa_v1p0_train.json
│     └── sub_qa.json
│     └── ...
│   └── datasets
│   └── models
│   └──...

You should also modify the parameter coco_path of argparse in finetune_ans.py/evaluation.py according to the directory of your COCO images.

Pretrained Weights of InstructBLIP

You can prepare the pretrained weights of InstructBLIP-Vicuna-7B according to [InstructBLIP].

Since we have changed the structure of the code of the model, we RECOMMEND you download the pretrained weights of EVA-CLIP, Vicuna-7b-v1.1 and QFormer directly in [🤗HF]. The pretrained weights should be downloaded into the sub folder ./experiments and organized as follows,

├── QA-Prompts
│   └── experiments
│     └── eva_vit_g.pth
│     └── qformer_vicuna.pth
│     └── query_tokens_vicuna.pth
│     └── vicuna-7b
│     └── llm_proj_vicuna.pth

Evaluation

Download the trained checkpoints vicuna_1_0.6969.pth from [🤗HF] (should be downloaded into the sub folder ./experiments), and then run the following script to reproduce the results.

python evaluation.py

Training

We recommend using GPUs with memory > 24G. Otherwise, you may need to extract the vision features in advance to save the memory usage of EVA-CLIP and avoid OOM. Modify the parameter world_size of argparse in finetune_ans.py according to the number of GPUs.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1111 finetune_ans.py