Download the dataset annotations zip from Box or Google Drive, then unzip it under shared_datasets/. The resulting shared_datasets/ folder structure is expected to be:
shared_datasets
├── README.md
├── MSRVTT_caption
├── MSRVTT_qa
...
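A minimal sketch of the download-and-unzip step above, assuming the downloaded archive is named annotations.zip (a hypothetical placeholder for whatever filename the Box/Google Drive download uses; depending on how the archive is packed you may need to move the extracted folders up one level):
# extract the annotation archive into shared_datasets/ (archive name is a placeholder)
unzip annotations.zip -d shared_datasets/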
Then, please refer to the Dataset Instruction for downloading and processing the raw videos.
Download BLIP checkpoints:
bash download_blip_checkpoints.sh
Download the Input & Output Examples zip from Box or Google Drive and unzip the folders under output_example/. The resulting output_example/ folder structure is expected to be:
output_example
├── msrvtt
├── msvd_test
├── vlep_test
└── README.md
[Update 6/17] GPT-3 Results for Video Captioning, Video Question Answering and VLEP can be downloaded here.
Launch the Docker environment with run_docker_vidil.sh:
bash run_docker_vidil.sh
Set up GPU devices: within the Docker image, set the following environment variables to configure the GPU devices:
export N_GPU=<num of gpus>
export CUDA_VISIBLE_DEVICES=<0,1,2...>
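For example, to use two GPUs (devices 0 and 1):
export N_GPU=2
export CUDA_VISIBLE_DEVICES=0,1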
The following scripts run the entire pipeline, which (1) generates frame captions; (2) generates visual tokens; and (3) generates few-shot prompts ready for GPT-3. The output folder has the following structure:
{dataset_split}
├── frame_caption
│ ├── config.yaml # config for frame captioning
│ ├── video_text_Cap.json # frame captions w/o filtering
│ ├── video_text_CapFilt.json # frame captions w/ filtering
├── input_prompts
│ ├── {output_name}.jsonl # generated few-shot prompts for GPT-3
│ ├── {output_name}__idx_2_videoid.json # mapping from line idx to video id
│ ├── {output_name}__chosen_samples.json # chosen examples in the support
│ ...
├── visual_tokenization_{encoder_name}
│ ├── config.yaml # config for visual tokenization
│ └── visual_tokens.json # raw visual tokens of each frame
All scripts should be run from the /src dir, i.e., the root directory after running the Docker image. The following are examples of running the pipeline with in-context example selection for several datasets. For additional notes on running the pipeline scripts, please refer to the Pipeline Instruction.
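A quick way to sanity-check the generated artifacts from inside the container (a sketch; the output root, the name of the {dataset_split} folder, and the CLIP visual tokenizer are assumptions, adjust them to your own run):
# hypothetical locations; adjust to your own output root and split folder
OUTPUT_ROOT=output
SPLIT_DIR=$OUTPUT_ROOT/msrvtt_train
# list frame captions and visual tokens for one split
ls $SPLIT_DIR/frame_caption $SPLIT_DIR/visual_tokenization_clip
# preview the first generated GPT-3 prompt line
head -n 1 $SPLIT_DIR/input_prompts/*.jsonl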
Since we need to sample the few-shot support set from the training set, the first time the pipeline is run for each dataset we need to perform frame captioning and visual tokenization on the training set.
For <dataset> in ["msrvtt","youcook2","vatex","msvd","vlep"]:
bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh <dataset> train <output_root>
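For instance, for MSR-VTT with a hypothetical output root output/:
bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh msrvtt train output/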
Examples of the frame caption and visual tokenization dirs can be found at output_example/msrvtt/frame_caption and output_example/msrvtt/visual_tokenization_clip.
For <dataset> in ["msrvtt","youcook2","vatex"]:
(1) Run the Standalone Frame Captioning and Visual Tokenization pipeline for the chosen <dataset>.
(2) Run the pipeline for generating video captioning prompts for <dataset>, with <split> in ["train","val","test"]:
bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection.sh <dataset> <split> <output_root> 10 42 5 caption
bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection_with_asr.sh <dataset> <split> <output_root> 10 42 5 caption_asr
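For instance, for the MSR-VTT test split (the positional arguments 10 42 5 mirror shot_10, seed_42 and N_5 in the example filename below; output/ is a hypothetical output root):
bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection.sh msrvtt test output/ 10 42 5 caption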
An example of the output prompt jsonl can be found at output_example/msrvtt/input_prompts/temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5.jsonl.
For <dataset> in ["msrvtt","msvd"]:
(1) Run the Standalone Frame Captioning and Visual Tokenization pipeline for the chosen <dataset>.
(2) Run the pipeline for generating video question answering prompts for <dataset>, with <split> in ["train","val","test"]:
bash pipeline/scripts/generate_gpt3_query_pipeline_qa_with_in_context_selection.sh <dataset> <split> <output_root> 5 42 5 question
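For instance, for the MSVD test split with 5 shots and seed 42 (matching shot_5_seed_42 in the example filename below; output/ is a hypothetical output root):
bash pipeline/scripts/generate_gpt3_query_pipeline_qa_with_in_context_selection.sh msvd test output/ 5 42 5 question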
An example of the output prompt jsonl can be found at output_example/msvd_test/input_prompts/temp_0.0_gpt3_queries_msvd_qa_clip_shot_5_seed_42.jsonl.
(1) Run the Standalone Frame Captioning and Visual Tokenization pipeline for vlep.
(2) Run the pipeline for generating VLEP prompts:
bash pipeline/scripts/generate_gpt3_query_pipeline_vlep_with_random_context_asr_multichoice.sh <dataset> <split> <output_root> 10 42
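For instance, for the VLEP test split with 10 shots and seed 42 (output/ is a hypothetical output root):
bash pipeline/scripts/generate_gpt3_query_pipeline_vlep_with_random_context_asr_multichoice.sh vlep test output/ 10 42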
An example of the output prompt jsonl can be found at output_example/vlep_test/input_prompts/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice.jsonl.
For the semi-supervised setting, we first generate pseudo labels on the training set, and then train BLIP on the pseudo-labeled dataset for retrieval.
(1) Generate the pseudo-labeled training set annotation json. Suppose the raw GPT-3 responses are stored at <gpt3_response_dir> and the input prompts dir is at <input_prompts_dir>; then run:
python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
python utils_gpt3/gpt3_response_to_jsonl.py --dataset <dataset_name> --gpt3_processed_dir <processed_response_dir> --output_dir <pseudo_label_ann_output_dir>
Examples of <gpt3_response_dir>, <input_prompts_dir>, <processed_response_dir> and <pseudo_label_ann_output_dir> can be found at output_example/msrvtt/gpt3_response, output_example/msrvtt/input_prompts, output_example/msrvtt/processed_response_dir and output_example/msrvtt/pseudo_label_ann, respectively.
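For instance, using the MSR-VTT example folders above as inputs and writing to hypothetical output locations (the dataset name msrvtt follows the naming used elsewhere in this README):
python utils_gpt3/process_gpt3_response.py --gpt3_response_dir output_example/msrvtt/gpt3_response --input_prompts_dir output_example/msrvtt/input_prompts --output_dir output/msrvtt/processed_response
python utils_gpt3/gpt3_response_to_jsonl.py --dataset msrvtt --gpt3_processed_dir output/msrvtt/processed_response --output_dir output/msrvtt/pseudo_label_ann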
(2) Finetune the pretrained BLIP on the pseudo-labeled data:
For <dataset> in ["msrvtt","vatex"], set the value of the field train_ann_jsonl in configs/train_blip_video_retrieval_<dataset>_pseudo.yaml to the path of the output jsonl from step (1) in <pseudo_label_ann_output_dir>. Then run:
bash scripts/train_caption_video.sh train_blip_video_retrieval_<dataset>_pseudo
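A minimal sketch for msrvtt, assuming the config stores the field as a top-level train_ann_jsonl: <path> entry (edit the YAML by hand if the layout differs); <pseudo_label>.jsonl stands for the jsonl produced in step (1):
# point the config at the pseudo-labeled annotations (field layout is an assumption)
sed -i 's|^train_ann_jsonl:.*|train_ann_jsonl: output/msrvtt/pseudo_label_ann/<pseudo_label>.jsonl|' configs/train_blip_video_retrieval_msrvtt_pseudo.yaml
# finetune BLIP for retrieval on the pseudo-labeled data
bash scripts/train_caption_video.sh train_blip_video_retrieval_msrvtt_pseudo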
Scripts for evaluating generation results from GPT-3:
Video Captioning: please refer to the example written in the script for more details about the required inputs
bash scripts/evaluation/eval_caption_from_gpt3_response.sh
Question Answering: please refer to the example written in the script for more details about the required inputs
bash scripts/evaluation/eval_qa_from_gpt3_response.sh
VLEP:
(1) Get the processed GPT-3 response; examples of <gpt3_response_dir>, <input_prompts_dir> and <processed_response_dir> can be found at output_example/vlep_test/gpt3_response, output_example/vlep_test/input_prompts and output_example/vlep_test/gpt3_response_processed, respectively:
python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
(2) Run the following script to generate the output in the official format for CodaLab submission; an example of the output jsonl can be found at output_example/vlep_test/evaluation/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice_eval.jsonl:
python eval_vlep.py --gpt3_processed_response <processed_response_json> --output_path <output_jsonl_path>
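For instance, using the VLEP example folders above (the processed-response json filename is not spelled out here, so <processed_response_json> stays a placeholder; the output paths are hypothetical):
python utils_gpt3/process_gpt3_response.py --gpt3_response_dir output_example/vlep_test/gpt3_response --input_prompts_dir output_example/vlep_test/input_prompts --output_dir output/vlep_test/gpt3_response_processed
python eval_vlep.py --gpt3_processed_response output/vlep_test/gpt3_response_processed/<processed_response_json> --output_path output/vlep_test/vlep_submission.jsonl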
@article{wang2022language,
title={Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners},
author={Wang, Zhenhailong and Li, Manling and Xu, Ruochen and Zhou, Luowei and Lei, Jie and Lin, Xudong and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Hoiem, Derek and others},
journal={arXiv preprint arXiv:2205.10747},
year={2022}
}
The implementation of VidIL relies on resources from BLIP, ALPRO, and transformers. We thank the original authors for their open-sourced code and encourage users to cite their works when applicable.