SanniM3 / video_summarisation_git

GIT with scene change based frame sampling
MIT License

Technical Report

The technical report can be found here

Additional Setup for UoE Team & Convenience Scripts

The following additional setup steps are needed to get this working. You should still follow all of the directions in the original "Introduction" section further below.

Installation

  1. Install additional requirements
    pip install -r requirements_2.txt
  2. Download pretrained model
    ./setup_download_model.sh
  3. Install java
    sudo apt install default-jdk
  4. GIT model setup
    pip install -r requirements.txt
    python setup.py build develop
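
As a quick sanity check after installation, here is a minimal sketch (a hypothetical helper, not part of the repo) that verifies the key pieces are in place; the generativeimage2text module name is taken from the commands used later in this README.

# sanity_check.py: hypothetical helper, not part of the repo.
# Verifies that torch and the GIT package import, and that java is on the PATH.
import shutil

import torch
import generativeimage2text  # built by `python setup.py build develop`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("java on PATH:", shutil.which("java") is not None)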

General Workflow

1. Download Data

Calling ./setup_download_data.sh will do this for you and set up the following directory structure:

video_summarisation_git/data/
|
|-- category.txt ...................... # video category name to id mapping file
|
|-- train_val/ ........................ # dir for training & validation sets
|   |-- train_val_videodatainfo.json .. # annotation file
|   |-- pyscenedetect_frames/ ......... # dir for pyscenedetect sampled frames
|   |-- random_frames/ ................ # auto-generated: dir for randomly sampled frames
|   |-- transnet_frames/ .............. # auto-generated: dir for transnet sampled frames
|   `-- videos/ ....................... # parent dir for videos (each video should have its own folder inside this dir)
|
`-- test/ ............................. # dir for test set (structure same as train_val)                             
    |-- test_videodatainfo.json ....... # annotation file
    `-- [...] 
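
To confirm the download script produced this layout, here is a minimal sketch (check_data_layout.py is a hypothetical name, not a repo script) that checks the paths listed above; the *_frames directories are created later by the sampling step, so they are not checked here.

# check_data_layout.py: hypothetical helper based only on the tree above.
from pathlib import Path

required = [
    Path("data/category.txt"),
    Path("data/train_val/train_val_videodatainfo.json"),
    Path("data/train_val/videos"),
    Path("data/test/test_videodatainfo.json"),
]
for path in required:
    print(("OK      " if path.exists() else "MISSING ") + str(path))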

2. Sample Frames

You can download presampled frames here

Download and unzip them into data/train_val or data/test as appropriate.

Or generate them yourself with a script (this takes approximately 40 hours on a K80, or 6+ hours on an A100):

./setup_sample_frames.sh train # sample frames for training data
./setup_sample_frames.sh test # same for test

Or do it piecemeal by hand:

Open ./setup_sample_frames.sh to get an idea of the commands to run for each sampling method.

Alternatively, you can look at the actual samplers in /sampling_scripts
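
To get a feel for what scene-change-based sampling does, here is a minimal, self-contained sketch using PySceneDetect and OpenCV that saves the middle frame of each detected scene. It only illustrates the idea; the samplers in /sampling_scripts are the reference implementations, and the file paths below are assumptions.

# sample_scene_frames.py: illustrative only, not the repo's sampler.
from pathlib import Path

import cv2
from scenedetect import ContentDetector, detect  # pip install scenedetect[opencv]

def sample_scene_frames(video_path: str, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    for i, (start, end) in enumerate(scenes):
        mid = (start.get_frames() + end.get_frames()) // 2  # middle frame of the scene
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(Path(out_dir) / f"scene_{i:03d}.jpg"), frame)
    cap.release()

# hypothetical paths, following the directory structure from step 1
sample_scene_frames("data/train_val/videos/video0/video0.mp4",
                    "data/train_val/pyscenedetect_frames/video0")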

3. Create Training CSV

# for random frames
python command_builder/training_command.py -d data/train_val/random_frames/ -c data/train_val/train_val_videodatainfo.json

# or for transnet frames
python command_builder/training_command.py -d data/train_val/transnet_frames/ -c data/train_val/train_val_videodatainfo.json

# or for pyscenedetect frames
python command_builder/training_command.py -d data/train_val/pyscenedetect_frames/ -c data/train_val/train_val_videodatainfo.json
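
The command builder writes processed_data_train.csv and processed_data_validate.csv into the chosen frame directory (these paths reappear in the finetuning config in step 4). Here is a quick sketch for eyeballing them with pandas; nothing is assumed about the column names, and the pyscenedetect_frames choice is just an example.

# peek_training_csv.py: hypothetical helper; swap the frame directory for yours.
import pandas as pd

for name in ("processed_data_train.csv", "processed_data_validate.csv"):
    df = pd.read_csv(f"data/train_val/pyscenedetect_frames/{name}")
    print(name, df.shape)   # number of rows and columns
    print(df.head(3))       # first few rows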

4. Finetune Model or Download Already Finetuned Models

Download an already finetuned model:

OneDrive
GCloud bucket (faster but costs $$)

Finetune your own:

Do this for ONE selected sampling method using the following.

Alternatively, you can call ./runner.sh, which should contain everything you need and will reflect the last data you ran the training command builder on.

Be sure to swap out {FRAME DIRECTORY HERE} below for the directory containing your sampled frames; validation_annotations_json is the path to the annotations file.

python -m generativeimage2text.finetune -p '{
    "type": "train",
    "model_name": "GIT_BASE",
    "model_path": "model.pt",
    "batch_size": 3,
    "epochs": 20,
    "train_csv": "data/train_val/{FRAME DIRECTORY HERE}/processed_data_train.csv",
    "validation_csv": "data/train_val/{FRAME DIRECTORY HERE}/processed_data_validate.csv",
    "validation_annotations_json": "data/train_val/train_val_videodatainfo.json"
}'
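
If you would rather drive finetuning from Python (for example to swap sampling methods without editing the JSON by hand), here is a minimal sketch that builds the same parameter blob shown above and launches the module with it; run_finetune.py and the pyscenedetect_frames choice are assumptions.

# run_finetune.py: hypothetical wrapper around the command above.
import json
import subprocess
import sys

frame_dir = "pyscenedetect_frames"  # or random_frames / transnet_frames
params = {
    "type": "train",
    "model_name": "GIT_BASE",
    "model_path": "model.pt",
    "batch_size": 3,
    "epochs": 20,
    "train_csv": f"data/train_val/{frame_dir}/processed_data_train.csv",
    "validation_csv": f"data/train_val/{frame_dir}/processed_data_validate.csv",
    "validation_annotations_json": "data/train_val/train_val_videodatainfo.json",
}
# json.dumps reproduces the same JSON text that -p receives above
subprocess.run(
    [sys.executable, "-m", "generativeimage2text.finetune", "-p", json.dumps(params)],
    check=True,
)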

5. Run Inference

On the test set:

python -m generativeimage2text.vc_inference -p "{'type': 'multi_video_inference', 'videos_csv': '', 'annotations_json_path': '', 'model_path':'./msrvtt_model_epoch1.pt', 'model_name':'GIT_BASE', 'predictions_file':None}"

On multiple models (all checkpoints in a model directory):

python -m generativeimage2text.vc_inference -p "{'type': 'multi_video_inference_dir', 'videos_csv': '', 'annotations_json_path': '', 'model_dir':'./model_transnet', 'model_name':'GIT_BASE'}"
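
To run the directory-based inference for each sampling method in one go, here is a small sketch; only ./model_transnet is named above, and the other two directory names are hypothetical. Note that the -p value above is a Python-style dict literal (single quotes, None), which str() of a dict reproduces.

# run_inference.py: hypothetical wrapper around the command above.
import subprocess
import sys

for model_dir in ("./model_transnet", "./model_random", "./model_pyscenedetect"):
    params = {
        "type": "multi_video_inference_dir",
        "videos_csv": "",
        "annotations_json_path": "",
        "model_dir": model_dir,
        "model_name": "GIT_BASE",
    }
    # str(params) yields the same single-quoted dict-literal format used above
    subprocess.run(
        [sys.executable, "-m", "generativeimage2text.vc_inference", "-p", str(params)],
        check=True,
    )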

Inference Results

Resources Created:

FAQ


Below this point is the original README.


Introduction

This repo presents example code to reproduce some results in GIT: A Generative Image-to-text Transformer for Vision and Language.

Installation

Inference

Training

The repo shows the key code path for constructing the network input with transformations and running the forward/backward pass. The code can easily be plugged into any trainer. Here is the example for the base model.

ImageNet

Class ID to unique readable names

Citation

Please consider citing the following reference if it helps.

@article{wang2022git,
  title={GIT: A Generative Image-to-text Transformer for Vision and Language},
  author={Wang, Jianfeng and Yang, Zhengyuan and Hu, Xiaowei and Li, Linjie and Lin, Kevin and Gan, Zhe and Liu, Zicheng and Liu, Ce and Wang, Lijuan},
  journal={arXiv preprint arXiv:2205.14100},
  year={2022}
}

Acknowledgement

Part of the code is based on transformers, clip, maskrcnn-benchmark, oscar, virtex.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.