UCF-SST-Lab/AICity-2024-Track2-CVPRW

This is the open-source code for AI City Challenge Track 2: Traffic Safety Description and Analysis.
MIT License

Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

This repository contains the code to reproduce the results of our paper "Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis" (CVPRW 2024, AI City Challenge Track 2).

Table of Contents:

  1. Preparation
  2. Feature Data
  3. Training Dense Video Captioning
  4. Submission File Preparation
  5. Performance
  6. Citation
  7. Acknowledgement

Preparation

  1. Clone the repo

    git clone --recursive https://github.com/UCF-SST-Lab/AICity-2024-Track2-CVPRW
  2. Create a virtual environment with conda (a quick sanity check for the environment is sketched after this list)

    conda create -n PDVC python=3.7
    source activate PDVC
    conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
    conda install ffmpeg
    pip install -r requirement.txt
  3. Compile the deformable attention layer (requires GCC >= 5.4).

    cd pdvc/ops
    sh make.sh
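
Before training, it helps to confirm that the pinned PyTorch/CUDA combination is actually what landed in the environment. A minimal sanity check, assuming the PDVC env created above is active:

    import torch

    # Versions pinned by the install commands above.
    print(torch.__version__)          # expect 1.7.1
    print(torch.version.cuda)         # expect 10.1
    print(torch.cuda.is_available())  # must be True to use the compiled CUDA ops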

Feature Data

The CLIP features (Training/Test) extracted from BDD and WTS can be downloaded via Google Drive
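
To sanity-check a downloaded feature file, something like the following works, assuming each video's CLIP features are stored as one `.npy` array (the file name below is hypothetical; the exact shape and dtype depend on the extraction pipeline):

    import numpy as np

    # Hypothetical file name; point this at any file from the download.
    feats = np.load("features/video_0001.npy")
    print(feats.shape, feats.dtype)  # e.g. (num_frames, feature_dim)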

Training Dense Video Captioning

Train and evaluate models from the command line

# Training
config_path=cfgs/bdd_veh_clip_pdvcl.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID} --epoch=30
# The script will evaluate the model given specified evaluation epochs. The results and logs are saved in `./save`.

# Evaluation
eval_folder=bdd_eval # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
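
After a run completes, results and logs accumulate under `./save`. A small helper to see what a run produced (the directory layout inside `./save` is whatever `train.py` writes for the chosen config, not something fixed here):

    from pathlib import Path

    # List everything train.py / eval.py wrote under ./save.
    for p in sorted(Path("save").rglob("*")):
        print(p)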

Train and evaluate models with the provided bash script

bash run.sh

Note: in the bash script, `--load=save/XXX` must be updated to point to the folder containing the trained models.

Submission File Preparation

python formatting_submission.py
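
A quick way to confirm the generated submission parses as valid JSON before uploading; the output file name below is a hypothetical placeholder, so check `formatting_submission.py` for the path it actually writes:

    import json

    # "submission.json" is a hypothetical name; substitute the file the
    # formatting script actually produces.
    with open("submission.json") as f:
        sub = json.load(f)
    print(type(sub).__name__)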

Performance

Model       Features  Data  BLEU4   METEOR  ROUGE-L  CIDEr   S2       config_path
PDVC_light  CLIP      BDD   0.2102  0.4435  0.4705   0.8698  30.2821  cfgs/bdd_xxx_clip_pdvcl.yml
PDVC_light  CLIP      WTS   0.2005  0.4115  0.4416   0.5573  27.7347  cfgs/train_wts_xxx_xxx_pdvcl_finetune.yml

Citation

If you find this repo helpful, please consider citing:

@article{shoman2024enhancing,
  title={Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis},
  author={Shoman, Maged and Wang, Dongdong and Aboah, Armstrong and Abdel-Aty, Mohamed},
  journal={arXiv preprint arXiv:2404.08229},
  year={2024}
}
@article{wang20248th,
  title={The 8th AI City Challenge},
  author={Wang, Shuo and Anastasiu, David C and Tang, Zheng and Chang, Ming-Ching and Yao, Yue and Zheng, Liang and Rahman, Mohammed Shaiqur and Arya, Meenakshi S and Sharma, Anuj and Chakraborty, Pranamesh and others},
  journal={arXiv preprint arXiv:2404.09432},
  year={2024}
}

Acknowledgement

The implementation of PDVC is adapted from the official PDVC codebase.
The video feature extraction is adapted from FrozenBiLM.
The Deformable Transformer implementation is based mainly on Deformable DETR.
The captioning head is based on ImageCaptioning.pytorch. We thank the authors for their efforts.