🔮 Welcome to the official code repository for CG-STVG: Context-Guided Spatio-Temporal Video Grounding. We're excited to share our work with you; please bear with us while we prepare the code. Stay tuned for the reveal!
💡 A picture is worth a thousand words!
Can we explore visual context from videos to enhance target localization for STVG? Yes!
Figure: Illustration of and comparison between (a) existing methods, which localize the target using object information from the text query, and (b) our CG-STVG, which enjoys both object information from the text query and guidance from instance context for STVG.
Figure: Overview of our method, which consists of a multimodal encoder for feature extraction and a context-guided decoder that cascades a set of decoding stages for grounding. In each decoding stage, instance context is mined (by ICG and ICR) to guide query learning for better localization. Please see the paper for more details.
The datasets used are placed in the data folder with the following structure:
data
|_ vidstg
| |_ videos
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ vstg_annos
| | |_ train.json
| | |_ ...
| |_ sent_annos
| | |_ train_annotations.json
| | |_ ...
| |_ data_cache
| | |_ ...
|_ hc-stvg2
| |_ v2_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ hcstvg_v2
| | | |_ train.json
| | | |_ test.json
| |_ data_cache
| | |_ ...
|_ hc-stvg
| |_ v1_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ hcstvg_v1
| | | |_ train.json
| | | |_ test.json
| |_ data_cache
| | |_ ...
The download links for the above-mentioned data are as follows:
hc-stvg: v1_video, annos, data_cache
hc-stvg2: v2_video, annos, data_cache
vidstg: videos, vstg_annos, sent_annos, data_cache
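As a quick sanity check after downloading (a minimal sketch based only on the layout shown above; adjust the paths if your setup differs), you can verify that the expected folders are in place:
# check that the dataset folders match the layout above
for d in data/vidstg/videos data/vidstg/vstg_annos data/vidstg/sent_annos data/vidstg/data_cache \
  data/hc-stvg2/v2_video data/hc-stvg2/annos data/hc-stvg2/data_cache \
  data/hc-stvg/v1_video data/hc-stvg/annos data/hc-stvg/data_cache; do
  [ -d "$d" ] && echo "ok: $d" || echo "missing: $d"
done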
The pretrained models are placed in the model_zoo folder:
ResNet-101, VidSwin-T, roberta-base
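After downloading, listing the folder should show the three entries above (only a sketch; the exact file layout inside each entry may differ):
# the model_zoo folder should contain the three pretrained models listed above
ls model_zoo
# expected entries (as listed above): ResNet-101  VidSwin-T  roberta-base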
The code has been tested and verified with PyTorch 2.0.1 and CUDA 11.7, though other versions are also likely to be compatible. To install the necessary requirements, please use the commands below:
pip3 install -r requirements.txt
apt install ffmpeg -y
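Before launching training, it can be useful to confirm that the installed PyTorch build sees CUDA (a generic check, not specific to this repo):
# verify the PyTorch / CUDA setup
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"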
To train the model, please use the scripts provided below:
# run for HC-STVG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/hcstvg.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/hcstvg \
TENSORBOARD_DIR output/hcstvg
# run for HC-STVG2
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/hcstvg2.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/hcstvg2 \
TENSORBOARD_DIR output/hcstvg2
# run for VidSTG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/vidstg.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/vidstg \
TENSORBOARD_DIR output/vidstg
For additional training options, such as utilizing different hyper-parameters, please adjust the configurations as needed: experiments/hcstvg.yaml, experiments/hcstvg2.yaml, and experiments/vidstg.yaml.
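For example, to launch HC-STVG training on 2 GPUs instead of 8, lower --nproc_per_node as below (a sketch; if the yaml configs define batch-size or learning-rate settings tied to 8 GPUs, they may need a matching adjustment):
# run for HC-STVG on 2 GPUs
python3 -m torch.distributed.launch \
--nproc_per_node=2 \
scripts/train_net.py \
--config-file "experiments/hcstvg.yaml" \
INPUT.RESOLUTION 420 \
OUTPUT_DIR output/hcstvg_2gpu \
TENSORBOARD_DIR output/hcstvg_2gpu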
To evaluate a trained model, please use the scripts provided below:
# run for HC-STVG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/hcstvg.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT [Pretrained Model Weights] \
OUTPUT_DIR output/hcstvg
# run for HC-STVG2
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/hcstvg2.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT [Pretrained Model Weights] \
OUTPUT_DIR output/hcstvg2
# run for VidSTG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/vidstg.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT [Pretrained Model Weights] \
OUTPUT_DIR output/vidstg
We provide our trained checkpoints for reproducibility of the results.
Dataset | Resolution | URL | m_vIoU/vIoU@0.3/vIoU@0.5 | Size |
---|---|---|---|---|
HC-STVG | 420 | Model | 38.4/61.5/36.3 | 3.4 GB |
HC-STVG2 | 420 | Model | 39.5/64.5/36.3 | 3.4 GB |
VidSTG | 420 | Model | 34.0/47.7/33.1 | 3.4 GB |
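For example, to reproduce the HC-STVG row above, download the corresponding checkpoint and point MODEL.WEIGHT at the saved file (the path below is only a placeholder for wherever you stored it):
# evaluate the downloaded HC-STVG checkpoint
# NOTE: the MODEL.WEIGHT path is a placeholder; replace it with your local checkpoint file
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/hcstvg.yaml" \
INPUT.RESOLUTION 420 \
MODEL.WEIGHT model_zoo/cg-stvg_hcstvg.pth \
OUTPUT_DIR output/hcstvg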
🎏 CG-STVG achieves state-of-the-art performance on three challenging benchmarks, namely HCSTVG-v1, HCSTVG-v2, and VidSTG, as shown below. Note that the baseline is our CG-STVG without context generation and refinement.
Results on HCSTVG-v1:
Methods | M_tIoU | m_vIoU | vIoU@0.3 | vIoU@0.5 |
---|---|---|---|---|
STGVT[TCSVT'2021] | - | 18.2 | 26.8 | 9.5 |
STVGBert[ICCV'2021] | - | 20.4 | 29.4 | 11.3 |
TubeDETR[CVPR'2022] | 43.7 | 32.4 | 49.8 | 23.5 |
STCAT[NeurIPS'2022] | 49.4 | 35.1 | 57.7 | 30.1 |
CSDVL[CVPR'2023] | - | 36.9 | 62.2 | 34.8 |
Baseline (ours) | 50.4 | 36.5 | 58.6 | 32.3 |
CG-STVG (ours) | 52.8(+2.4) | 38.4(+1.9) | 61.5(+2.9) | 36.3(+4.0) |
Results on HCSTVG-v2:
Methods | M_tIoU | m_vIoU | vIoU@0.3 | vIoU@0.5 |
---|---|---|---|---|
PCC[arxiv'2021] | - | 30.0 | - | - |
2D-Tan[arxiv'2021] | - | 30.4 | 50.4 | 18.8 |
MMN[AAAI'2022] | - | 30.3 | 49.0 | 25.6 |
TubeDETR[CVPR'2022] | - | 36.4 | 58.8 | 30.6 |
CSDVL[CVPR'2023] | 58.1 | 38.7 | 65.5 | 33.8 |
Baseline (ours) | 58.6 | 37.8 | 62.4 | 32.1 |
CG-STVG (ours) | 60.0(+1.4) | 39.5(+1.7) | 64.5(+2.1) | 36.3(+4.2) |
Results on VidSTG (Dec. = Declarative Sentences, Int. = Interrogative Sentences):
Methods | M_tIoU (Dec.) | m_vIoU (Dec.) | vIoU@0.3 (Dec.) | vIoU@0.5 (Dec.) | M_tIoU (Int.) | m_vIoU (Int.) | vIoU@0.3 (Int.) | vIoU@0.5 (Int.) |
---|---|---|---|---|---|---|---|---|
STGRN[CVPR'2020] | 48.5 | 19.8 | 25.8 | 14.6 | 47.0 | 18.3 | 21.1 | 12.8 |
OMRN[IJCAI'2020] | 50.7 | 23.1 | 32.6 | 16.4 | 49.2 | 20.6 | 28.4 | 14.1 |
STGVT[TCSVT'2021] | - | 21.6 | 29.8 | 18.9 | - | - | - | - |
STVGBert[ICCV'2021] | - | 24.0 | 30.9 | 18.4 | - | 22.5 | 26.0 | 16.0 |
TubeDETR[CVPR'2022] | 48.1 | 30.4 | 42.5 | 28.2 | 46.9 | 25.7 | 35.7 | 23.2 |
STCAT[NeurIPS'2022] | 50.8 | 33.1 | 46.2 | 32.6 | 49.7 | 28.2 | 39.2 | 26.6 |
CSDVL[CVPR'2023] | - | 33.7 | 47.2 | 32.8 | - | 28.5 | 39.9 | 26.2 |
Baseline (ours) | 49.7 | 32.4 | 45.0 | 31.4 | 48.8 | 27.7 | 38.7 | 25.6 |
CG-STVG (ours) | 51.4 (+1.7) | 34.0 (+1.6) | 47.7 (+2.7) | 33.1 (+1.7) | 49.9 (+1.1) | 29.0 (+1.3) | 40.5 (+1.8) | 27.5 (+1.9) |
This repo is partly based on the open-source release from STCAT, and the evaluation metric implementation is borrowed from TubeDETR for a fair comparison.
⭐ If you find this repository useful, please consider giving it a star and citing it:
@inproceedings{gu2024context,
title={Context-Guided Spatio-Temporal Video Grounding},
author={Gu, Xin and Fan, Heng and Huang, Yan and Luo, Tiejian and Zhang, Libo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18330--18339},
year={2024}
}