Official code for our ICLR2023 paper: "Compositional Prompt Tuning with Motion Cues for Open-Vocabulary Video Relation Detection"
openreview link
For checkpoints, refer to the links in this issue.
Python >= 3.7, PyTorch >= 1.7
transformers == 4.11.3 (newer versions also work but might require some modifications to ALPRO's code; refer to Line 872 in Alpro_modeling/xbert.py)
For other basic packages, just run the project and install whatever is missing.
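As a quick sanity check of the requirements above, the following sketch compares version strings against the stated minimums using only the standard library. The helpers `parse_version` and `meets_minimum` are illustrative names and are not part of this repo; the minimum versions are the ones listed above.

```python
import re
import sys

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"python": "3.7", "torch": "1.7"}


def parse_version(text):
    """Turn '1.7.1+cu110' into (1, 7, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    print("python ok:", meets_minimum(python_version, MINIMUMS["python"]))
    try:
        import torch  # only checked if it is already installed
        print("torch ok:", meets_minimum(torch.__version__, MINIMUMS["torch"]))
    except ImportError:
        print("torch is not installed")
```

Versions are compared as integer tuples rather than strings, so that, for example, 1.10 is correctly treated as newer than 1.7.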
The raw video data (.mp4 files) is not required to run this repo. We provide the pre-prepared traj data (including bboxes and features).
Overview: there are 3 types of data: traj bbox, traj RoI features (2048-d), and traj embds (256-d).
Each of the above data types includes gt and det, i.e., ground-truth traj bboxes and detection traj bboxes, with their features/embds. (Of course, we don't need Seq-NMS to perform tracking for gt.)
Please refer to this repo VidSGG-TrajDataPrepare for how to prepare the above traj data.
In detail, there are the following files (where data0/ refers to /home/gkf/project/):
object category text embedding: vidvrd_ObjTextEmbeddings.pth
corresponding to (c.t.) data0/VidVRD-OpenVoc/prepared_data/vidvrd_ObjTextEmbeddings.pth
traj bbox
vidvrd_traj_box_gt.zip, c.t. data0/scene_graph_benchmark/output/VidVRDtest_tracking_results_gt
vidvrd_traj_box_gt_trainset.zip, c.t. data0/scene_graph_benchmark/output/VidVRD_tracking_results_gt
vidvrd_traj_box_det.zip, c.t. data0/VidVRD-II/tracklets_results/VidVRD_segment30_tracking_results
vidvrd_traj_box_det_th-15-5.zip, c.t. data0/VidVRD-OpenVoc/vidvrd_traj_box_det_th-15-5.zip
traj RoI features (2048-d)
vidvrd_traj_roi_gt.zip, c.t. data0/scene_graph_benchmark/output/VidVRDtest_gt_traj_features_seg30
vidvrd_traj_roi_gt_trainset, c.t. data0/scene_graph_benchmark/output/VidVRD_gt_traj_features_seg30
vidvrd_traj_roi_det.zip, c.t. data0/scene_graph_benchmark/output/VidVRD_traj_features_seg30
vidvrd_traj_roi_det_th-15-5.zip, c.t. data0/scene_graph_benchmark/output/VidVRD_traj_features_seg30_th-15-5
traj embds (256-d; these are all filtered by th-15-5)
vidvrd_traj_emb_gt.zip, c.t. data0/ALPRO/extract_features_output/VidVRDtest_seg30_TrajFeatures256_gt
vidvrd_traj_emb_gt_trainset.zip, c.t. data0/ALPRO/extract_features_output/vidvrd_seg30_TrajFeatures256_gt
vidvrd_traj_emb_det.zip, c.t. data0/ALPRO/extract_features_output/vidvrd_seg30_TrajFeatures256
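The archive-to-directory correspondence listed above can be kept in a small lookup table, so each download ends up where the code expects it. This is only a sketch: `ARCHIVE_TO_DIR` and `target_dir` are illustrative names, not part of this repo; the entries are copied from the list above (traj bbox and traj embds only, for brevity); and `data_root` stands for whatever directory plays the role of data0/ on your machine. Whether an archive unpacks directly into its target directory depends on its internal layout, which is not assumed here.

```python
import os

# Archive name -> directory relative to data0/, copied from the list above.
ARCHIVE_TO_DIR = {
    "vidvrd_traj_box_gt.zip": "scene_graph_benchmark/output/VidVRDtest_tracking_results_gt",
    "vidvrd_traj_box_gt_trainset.zip": "scene_graph_benchmark/output/VidVRD_tracking_results_gt",
    "vidvrd_traj_box_det.zip": "VidVRD-II/tracklets_results/VidVRD_segment30_tracking_results",
    "vidvrd_traj_emb_gt.zip": "ALPRO/extract_features_output/VidVRDtest_seg30_TrajFeatures256_gt",
    "vidvrd_traj_emb_gt_trainset.zip": "ALPRO/extract_features_output/vidvrd_seg30_TrajFeatures256_gt",
    "vidvrd_traj_emb_det.zip": "ALPRO/extract_features_output/vidvrd_seg30_TrajFeatures256",
}


def target_dir(archive_name, data_root):
    """Return the directory under data_root that an archive corresponds to."""
    return os.path.join(data_root, *ARCHIVE_TO_DIR[archive_name].split("/"))
```

For example, `target_dir("vidvrd_traj_box_det.zip", "/home/gkf/project")` resolves to the VidVRD_segment30_tracking_results folder under VidVRD-II/tracklets_results.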
Download the above data and organize it as follows, e.g.,
data0/
| ALPRO/-------------------------------------------------------------------------------------------------------------(num_folders:1, num_files=0),num_videos=0
| | extract_features_output/---------------------------------------------------------------------------------------(num_folders:3, num_files=1),num_videos=0
| | | VidVRDtest_seg30_TrajFeatures256_gt/------------------------------------------------------------------(num_folders:0, num_files=2884),num_videos=200
| | | vidvrd_seg30_TrajFeatures256/-----------------------------------------------------------------------(num_folders:0, num_files=18348),num_videos=1000
| | | vidvrd_seg30_TrajFeatures256_gt/----------------------------------------------------------------------(num_folders:0, num_files=5855),num_videos=800
| scene_graph_benchmark/---------------------------------------------------------------------------------------------(num_folders:1, num_files=0),num_videos=0
| | output/--------------------------------------------------------------------------------------------------------(num_folders:6, num_files=0),num_videos=0
| | | VidVRD_gt_traj_features_seg30/------------------------------------------------------------------------(num_folders:0, num_files=5855),num_videos=800
| | | VidVRD_traj_features_seg30_th-15-5/-----------------------------------------------------------------(num_folders:0, num_files=18348),num_videos=1000
| | | VidVRD_traj_features_seg30/-------------------------------------------------------------------------(num_folders:0, num_files=18348),num_videos=1000
| | | VidVRDtest_gt_traj_features_seg30/--------------------------------------------------------------------(num_folders:0, num_files=2884),num_videos=200
| | | VidVRDtest_tracking_results_gt/-----------------------------------------------------------------------(num_folders:0, num_files=2884),num_videos=200
| | | VidVRD_tracking_results_gt/---------------------------------------------------------------------------(num_folders:0, num_files=5855),num_videos=800
| VidVRD-II/---------------------------------------------------------------------------------------------------------(num_folders:1, num_files=0),num_videos=0
| | tracklets_results/---------------------------------------------------------------------------------------------(num_folders:2, num_files=0),num_videos=0
| | | VidVRD_segment30_tracking_results_th-15-5/----------------------------------------------------------(num_folders:0, num_files=18348),num_videos=1000
| | | VidVRD_segment30_tracking_results/------------------------------------------------------------------(num_folders:0, num_files=18348),num_videos=1000
| VidVRD_VidOR/------------------------------------------------------------------------------------------------------(num_folders:2, num_files=0),num_videos=0
| | vidvrd-dataset/------------------------------------------------------------------------------------------------(num_folders:2, num_files=0),num_videos=0
| | | train/-------------------------------------------------------------------------------------------------(num_folders:0, num_files=800),num_videos=800
| | | test/--------------------------------------------------------------------------------------------------(num_folders:0, num_files=200),num_videos=200
| | vidor-dataset/-------------------------------------------------------------------------------------------------(num_folders:0, num_files=0),num_videos=0
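After downloading, the layout can be checked against the file counts in the tree above. The following is a minimal sketch, not a script shipped with this repo: `check_layout` is a hypothetical helper, `EXPECTED_NUM_FILES` copies a subset of the counts from the tree, and `data_root` is whatever directory plays the role of data0/.

```python
import os

# Expected file counts, copied from the directory tree above (subset).
EXPECTED_NUM_FILES = {
    "ALPRO/extract_features_output/VidVRDtest_seg30_TrajFeatures256_gt": 2884,
    "ALPRO/extract_features_output/vidvrd_seg30_TrajFeatures256": 18348,
    "ALPRO/extract_features_output/vidvrd_seg30_TrajFeatures256_gt": 5855,
    "VidVRD_VidOR/vidvrd-dataset/train": 800,
    "VidVRD_VidOR/vidvrd-dataset/test": 200,
}


def check_layout(data_root, expected=EXPECTED_NUM_FILES):
    """Return {relative_dir: (found, expected)} for every mismatching folder."""
    problems = {}
    for rel_dir, num_expected in expected.items():
        full = os.path.join(data_root, *rel_dir.split("/"))
        if os.path.isdir(full):
            # Count regular files only; sub-folders are ignored.
            found = sum(1 for entry in os.scandir(full) if entry.is_file())
        else:
            found = 0  # a missing folder is reported as containing no files
        if found != num_expected:
            problems[rel_dir] = (found, num_expected)
    return problems
```

An empty result means every listed folder exists and holds the expected number of files; otherwise the returned dict shows which folders are incomplete.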
We back up the video data here in case the official link does not work.
Pre-prepared traj data (MEGA cloud link). It contains the following files:
traj bbox
data0/VidVRD-II/tracklets_results/VidORtrainVideoLevel_tracking_results_gt_th-15-5
data0/VidVRD-II/tracklets_results/VidORtrainVideoLevel_tracking_results_th-15-5
(is unzipping on SMU-server) data0/VidVRD-II/tracklets_results/VidORvalVideoLevel_tracking_results_gt
data0/VidVRD-II/tracklets_results/VidORvalVideoLevel_tracking_results_th-15-5
NOTE: the traj_gt data on the val set is for evaluation on SGCls & PredCls, and it is not filtered by th-15-5 (in order to get high recall). For traj data on the train set (both det & gt), we apply th-15-5 filtering to get high-quality training samples. (The same applies below.)
traj RoI features (2048-d)
data0/scene_graph_benchmark/output/VidORtrain_gt_traj_features_th-15-5
data0/scene_graph_benchmark/output/VidORtrain_traj_features_th-15-5
data0/scene_graph_benchmark/output/VidORval_gt_traj_features
data0/scene_graph_benchmark/output/VidORval_traj_features_th-15-5
traj embds (256-d)
data0/ALPRO/extract_features_output/VidOR_TrajFeatures256_gt_th-15-5
data0/ALPRO/extract_features_output/VidOR_TrajFeatures256_th-15-5
(is uploading to OneDrive) data0/ALPRO/extract_features_output/VidORval_TrajFeatures256_gt
data0/ALPRO/extract_features_output/VidORval_TrajFeatures256
NOTEs:
The det traj data (bbox & RoI features & embds) on the train set is only used for the TrajCls module. For the RelationCls module, we use gt traj data for training.
This is because: 1) the det_traj data on the train set is very dense and heavy, and consumes too much computation; 2) the det_traj data contains many low-quality samples (even after th-15-5 filtering); 3) VidOR's training set is very large (7k videos), and the gt traj data is almost enough for training (unlike VidVRD, which requires det data).
Using det_traj data as a further training supplement is left as future work.
TrajCls_VidOR.zip, here
First add the env path:
export PYTHONPATH=$PYTHONPATH:"/your/path/OpenVoc-VidVRD/"
Refer to the commands in tools/train_traj_cls_both.py, for both the VidVRD & VidOR datasets, e.g.,
CUDA_VISIBLE_DEVICES=3 python tools/train_traj_cls_both.py \
--dataset_class VidVRDTrajDataset \
--model_class OpenVocTrajCls_NoBgEmb \
--cfg_path experiments/TrajCls_VidVRD/NoBgEmb/cfg_.py \
--output_dir experiments/TrajCls_VidVRD/NoBgEmb \
--save_tag bs128
NOTE:
We segment each video into segments (30 frames per segment) with a 15-frame overlap.
For VidVRD, we assign labels in the VidVRDTrajDataset, and the dataloader's loop is w.r.t. segment.
But for VidOR, see tools/VidOR_label_assignment.py for your reference.
Refer to the commands in tools/eval_traj_cls_both.py
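The segmentation mentioned in the NOTE on training (30-frame segments with a 15-frame overlap) can be sketched as follows. This is an illustration rather than the repo's own code: `segment_ranges` is a hypothetical helper, the end index is assumed to be exclusive, and dropping an incomplete tail segment is an assumption as well.

```python
def segment_ranges(num_frames, seg_len=30, overlap=15):
    """Return (start, end) frame ranges, end exclusive, for one video.

    Assumption: a tail shorter than seg_len is dropped.
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    stride = seg_len - overlap  # 15 frames with the default values
    ranges = []
    start = 0
    while start + seg_len <= num_frames:
        ranges.append((start, start + seg_len))
        start += stride
    return ranges


if __name__ == "__main__":
    print(segment_ranges(60))  # [(0, 30), (15, 45), (30, 60)]
```

With these defaults, consecutive segments start 15 frames apart, so every frame except those in the first and last 15 frames of the covered range belongs to two segments.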
Refer to the commands in tools/VidVRD_label_assignment.py, e.g.,
python tools/VidVRD_label_assignment.py \
--traj_len_th 15 \
--min_region_th 5 \
--vpoi_th 0.9 \
--cache_tag PredSplit_v2_FullySupervise \
--is_save
Refer to the commands in tools/train_relation_cls.py for other settings (ablation studies), e.g.,
### Table-2 (RePro with both base and novel training data) (RePro_both_BaseNovel_training)
# stage-1 (A-100, 24G memory; 50 epochs, 3.5 hours in total)
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=1 python tools/train_relation_cls.py \
--use_gt_only_data \
--model_class AlproPromptTrainer_Grouped \
--train_dataset_class VidVRDGTDatasetForTrain_GIoU \
--eval_dataset_class VidVRDUnifiedDataset_GIoU \
--cfg_path experiments/RelationCls_VidVRD/RePro_both_BaseNovel_training/stage1/cfg_.py \
--output_dir experiments/RelationCls_VidVRD/RePro_both_BaseNovel_training/stage1/ \
--eval_split_traj all \
--eval_split_pred all \
--save_tag bsz32
# stage-2 (A-100, 15G memory (about 14791 M); 50 epochs, 2.5 hours in total)
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python tools/train_relation_cls.py \
--model_class OpenVocRelCls_stage2_Grouped \
--train_dataset_class VidVRDUnifiedDataset_GIoU \
--eval_dataset_class VidVRDUnifiedDataset_GIoU \
--cfg_path experiments/RelationCls_VidVRD/RePro_both_BaseNovel_training/stage2/cfg_.py \
--output_dir experiments/RelationCls_VidVRD/RePro_both_BaseNovel_training/stage2/ \
--save_tag bsz32
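When several settings need to be launched from a script, the stage-1 command above can be assembled in Python. This is a minimal sketch that assumes it is run from the repo root. `build_stage1_command` and `run` are hypothetical helpers, not part of this repo, while the flags and values are copied from the stage-1 command above.

```python
import os
import subprocess
import sys


def build_stage1_command(exp_dir, save_tag="bsz32"):
    """Return the argv list for the stage-1 training command shown above."""
    return [
        sys.executable, "tools/train_relation_cls.py",
        "--use_gt_only_data",
        "--model_class", "AlproPromptTrainer_Grouped",
        "--train_dataset_class", "VidVRDGTDatasetForTrain_GIoU",
        "--eval_dataset_class", "VidVRDUnifiedDataset_GIoU",
        "--cfg_path", exp_dir + "/cfg_.py",
        "--output_dir", exp_dir + "/",
        "--eval_split_traj", "all",
        "--eval_split_pred", "all",
        "--save_tag", save_tag,
    ]


def run(command, gpu_id):
    """Launch the command with the same environment variables as above."""
    env = dict(os.environ, TOKENIZERS_PARALLELISM="false",
               CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.run(command, env=env, check=True)
```

For the Table-2 setting, `exp_dir` would be experiments/RelationCls_VidVRD/RePro_both_BaseNovel_training/stage1, and `run(build_stage1_command(exp_dir), gpu_id=1)` mirrors the shell command above.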
Refer to tools/eval_relation_cls.py for different test settings.