Authors: Shutao Li, Bin Li, Bin Sun, Yixuan Weng
[Contact] If you have any questions, feel free to contact me via email.
This repository contains code, models, and other related resources of our paper "Towards Visual-Prompt Temporal Answering Grounding in Instructional Video".
Welcome to the VPTSL project. We design a text span-based predictor, in which the input text question, video subtitles, and visual prompt features are jointly learned with a pre-trained language model to enhance the joint semantic representations.
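The sketch below illustrates this idea under stated assumptions: a DeBERTa backbone, 1024-dimensional clip features, and a linear projection that turns the video features into "visual prompt" tokens prepended to the text sequence. The class name, `visual_dim`, and the projection layer are illustrative, not the released implementation.

```python
# Minimal sketch (not the released code) of a text span-based predictor:
# question + subtitle tokens and projected visual prompt features are fed to
# one pre-trained language model, which predicts start/end logits for the span.
import torch
import torch.nn as nn
from transformers import AutoModel

class SpanPredictorSketch(nn.Module):
    def __init__(self, plm_name="microsoft/deberta-v3-base", visual_dim=1024):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        hidden = self.encoder.config.hidden_size
        # project video features into the PLM embedding space ("visual prompt")
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.span_head = nn.Linear(hidden, 2)  # start / end logits

    def forward(self, input_ids, attention_mask, visual_feats, visual_mask):
        text_emb = self.encoder.get_input_embeddings()(input_ids)   # (B, Tt, H)
        vis_emb = self.visual_proj(visual_feats)                    # (B, Tv, H)
        inputs_embeds = torch.cat([vis_emb, text_emb], dim=1)
        mask = torch.cat([visual_mask, attention_mask], dim=1)
        hidden_states = self.encoder(inputs_embeds=inputs_embeds,
                                     attention_mask=mask).last_hidden_state
        start_logits, end_logits = self.span_head(hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```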
pytorch (1.10.0), transformers (4.15.0), tqdm, accelerate, pandas, numpy, glob, sentencepiece
# preparing environment
sudo apt-get install gcc
sudo apt-get install make
wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda_11.5.1_495.29.05_linux.run
sudo sh cuda_11.5.1_495.29.05_linux.run
# preparing environment
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo chmod 777 Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n VPTSL python==3.7
conda activate VPTSL
# preparing environment
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install tqdm transformers==4.15.0 scikit-learn pandas numpy accelerate sentencepiece  # glob is part of the Python standard library
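Optionally, you can verify the GPU build of PyTorch and the installed library versions before moving on (a quick check, nothing project-specific):

```python
# quick sanity check: GPU-enabled PyTorch and installed library versions
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```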
You can choose one of them and unzip it:
MedVidQA.zip (copyright belongs to NIH)
cd ./data
unzip xxxxxx.zip
Then you will see the following dataset folder structure:
- data
  - MedVidQA
    - features
      - _1NofulQlHs.npy
      - _6csIJAWj_s.npy
      - ...
    - text
      - train.json
      - val.json
      - test.json
  - TutorialVQA
    - ...
  - VehicleVQA
    - ...
  - Coin
    - ...
  - Crosstalk
    - ...
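A quick way to inspect the unzipped data (a minimal sketch; the exact JSON schema and feature shape are dataset-specific and not documented here):

```python
import json
import numpy as np

# video features: one .npy file per video
# (shape is dataset-dependent, typically something like (num_clips, feature_dim))
feat = np.load("data/MedVidQA/features/_1NofulQlHs.npy")
print("feature shape:", feat.shape)

# text annotations: question / answer-span annotations per sample
with open("data/MedVidQA/text/train.json", encoding="utf-8") as f:
    train = json.load(f)
print("training samples:", len(train))
```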
bash run.sh
All our hyperparameters are saved in the `run.sh` file, so you can easily reproduce our best results.
python set_model.py --shape large
Our text encoder uses the [DeBERTa](https://huggingface.co/microsoft/deberta-v3-base) model (to support longer text); the other layers are initialized randomly.
You can choose an xsmall/small/base/large model for training.
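For reference, a model-preparation step like `set_model.py` might look roughly like the sketch below: download the chosen DeBERTa-v3 size and cache it locally. This is an assumption-based sketch mirroring the `--shape` flag above, not the repository's actual script, and the output paths are illustrative.

```python
# Illustrative sketch of a model-preparation step: fetch the chosen DeBERTa-v3
# size and its tokenizer, then save them to a local directory.
import argparse
from transformers import AutoModel, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--shape", choices=["xsmall", "small", "base", "large"], default="base")
args = parser.parse_args()

name = f"microsoft/deberta-v3-{args.shape}"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# hypothetical local cache directory
model.save_pretrained(f"./pretrained/deberta-v3-{args.shape}")
tokenizer.save_pretrained(f"./pretrained/deberta-v3-{args.shape}")
```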
python main.py --shape large \
--seed 42 \
--maxlen 1800 \
--epochs 32 \
--batchsize 4 \
--lr 1e-5 \
--highlight_hyperparameter 0.25 \
--loss_hyperparameter 0.1
In this phase, both training and testing are carried out. After each training epoch, the model is evaluated on the validation and test sets. In our paper, we report the model with the highest validation score together with its corresponding score on the test set.
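The two weights passed to `main.py` above suggest that a highlight loss is combined with the span loss. The sketch below shows one plausible weighted combination; it is only illustrative, the function and argument names are assumptions, and the exact formulation is defined in the paper and code, not here.

```python
# Illustrative sketch only: one plausible way a span loss and a highlight loss
# could be combined with a scalar weight like --loss_hyperparameter.
import torch.nn.functional as F

def joint_loss(start_logits, end_logits, start_pos, end_pos,
               highlight_logits, highlight_labels, loss_weight=0.1):
    # span loss: cross-entropy over start and end token positions
    span_loss = (F.cross_entropy(start_logits, start_pos)
                 + F.cross_entropy(end_logits, end_pos))
    # highlight loss: binary cross-entropy over per-token highlight labels
    highlight_loss = F.binary_cross_entropy_with_logits(highlight_logits,
                                                        highlight_labels)
    return span_loss + loss_weight * highlight_loss
```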
@ARTICLE{10552074,
author={Li, Shutao and Li, Bin and Sun, Bin and Weng, Yixuan},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Towards Visual-Prompt Temporal Answer Grounding in Instructional Video},
year={2024},
volume={},
number={},
pages={1-18},
keywords={Visualization;Task analysis;Thyroid;Feature extraction;Semantics;Grounding;Location awareness;Temporal answer grounding;instructional video;visual prompt;pre-trained language model},
doi={10.1109/TPAMI.2024.3411045}}
@INPROCEEDINGS{10096391,
author={Li, Bin and Weng, Yixuan and Sun, Bin and Li, Shutao},
booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Learning To Locate Visual Answer In Video Corpus Using Question},
year={2023},
volume={},
number={},
pages={1-5},
keywords={Location awareness;Training;Visualization;Natural languages;Signal processing;Predictive models;Benchmark testing;Video corpus;visual answer localization},
doi={10.1109/ICASSP49357.2023.10096391}}
@INPROCEEDINGS{10095026,
author={Weng, Yixuan and Li, Bin},
booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Visual Answer Localization with Cross-Modal Mutual Knowledge Transfer},
year={2023},
volume={},
number={},
pages={1-5},
keywords={Location awareness;Visualization;Semantics;Natural languages;Signal processing;Predictive models;Acoustics;Cross-modal;Mutual Knowledge Transfer;Visual Answer Localization},
doi={10.1109/ICASSP49357.2023.10095026}}