We take the VL-T5 model as the baseline.

**Note:** please go into the VL-T5 repository and follow the README there for the Pretrained Models and Feature Extraction steps.
## Setup

```bash
# Create python environment (optional)
conda create -n vsd python=3.7
source activate vsd

# Install python dependencies
pip install -r requirements.txt

# For captioning evaluation
python -c "import language_evaluation; language_evaluation.download('coco')"
```
Note: to run the SPICE evaluation, you need to switch the JDK to 1.8.
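How you switch depends on your system; on Debian/Ubuntu, for example, it might look like the following (the package name and JVM path are assumptions for that platform):

```bash
# Install OpenJDK 8 and make it the system default (Debian/Ubuntu example)
sudo apt-get install openjdk-8-jdk
sudo update-alternatives --config java   # interactively pick the 1.8 entry

# Or point JAVA_HOME at a 1.8 install for the current shell only
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"

java -version   # should report version "1.8.x"
```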
## Code structure

```
./datasets              <= stores images, features, and annotations
./feature_extraction    <= image feature extraction
./VLModel               <= trains VL-T5
    src/
        modeling_t5.py, modeling_bart.py  <= VL-T5/VL-BART model classes
        caption_sp.py, vrd_caption.py     <= fine-tuning
        param.py                          <= (argparse) configuration
        tokenization.py                   <= custom tokenizer
        utils.py, dist_utils.py           <= utility functions
    snap/                                 <= stores weight checkpoints
```
Download the pretrained checkpoints into `snap/` from Google Drive:

```bash
gdrive download 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph --recursive
```
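If the `gdrive` CLI is not available, the `gdown` package is a common alternative for fetching a shared Drive folder. This is a sketch assuming the ID above refers to a folder and that a recent `gdown` (which supports `--folder`) is installed:

```bash
pip install gdown

# Download the shared Drive folder into snap/ (assumes the ID is a folder ID)
gdown --folder 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph -O snap/
```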
## Training

When all the data, pretrained models, and image features are ready, you can train the model:

```bash
bash ./baseline.sh gpu_num
```
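For example, assuming the argument is the number of GPUs to train on (as the `gpu_num` placeholder suggests), a 4-GPU run would be:

```bash
bash ./baseline.sh 4
```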
## Citation

This repository contains the code and data for our EMNLP 2022 paper "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation". This repo is adapted from VL-T5.

Please cite our paper if you use our models or data in your project:
```bibtex
@inproceedings{zhao2022vsd,
  title     = {Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation},
  author    = {Yu Zhao and Jianguo Wei and Zhichao Lin and Yueheng Sun and Meishan Zhang and Min Zhang},
  booktitle = {EMNLP},
  year      = {2022}
}
```