
Visual Spatial Description

We take the VL-T5 model as the baseline.

Note: Please go into VLT5 and follow the README there for Pretrained Models and Feature Extraction.

Setup

# Create python environment (optional)
conda create -n vsd python=3.7
source activate vsd

# Install python dependencies
pip install -r requirements.txt

# For captioning evaluation
python -c "import language_evaluation; language_evaluation.download('coco')"

# To run the SPICE evaluation, you need to switch the JDK to 1.8
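
SPICE requires Java 8. Below is a minimal sketch of switching the active JDK by exporting JAVA_HOME, assuming a typical Linux layout; the install path is an example, not something this repo ships, so adjust it for your machine:

# Point JAVA_HOME at an existing JDK 1.8 installation (example path; adjust as needed)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"

# Verify the active JDK; the output should report version 1.8.x
java -version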

Code structure

# Store images, features, and annotations
./datasets

# Image feature extraction
./feature_extraction

# Train VL-T5
./VLModel/
    src/
        modeling_t5.py, modeling_bart.py                      <= VL-T5/VL-BART model classes
        caption_sp.py, vrd_caption.py                         <= fine-tuning
        param.py                                              <= (argparse) configuration
        tokenization.py                                       <= custom tokenizer
        utils.py, dist_utils.py                               <= utility functions
    snap/                                                     <= store weight checkpoints
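
Since param.py holds the argparse configuration, one way to see which options the fine-tuning scripts accept (a sketch assuming the scripts expose the standard argparse help; the repo does not state this explicitly) is:

# Print the available command-line options (standard argparse --help)
python VLModel/src/caption_sp.py --help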

Pretrained Models

Pre-extracted Image Features

Run

# When all the data, pretrained models, and image features are ready, you can train the model:
bash ./baseline.sh gpu_num
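
Here gpu_num presumably specifies how many GPUs to train on; for example:

# Example: launch training on 4 GPUs (assumes gpu_num is the GPU count)
bash ./baseline.sh 4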

Acknowledgement

This repo is adapted from VLT5.

Reference

This repository contains code and data for our paper Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation.

Please cite our paper if you use our models or data in your project.

@inproceedings{zhao2022vsd,
  title     = {Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text
               Generation},
  author    = {Yu Zhao and
               Jianguo Wei and
               Zhichao Lin and
               Yueheng Sun and
               Meishan Zhang and
               Min Zhang},
  booktitle = {EMNLP},
  year      = {2022}
}