PyTorch code and dataset for our ACM MM 2021 paper "State-aware Video Procedural Captioning" by Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, and Shinsuke Mori.
Video procedural captioning (VPC), which generates procedural text from instructional videos, is an essential task for scene understanding and real-world applications. The main challenge of VPC is to describe how to manipulate materials accurately. This paper focuses on this challenge by designing a new VPC task, generating a procedural text from the clip sequence of an instructional video and material list. In this task, the state of materials is sequentially changed by manipulations, yielding their state-aware visual representations (e.g., eggs are transformed into cracked, stirred, then fried forms). The essential difficulty is to convert such visual representations into textual representations; that is, a model should track the material states after manipulations to better associate the cross-modal relations. To achieve this, we propose a novel VPC method, which modifies an existing textual simulator for tracking material states as a visual simulator and incorporates it into a video captioning model. Our experimental results show the effectiveness of the proposed method, which outperforms state-of-the-art video captioning models. We further analyze the learned embedding of materials to demonstrate that the simulators capture their state transition.
Clone this repository
git clone https://github.com/misogil0116/svpc
cd svpc
Prepare feature files
Download features.tar.gz from Google Drive and extract it. The features/ directory stores ResNet + BN-Inception features for each video.
features
├── testing
├── training
├── validation
└── yc2
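As a quick sanity check after extraction, the following sketch verifies that the four subdirectories shown in the tree exist. The helper missing_feature_dirs is hypothetical and not part of this repository; it checks only the directory names, not the feature files inside them.

```python
from pathlib import Path

# Subdirectory names taken from the directory tree above.
EXPECTED_SUBDIRS = ("testing", "training", "validation", "yc2")


def missing_feature_dirs(feature_dir):
    """Return the expected subdirectories that are absent under feature_dir.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(feature_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling missing_feature_dirs("features") returns an empty list when all four subdirectories are present, so the result can be used directly in a guard before training.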
We give examples of how to perform training and inference.
bash scripts/train.sh MODEL_TYPE TEMP_PARAM LAMBDA_PARAM CHECKPOINT_DIR FEATURE_DIR DURATION_PATH
MODEL_TYPE can be one of [vivt, viv, vi, v]; see the table below for details.
TEMP_PARAM and LAMBDA_PARAM are the Gumbel-softmax temperature and the lambda parameter, respectively (TEMP_PARAM=0.5 and LAMBDA_PARAM=0.5 work well in our experiments).
CHECKPOINT_DIR, FEATURE_DIR, and DURATION_PATH are the checkpoint directory, the feature directory, and the path to the duration CSV file, respectively.
MODEL_TYPE | Description |
---|---|
vivt | +Visual simulator+Textual re-simulator |
viv | +Visual simulator |
vi | Video+Ingredient |
v | Video |
To train the VIVT model:
scripts/train.sh vivt 0.5 0.5 /path/to/model/checkpoint/ /path/to/features/ /path/to/duration_frame.csv
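To give some intuition for TEMP_PARAM, here is a minimal sketch of the standard Gumbel-softmax relaxation in plain Python. It does not reproduce how this repository applies it; the function name gumbel_softmax_sample and the example logits are assumptions made for illustration. In PyTorch, torch.nn.functional.gumbel_softmax with its tau argument serves the same purpose.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng=random):
    """Draw one relaxed one-hot sample from unnormalised log-probabilities.

    Illustrative only: a textbook Gumbel-softmax, not the repository's code.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 to keep both logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    example_logits = [2.0, 0.5, -1.0]  # made-up scores for three candidates
    for tau in (5.0, 0.5):
        sample = gumbel_softmax_sample(example_logits, tau, random.Random(0))
        print(tau, [round(p, 3) for p in sample])
```

Lower temperatures push each sample closer to a one-hot vector, while higher temperatures give smoother, more uniform weights.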
Evaluate the trained model with word-overlap metrics (BLEU, METEOR, CIDEr-D, and ROUGE-L)
scripts/eval_caption.sh MODEL_TYPE CHECKPOINT_PATH FEATURE_DIR DURATION_PATH
Note that you should specify a checkpoint file (.chkpt) for CHECKPOINT_PATH.
Generated captions are saved at /path/to/model/checkpoint/MODEL_TYPE_test_greedy_pred_test.json.
This file is used for the ingredient prediction evaluation.
Evaluate ingredient prediction
scripts/eval_ingredient_f1.sh MODEL_TYPE CAPTION_PATH
The results should be comparable with the results shown in Table 4 of the paper.
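For readers unfamiliar with the metric, the sketch below shows one common way to compute precision, recall, and F1 between a predicted and a reference ingredient set. It assumes exact matching on lowercased names; the matching rules used by scripts/eval_ingredient_f1.sh may differ, and the function name ingredient_f1 is hypothetical.

```python
def ingredient_f1(predicted, reference):
    """Return (precision, recall, f1) for two collections of ingredient names.

    Assumption: names match only when equal after stripping and lowercasing.
    """
    pred = {name.strip().lower() for name in predicted}
    ref = {name.strip().lower() for name in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, predicting egg, salt, and butter against a reference of egg, salt, flour, and milk gives a precision of 2/3, a recall of 1/2, and an F1 of 4/7.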
Dump the learned embedding of ingredients
scripts/dump_embeddings.sh MODEL_TYPE CHECKPOINT_PATH FEATURE_DIR DURATION_PATH
This script generates ./MODEL_TYPE_step_embedding_dict.pkl, which contains the material embeddings at each step.
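The dumped file can be opened with Python's pickle module. The sketch below only builds the file name from MODEL_TYPE and reports the top-level type of the loaded object, because the internal layout of the dictionary is not documented here. The helper names are hypothetical. If the file stores PyTorch tensors or NumPy arrays, those libraries must be installed for unpickling to succeed, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def embedding_dict_path(model_type, directory="."):
    """Build the output path following the naming pattern described above."""
    return Path(directory) / f"{model_type}_step_embedding_dict.pkl"


def load_step_embeddings(model_type, directory="."):
    """Unpickle the dumped embeddings (hypothetical helper)."""
    with embedding_dict_path(model_type, directory).open("rb") as f:
        return pickle.load(f)


def describe(obj, max_keys=5):
    """Summarise an object without assuming its internal structure."""
    info = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        info["size"] = len(obj)
    if isinstance(obj, dict):
        info["sample_keys"] = list(obj)[:max_keys]
    return info
```

Running describe(load_step_embeddings("vivt")) is a safe first step before writing analysis code that depends on specific keys or array shapes.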
You can download them from here
You can evaluate this by converting the generated caption file (CHECKPOINT_PATH) into the CSV format that MIL-NCE requires. See here for additional information.
You can access them here.
The annotated ingredients are stored in the JSON files (see the ingredients keys).
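Since the exact nesting of the JSON files is not described here, the sketch below walks a parsed JSON document and collects every value stored under an ingredients key, wherever it appears. The function names are hypothetical, and the comment on the value type is an assumption rather than a description of the real annotation files.

```python
import json


def find_ingredient_lists(node, key="ingredients"):
    """Yield every value stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                # Assumption: the value is the annotated ingredient list.
                yield value
            else:
                yield from find_ingredient_lists(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_ingredient_lists(item, key)


def load_ingredient_lists(json_path):
    """Read a JSON file and return all values found under 'ingredients'."""
    with open(json_path, encoding="utf-8") as f:
        return list(find_ingredient_lists(json.load(f)))
```

Because the search is recursive, the same helper works whether the ingredients key sits at the top level or inside a per-video entry.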
If you use this code for your research, please cite our paper:
@inproceedings{taichi2021acmmm,
title={State-aware Video Procedural Captioning},
author={Taichi Nishimura and Atsushi Hashimoto and Yoshitaka Ushiku and Hirotaka Kameko and Shinsuke Mori},
booktitle={ACMMM},
pages={1766--1774},
year={2021}
}
This code is based on MART
taichitary [at] gmail.com.