Problem
Generating multi-sentence descriptions for videos requires both visual relevance and discourse-based coherence across the sentences in the paragraph
difficulty of generating sentences that are relevant, non-redundant, and coherent with one another
Goal
produce more coherent and less repetitive paragraph captions than baseline methods
build a model that can span multiple video segments and capture longer-range dependencies
In this paper
Memory module maintains a highly summarized memory state from the video segments and the sentence history
works as a memory updater that updates its memory state (a container of highly summarized video-segment and caption-history information) using both the current inputs and the previous memory state
transformer-based model that uses a shared encoder-decoder architecture augmented with an external memory module
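The gated memory update described above can be sketched as follows. This is an illustrative toy, not the paper's exact formulation: the weight names (`W_z`, `U_z`, `W_c`, `U_c`), the segment-summary vector `S`, and the slot count are assumptions for the sketch; in MART the summary itself comes from attention over the encoder's hidden states, and all parameters are learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8        # hidden size (illustrative)
m_slots = 2  # number of memory slots (illustrative)

# Hypothetical parameters: learned in the real model, random here.
W_z, U_z = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_c, U_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def update_memory(M_prev, S):
    """GRU-like gated update of the memory state.

    M_prev: (m_slots, d) previous memory state
    S:      (d,) summary of the current video segment + caption history
    """
    C = np.tanh(M_prev @ W_c + S @ U_c)   # candidate new memory content
    Z = sigmoid(M_prev @ W_z + S @ U_z)   # update gate in (0, 1)
    return (1 - Z) * M_prev + Z * C       # blend old state with candidate

M0 = np.zeros((m_slots, d))               # initial memory state
s1 = rng.normal(size=(d,))                # summary of segment 1 (dummy input)
M1 = update_memory(M0, s1)
print(M1.shape)  # (2, 8)
```

The gate lets the model keep long-range information (gate near 0 preserves the old state) while folding in the current segment and caption, which is what allows coherence across sentences.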
Paper: MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning