Awesome-Image Captioning

A paper list on image captioning (IC), intended as a supplementary reference to this short survey. Based on the survey, we collected recent IC papers and their code.

This paper list is organized as follows:

Ⅰ. existing surveys in the IC field

Ⅱ. three main directions of current IC:

The mainstream IC model today is a heterogeneous encoder-decoder architecture, with three major directions of improvement:

visual feature: advances in the encoder (CNN)

attention mechanism: changes in the attended source; modification of the architecture of the attention module

visual and language structure: explorations of structural inductive bias
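
As a rough illustration of the attention direction above, here is a minimal sketch of soft attention over visual features. It assumes scaled dot-product scoring for brevity (several of the listed papers use other scoring functions, such as an additive MLP), and the names `soft_attention`, `regions`, and `decoder_state` are hypothetical, chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(query, features):
    """Scaled dot-product soft attention (an assumed scoring function).

    query:    decoder hidden state, a list of floats
    features: visual feature vectors (grid cells or regions), each the
              same length as the query
    Returns the attention weights and the weighted-sum context vector.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * f for q, f in zip(query, feat)) / scale for feat in features
    ]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three 2-d "region" features and one decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = soft_attention(decoder_state, regions)
print(weights, context)
```

Changing what is passed as `features` (grid cells, detected regions, semantic tags) corresponds to changing the attended source, while replacing the scoring function corresponds to modifying the attention module itself.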

Ⅲ. Transformer & homogeneous architecture

Many remarkable performance improvements have been achieved since the advent of the Transformer.
Thanks to the Transformer's architectural advantages, a promising, purely Transformer-based homogeneous encoder-decoder captioner is around the corner.

Ⅳ. large scale pretraining

Motivated by NLP, researchers in the vision-language domain have also proposed training large-scale Transformer architectures. Some of these multi-modal large-scale pre-trained models can also be used for IC and achieve much better performance than small-scale ones.

If needed, I'm glad to add other paper information, such as journal references, and to keep updating the list with the latest awesome works. However, I'm currently busy with other work and cannot update this paper list in the near term.

Most of the journal references can be found on arXiv (via the pdf links already provided); I also recommend this website for searching source code.

Survey

A comprehensive survey of deep learning for image captioning. MD Zakir Hossain. | [pdf]
From Show to Tell: A Survey on Image Captioning. Matteo Stefanini. | [pdf]
Transformers in vision: A survey. Salman Khan. | [pdf]
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Pengfei Liu.|[pdf]
On the Opportunities and Risks of Foundation Models. Rishi Bommasani. | [pdf]
A Survey on Vision Transformer. Kai Han. | [pdf]

Current Image Captioning

Classic Encoder-Decoder Captioner

Sequence to sequence learning with neural networks. Ilya Sutskever. | [NIPS'14]| [pdf]
Cider: Consensus-based image description evaluation. Ramakrishna Vedantam. |evaluation metrics, CIDEr| [pdf]
Reinforcing an Image Caption Generator using Off-line Human Feedback. Paul Hongsuck Seo. | [pdf]
Interactive Dual Generative Adversarial Networks for Image Captioning. Junhao Liu. | [pdf]
Dependent Multi-Task Learning with Causal Intervention for Image Captioning. Wenqing Chen.|[pdf]
Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning. Arjit Jain. |[pdf]
Recurrent Relational Memory Network for Unsupervised Image Captioning. Dan Guo. | [pdf]
Memory-Augmented Image Captioning. Zhengcong Fei.| [pdf]
MemCap: Memorizing Style Knowledge for Image Captioning. Wentian Zhao. | [pdf]
Human Consensus-Oriented Image Captioning. Ziwei Wang. | [pdf]
Self-critical sequence training for image captioning. Steven J Rennie. |reinforcement learning-based strategy| [pdf]

Visual Feature -- CNN

Show and tell: A neural image caption generator. Oriol Vinyals. |image-level; GoogLeNet & pretrained by classification on ImageNet|[CVPR'15]| [pdf]
Show, attend and tell: Neural image caption generation with visual attention. Kelvin Xu. | grid feature; classification; hard attention & soft attention | [pdf]
Bottom-up and top-down attention for image captioning and visual question answering. Peter Anderson. |regional feature; object detection| [pdf]
Neural baby talk. Jiasen Lu. |attribute classification|[pdf]
Deconfounded image captioning: A causal retrospect. Xu Yang. | DIC; a rethink on dataset bias| [pdf]
Women also snowboard: Overcoming bias in captioning models. Lisa Anne Hendricks. | solution to dataset bias | [pdf]
In defense of grid features for visual question answering. Huaizu Jiang. | [pdf]

Attention Mechanism

Neural machine translation by jointly learning to align and translate. Dzmitry Bahdanau. | first introduced attention into the NLP field | [pdf]
Image captioning with semantic attention. Quanzeng You. |directly attend to semantic tags| [pdf]
Boosting image captioning with attributes. Ting Yao. | [pdf]
Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. Long Chen. | features from multi-channels| [pdf]
Recurrent fusion network for image captioning. Wenhao Jiang. |multi-CNNs| [pdf]
Reflective decoding network for image captioning. Lei Ke. |[pdf]
Look back and predict forward in image captioning. Yu Qin. | [pdf]
Meshed-memory transformer for image captioning. Marcella Cornia. |use different sources as Q, K, V; augmented memory| [pdf]
Attention on attention for image captioning. Lun Huang. | AoA| [pdf]
X-linear attention networks for image captioning. Yingwei Pan. |X-LAN|[pdf]
Show, Recall, and Tell: Image Captioning with Recall Mechanism. Li Wang.|[pdf]

Visual and Language Structure -- Inductive Bias

Exploring visual relationship for image captioning. Ting Yao.|scene graphs|[pdf]
Auto-encoding and distilling scene graphs for image captioning. Xu Yang. | [pdf]
Relational inductive biases, deep learning, and graph networks. Peter W Battaglia. |use GNN to embed relational inductive bias| [pdf]
Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. Shizhe Chen. | [pdf]
Hierarchy parsing for image captioning. Ting Yao. |tree-based encoder|[pdf]
Auto-parsing network for image captioning and visual question answering. Xu Yang. |text pattern |[pdf]
Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Jiasen Lu. |design two modules for vision and non-vision words|[pdf]
Learning to collocate neural modules for image captioning. Xu Yang. |four modules : object, attribute, relation, and function|[pdf]
MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-Based Image Captioning. Wenqiao Zhang. | [pdf]
Image Captioning with Context-Aware Auxiliary Guidance. Zeliang Song. | [pdf]
Consensus Graph Representation Learning for Better Grounded Image Captioning. Wenqiao Zhang. | [pdf]
Feature Deformation Meta-Networks in Image Captioning of Novel Objects. Tingjia Cao. | [pdf]

Transformer & Homogeneous Architecture

Attention is all you need. Ashish Vaswani. |[NIPS'17]|[pdf]
Swin transformer: Hierarchical vision transformer using shifted windows. Ze Liu. | visual encoder is a pre-trained vision Transformer| [pdf]
Tree transformer: Integrating tree structures into self-attention. Yau-Shian Wang. |[pdf]
Partially Non-Autoregressive Image Captioning. Zhengcong Fei. |generates in word groups; Transformer-based|[pdf]
Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. Jiayi Ji.|[pdf]
Dual-Level Collaborative Transformer for Image Captioning. Yunpeng Luo. | [pdf]
Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption. Wei Zhang. | [pdf]
TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. Zhihao Fan. | [pdf]
Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. Longteng Guo. | [pdf]

Large Scale Pretraining

Bert: Pre-training of deep bidirectional transformers for language understanding. Jacob Devlin. |[pdf]
Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Jiasen Lu. |[NIPS'19]|[pdf]
E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. Haiyang Xu. | [pdf]
Language models are unsupervised multitask learners. Alec Radford. |pre-train, prompt, and predict|[pdf]
Language models as knowledge bases? Fabio Petroni. | [pdf]
Oscar: Object-semantics aligned pre-training for vision-language tasks. Xiujun Li. | [pdf]
Uniter: Learning universal image-text representations. Yen-Chun Chen.| Masked Region Classification, Masked Region Feature Regression, and Masked Region Classification with KL-divergence | [pdf]
Align before fuse: Vision and language representation learning with momentum distillation. Junnan Li. | [pdf]
Open-vocabulary object detection using captions. Alireza Zareian.| [pdf]
Zero-shot text-to-image generation. Aditya Ramesh. | [pdf]
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning. Xiaowei Hu. | [pdf]
Unified Vision-Language Pre-Training for Image Captioning and VQA. Luowei Zhou. | [pdf]
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. Linjie Li. | [pdf]
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. Max Bain.|[pdf]
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling. Tsu-Jui Fu. | [pdf]
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. Kevin Lin.| [pdf]
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. Hassan Akbari.| [pdf]
BEVT: BERT Pretraining of Video Transformers. Rui Wang.|[pdf]

Prompt

Unifying vision-and-language tasks via text generation. Jaemin Cho. |unifies various vision-language tasks.| [pdf]
Multimodal few-shot learning with frozen language models. Maria Tsimpoukelli. | [pdf]
Simvlm: Simple visual language model pretraining with weak supervision. Zirui Wang.| [pdf]
Vqa: Visual question answering. Stanislaw Antol.|VQA|[pdf]
Learning transferable visual models from natural language supervision. Alec Radford. |CLIP| [ICML'21]|[pdf]

SjokerLily / awesome-image-captioning
