KunpengLi1994 / VSRN

PyTorch code for ICCV'19 paper "Visual Semantic Reasoning for Image-Text Matching"
288 stars · 47 forks

A little problem about the text generation part #1

Closed czhxiaohuihui closed 4 years ago

czhxiaohuihui commented 4 years ago

For the image representation part, we first obtain the region features V, then obtain V* through the RRR module (GCNs), and finally obtain m_k through the GSR module (a GRU) as the final representation of the whole image.

m_k and the sentence embedding are used to compute a retrieval loss, while V* is used to generate a caption and compute a generation loss. I am confused about the text generation process; the paper https://arxiv.org/abs/1712.02036 takes a similar approach.

In my view, the text generation part provides an extra loss that pushes the model to extract more information from the images during training. Could you explain it more clearly? Thanks a lot.
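The pipeline described in the question can be sketched roughly as follows. This is a minimal illustration with assumed shapes and a simplified one-layer GCN, not the repo's actual implementation; `ImagePipeline`, `dim`, and the adjacency construction are all assumptions for the sake of the example.

```python
import torch
import torch.nn as nn

# Hedged sketch (assumed names/shapes, not the authors' code) of the flow
# described above: region features V -> RRR (GCN) -> V* -> GSR (GRU) -> m_k.
class ImagePipeline(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.gcn_w = nn.Linear(dim, dim)            # simplified one-layer GCN
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, V, adj):
        # V:   (batch, regions, dim) region features
        # adj: (batch, regions, regions) region affinity matrix
        V_star = torch.relu(adj @ self.gcn_w(V)) + V  # residual GCN step (RRR)
        _, h = self.gru(V_star)                       # reason over regions (GSR)
        m_k = h[-1]                                   # last hidden state = m_k
        return V_star, m_k

model = ImagePipeline(dim=8)
V = torch.randn(2, 5, 8)
adj = torch.softmax(torch.randn(2, 5, 5), dim=-1)
V_star, m_k = model(V, adj)
```

Here m_k feeds the retrieval loss against the sentence embedding, while V* feeds the caption decoder for the generation loss.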

KunpengLi1994 commented 4 years ago

The goal of the text generation part is also to optimize the learning of the visual embeddings. It encourages the learned visual representation to also be able to generate sentences that are close to the ground-truth captions.
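The two training signals discussed in this thread could be combined along these lines. This is a hedged sketch, not the repo's exact code: the hinge margin, function names, and shapes are assumptions; the retrieval term is a standard hardest-style hinge loss over in-batch negatives, and the generation term is word-level cross-entropy from a decoder conditioned on V*.

```python
import torch
import torch.nn.functional as F

# Assumed-name sketch of the joint objective: retrieval loss on (m_k, s)
# plus caption-generation cross-entropy on decoder logits from V*.
def retrieval_loss(m_k, s, margin=0.2):
    # cosine similarity between every image/caption pair in the batch
    m_k = F.normalize(m_k, dim=-1)
    s = F.normalize(s, dim=-1)
    sim = m_k @ s.t()
    pos = sim.diag().unsqueeze(1)
    # hinge over violating negatives in both retrieval directions
    cost_s = (margin + sim - pos).clamp(min=0)
    cost_im = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_s.masked_fill(mask, 0).sum() + cost_im.masked_fill(mask, 0).sum()

def generation_loss(logits, targets):
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len) word ids
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

m_k, s = torch.randn(4, 8), torch.randn(4, 8)
logits, targets = torch.randn(4, 6, 10), torch.randint(0, 10, (4, 6))
total = retrieval_loss(m_k, s) + generation_loss(logits, targets)
```

Because both losses backpropagate into the visual branch, the generation term acts as the extra supervision the question describes: the visual embedding must retain enough semantics to reconstruct the caption, not just to rank it.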