Closed czhxiaohuihui closed 4 years ago
The goal of the text generation part is also to optimize the learning of the visual embeddings. It encourages the learned visual representation to also be capable of generating sentences that are close to the ground-truth captions.
For the image representation part, the model first extracts region features V, passes them through RRR (GCNs) to get V*, and then through GSR (a GRU) to get m_k, the final representation of the whole image.
m_k and the sentence embedding are used to compute a retrieval loss, while V* is used to generate a caption, which yields a generation loss. I am confused about the text generation process; the paper https://arxiv.org/abs/1712.02036 takes a similar approach.
In my view, this text generation part provides an extra loss that pushes the model to extract more information from the images during training. Could you explain it more clearly? Thanks a lot.
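If it helps to make the question concrete, here is a minimal toy sketch of the two-objective training signal as I understand it: a margin-based retrieval loss between m_k and the sentence embedding, plus a caption negative log-likelihood computed from the decoder conditioned on V*. All names (`retrieval_loss`, `generation_loss`, `LAMBDA`) and the scalar similarity inputs are illustrative assumptions, not taken from the paper's actual code.

```python
import math

def retrieval_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-based matching loss between the image embedding m_k and a
    sentence embedding: push the positive-pair similarity above the
    hardest negative by at least `margin`."""
    return max(0.0, margin - sim_pos + sim_neg)

def generation_loss(token_probs):
    """Average negative log-likelihood of the ground-truth caption tokens,
    as predicted by a decoder conditioned on the region features V*."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical weight balancing the two objectives.
LAMBDA = 1.0

def total_loss(sim_pos, sim_neg, token_probs):
    # Both terms back-propagate into the shared visual representation,
    # so the caption decoder acts as auxiliary supervision on the
    # visual embeddings rather than as an end task in itself.
    return retrieval_loss(sim_pos, sim_neg) + LAMBDA * generation_loss(token_probs)
```

So my reading is that at test time only the retrieval branch matters, and the generation branch exists purely to shape what the visual features encode during training.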