Closed czhxiaohuihui closed 4 years ago
The goal of the text generation part is also to optimize the learning of the visual embeddings. It encourages the learned visual representation to also be capable of generating sentences that are close to the ground-truth captions.
For the image representation part, the model first extracts region features V, passes them through RRR (GCNs) to get V*, and then through GSR (a GRU) to get m_k, the final representation of the whole image.
m_k and the sentence embedding are used to compute a retrieval loss, while V* is used to generate a caption, which yields a generation loss. I am confused about the text generation process; the paper https://arxiv.org/abs/1712.02036 takes a similar approach.
In my view, this text generation part provides an extra loss that pushes the model to extract more information from the images during training. Could you explain it more clearly? Thanks a lot.
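If it helps to make the question concrete, here is a minimal toy sketch of the two-objective training signal as I understand it: a margin-based retrieval loss between m_k and the sentence embedding, plus a caption negative log-likelihood computed from the decoder conditioned on V*. All names (`retrieval_loss`, `generation_loss`, `LAMBDA`) and the scalar similarity inputs are illustrative assumptions, not taken from the paper's actual code.

```python
import math

def retrieval_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-based matching loss between the image embedding m_k and a
    sentence embedding: push the positive-pair similarity above the
    hardest negative by at least `margin`."""
    return max(0.0, margin - sim_pos + sim_neg)

def generation_loss(token_probs):
    """Average negative log-likelihood of the ground-truth caption tokens,
    as predicted by a decoder conditioned on the region features V*."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical weight balancing the two objectives.
LAMBDA = 1.0

def total_loss(sim_pos, sim_neg, token_probs):
    # Both terms back-propagate into the shared visual representation,
    # so the caption decoder acts as auxiliary supervision on the
    # visual embeddings rather than as an end task in itself.
    return retrieval_loss(sim_pos, sim_neg) + LAMBDA * generation_loss(token_probs)
```

So my reading is that at test time only the retrieval branch matters, and the generation branch exists purely to shape what the visual features encode during training.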