Closed: shreyanshchordia closed this issue 3 years ago
Hi @shreyanshchordia,
to encode images, we used the features coming from a Faster R-CNN trained on Visual Genome (here is the source code we used to extract the features: https://github.com/peteanderson80/bottom-up-attention). For each detected region, the model returns a 2048-d feature vector. The input to our encoder is therefore a set of 2048-d feature vectors, each corresponding to a specific region of the image.
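As a minimal sketch of the setup described above: a stack of per-region 2048-d Faster R-CNN feature vectors is projected to the model width and fed to a Transformer encoder as a set. This is illustrative only; the region count, model width, layer count, and head count below are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

num_regions, feat_dim, d_model = 36, 2048, 512  # 36 regions is a common bottom-up-attention setting (assumed here)

regions = torch.randn(1, num_regions, feat_dim)           # (batch, regions, 2048) region features
project = nn.Linear(feat_dim, d_model)                    # map 2048-d region features to model width
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)      # layer count is illustrative

memory = encoder(project(regions))                        # (batch, regions, d_model)
print(memory.shape)
```

Because the encoder attends over the set of region vectors, the number of regions per image can vary (with padding and a mask in a batched setting).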
Hope this helps!
Thank you so much for the reply!
Are you feeding the raw image regions (y × y × 3 pixel arrays), or do you feed deep-CNN-generated embeddings to your encoder-decoder network?