aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020
BSD 3-Clause "New" or "Revised" License

Architecture related doubt #36

Closed shreyanshchordia closed 3 years ago

shreyanshchordia commented 3 years ago

Are you feeding the raw image regions (H x W x 3 pixels) to your Encoder-Decoder network, or do you feed embeddings generated by a deep CNN?

marcellacornia commented 3 years ago

Hi @shreyanshchordia,

to encode images, we used features from a Faster R-CNN trained on Visual Genome (here is the source code we used to extract the features: https://github.com/peteanderson80/bottom-up-attention). For each detected region, the model returns a 2048-d feature vector. The input to our encoder is therefore a set of 2048-d feature vectors, each corresponding to a specific region of the image.
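To make the shapes concrete, here is a minimal sketch of the encoder input described above. The region count and the use of random placeholder features are assumptions for illustration; only the 2048-d per-region feature size comes from the answer.

```python
import numpy as np

# Hypothetical illustration: Faster R-CNN (bottom-up attention) returns
# one 2048-d feature vector per detected region, so an image with N
# regions becomes an (N, 2048) array rather than a raw (H, W, 3) tensor.
num_regions = 36   # assumed number of detections for this sketch
feature_dim = 2048  # per-region feature size stated in the answer

# Placeholder features standing in for real Faster R-CNN output.
region_features = np.random.rand(num_regions, feature_dim).astype(np.float32)

# A batch of images (padded to the same number of regions) gives the
# encoder an input of shape (batch_size, num_regions, feature_dim).
batch = np.stack([region_features, region_features])
print(batch.shape)  # (2, 36, 2048)
```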

Hope this helps!

shreyanshchordia commented 3 years ago

Thank you so much for the reply