aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020
BSD 3-Clause "New" or "Revised" License

From features of new images to M2 transformer #7

Closed alesolano closed 4 years ago

alesolano commented 4 years ago

First of all, congrats on your work and thanks for releasing the code! 😄

Following #2 and #5, I'm trying to run the network on a new set of images. To get the image features I went to the bottom-up attention repo you suggested here, using the Faster-R-CNN-ResNet101 model with these weights.

My problem is the following: how do I transform the outputs of this feature extractor into the format you require?

Following the Readme and code, I understand that you need to express the features as a Nx2048 tensor. Following this line, I understand that you also need a cls_prob vector to sort your feature vector.

Now, I took the blob res5c for the features and cls_prob for the probabilities, but the dimensions are not quite what I expected. res5c has dimension Nx2048x14x14, so I guess each 14x14 map should be reduced to a single number. And cls_prob has dimension Nx1061, which doesn't match anything else.

Am I missing something?

Thanks!

marcellacornia commented 4 years ago

You shouldn't have the spatial dimensions for each image region. Did you try to take the output of the average pooling?
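To illustrate what removing the spatial dimensions means, here is a minimal NumPy sketch that averages each region's 14x14 map down to one value per channel. The function name `pool_region_features` and the random input are made up for the example; in the actual extractor you would more likely read the output of its own average-pooling layer than pool res5c yourself.

```python
import numpy as np


def pool_region_features(res5c: np.ndarray) -> np.ndarray:
    """Average each region's feature map over its spatial grid.

    res5c: array of shape (N, C, H, W), e.g. (N, 2048, 14, 14).
    Returns an array of shape (N, C), e.g. (N, 2048).
    """
    if res5c.ndim != 4:
        raise ValueError(f"expected a 4-D array, got shape {res5c.shape}")
    # Mean over the two spatial axes (H and W), keeping N and C.
    return res5c.mean(axis=(2, 3))


# Random data standing in for real detector output (assumption for the demo).
dummy = np.random.rand(5, 2048, 14, 14).astype(np.float32)
pooled = pool_region_features(dummy)
print(pooled.shape)  # (5, 2048)
```

This is global average pooling: each of the 2048 channels contributes one number per region, which gives the Nx2048 layout discussed above.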

The cls_prob tensor should have a shape equal to (N, 1601) where 1600 (plus 1 for the background) is the number of possible detection classes.
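As a rough sketch of how a (N, 1601) cls_prob array could be used to order the (N, 2048) features, the snippet below sorts regions by their highest class probability, most confident first. The helper name `sort_regions_by_confidence` is hypothetical, and taking the maximum over all 1601 columns (background included) is an assumption, not something confirmed in this thread.

```python
import numpy as np


def sort_regions_by_confidence(features: np.ndarray, cls_prob: np.ndarray) -> np.ndarray:
    """Return the region features ordered by descending max class probability.

    features: (N, D) array of region features, e.g. (N, 2048).
    cls_prob: (N, K) array of class probabilities, e.g. (N, 1601).
    """
    if features.shape[0] != cls_prob.shape[0]:
        raise ValueError("features and cls_prob must describe the same N regions")
    # Assumption: confidence is the max over every class column.
    confidence = cls_prob.max(axis=-1)
    # argsort is ascending, so reverse it to put the most confident region first.
    order = np.argsort(confidence)[::-1]
    return features[order]


# Toy data: 4 regions whose top probabilities are 0.2, 0.9, 0.5 and 0.7.
rng = np.random.default_rng(0)
features = rng.random((4, 2048), dtype=np.float32)
cls_prob = np.zeros((4, 1601), dtype=np.float32)
cls_prob[np.arange(4), [10, 20, 30, 40]] = [0.2, 0.9, 0.5, 0.7]

sorted_features = sort_regions_by_confidence(features, cls_prob)
```

With this toy input, the regions come out in the order 1, 3, 2, 0.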

TrungThanhTran commented 4 years ago

@marcellacornia thanks for the information. @alesolano, I wrote these lines of code, and they worked fine when I plugged them into the M2 model. Please try them if you need to:

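    import torch.nn as nn
    from torchvision.models import resnet50

    class Encoder(nn.Module):  # hypothetical wrapper; the original snippet showed only the __init__ body
        def __init__(self, encoded_image_size, embed_dim):
            super().__init__()
            # Original ResNet-50
            resnet = resnet50(pretrained=True)

            # Remove the final pooling and linear layers (since we're not doing classification)
            modules = list(resnet.children())[:-2]
            self.resnet = nn.Sequential(*modules)
            self.dropout = nn.Dropout(0.5)

            # Pool the feature map to a fixed size so that input images can vary in size
            self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

            # Added layers
            self.avgpool = nn.AvgPool2d(encoded_image_size)
            self.affine_embed = nn.Linear(2048, embed_dim)
            # Named conv_1d in the original, but this is a 1x1 2-D convolution
            self.conv_1d = nn.Conv2d(2048, embed_dim, kernel_size=1)
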
alesolano commented 4 years ago

Thanks @marcellacornia, @TranTony for the responses! I'll try that last snippet of code tomorrow and post the outcome here so we can eventually close this issue.

UPDATE: Yes @marcellacornia, that's exactly what I needed. The pool5_flat layer has an output of Nx2048. Thanks!