I have a question about the image feature map. In the paper, you mention that you use ResNet34 pretrained on ImageNet as the image encoder. Could you please give more details about which layers you used? Did you remove the final global average pooling and FC layers of the ResNet backbone?
I assume that you encode each input image into a single feature map F, since you later compute an attention map and map it back onto this feature map F at each step. If so, you must have added some decoder layers after the last conv block of ResNet-34, right? Please correct me if I have misunderstood. Thanks!
Since the code has been released, you can see there that we use both the 2D feature map and the flattened one from ResNet34. No additional layers are used.
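For readers who want to see what these two outputs look like in terms of shapes, here is a minimal sketch. It is not taken from the released code. It assumes the standard ResNet-34 layout (7x7 stride-2 stem, 3x3 stride-2 max pool, four stages ending in 512 channels) with global average pooling and the FC layer left out, and it assumes "flattened" means the spatial grid reshaped into a list of per-location vectors; the released code may define it differently. The helper names `conv_out` and `resnet34_feature_shapes` are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet34_feature_shapes(height, width):
    """Shapes produced by a standard ResNet-34 trunk for one RGB image.

    Assumption: usual layout, with global average pooling and FC removed.
    """
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)  # stem conv
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)           # max pool
    for stride in (1, 2, 2, 2):  # first conv of layer1..layer4
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
    channels = 512  # width of the last stage in ResNet-34
    return {
        "feature_map_2d": (channels, h, w),  # spatial map F
        "flattened": (h * w, channels),      # assumed: one vector per location
    }


shapes = resnet34_feature_shapes(224, 224)
print(shapes["feature_map_2d"], shapes["flattened"])
```

Under these assumptions, a 224x224 input gives a 512x7x7 map, so an attention map over F has 49 positions and no decoder layers are needed to produce it.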