Image Feature Map

Hi,

Congratulations on this impressive work.

I have a question about the image feature map. In the paper, you mention that you use a ResNet-34 pretrained on ImageNet as the image encoder. Could you please provide more details about which layers you used? Did you remove the final global average pooling and FC layers of the ResNet backbone?
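To make the question concrete, here is a rough sketch of the feature map shapes I would expect if the average pooling and FC layers are dropped and the output of the last residual stage is used directly. It is shape arithmetic only, based on the standard ResNet-34 layout (stem convolution, max pooling, four residual stages); the 224 x 224 input size and the helper names `conv_out` and `feature_map_shapes` are my own assumptions, not something taken from the paper or the repository.

```python
# Shape arithmetic only: no deep-learning framework is needed.
# The stage table follows the standard ResNet-34 layout; the input
# resolution used below is an assumption for illustration.

def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool op along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, out_channels) of the op that sets each
# stage's output resolution. The remaining 3x3 convolutions in a stage use
# stride 1 and padding 1, so they leave the spatial size unchanged.
RESNET34_STAGES = [
    ("conv1", 7, 2, 3, 64),
    ("maxpool", 3, 2, 1, 64),
    ("layer1", 3, 1, 1, 64),
    ("layer2", 3, 2, 1, 128),
    ("layer3", 3, 2, 1, 256),
    ("layer4", 3, 2, 1, 512),
]

def feature_map_shapes(height, width):
    """Return [(stage_name, (channels, height, width)), ...] for an input image."""
    shapes = []
    for name, kernel, stride, padding, channels in RESNET34_STAGES:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
        shapes.append((name, (channels, height, width)))
    return shapes

if __name__ == "__main__":
    for name, shape in feature_map_shapes(224, 224):
        print(name, shape)
```

Under these assumptions the last stage gives a 512-channel map at 1/32 of the input resolution (512 x 7 x 7 for a 224 x 224 image), whereas keeping the global average pooling would collapse it to a single 512-dimensional vector with no spatial layout left for an attention map.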

I assume that each input image is ultimately encoded into a single feature map F, since you later compute an attention map and apply it back to this feature map F at each step. If so, you must have added some decoder layers after the last convolutional block of the ResNet-34, right? Please correct me if I have misunderstood. Thanks!

Best wishes

OpenDriveLab / TCP

Image Feature Map #7