airsplay / lxmert

PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
MIT License

Shapes of inputs and output in entry.py #61

Closed by ecekt 4 years ago

ecekt commented 4 years ago

Hello, I wanted to check whether the shapes of my inputs and the output I get from entry.py are correct. For a single example whose tokenized sentence length is 30, input_ids, input_mask and segment_ids each have shape (1, 30), and the image features and boxes have shapes (1, 36, 2048) and (1, 36, 4), respectively. As the cross-modal output, I get shape (1, 768). Are these correct? And how can I get the un-pooled outputs?
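For reference, the shapes described here can be written down as a small sanity check. This is only a sketch: `expected_shapes` and `check_shapes` are hypothetical helpers, not part of the LXMERT repository. The un-pooled shapes (one 768-dimensional vector per token and per box) are an assumption inferred from the pooled size, not something confirmed in this thread.

```python
# Hypothetical helpers for illustration; not part of the LXMERT code base.
def expected_shapes(batch, seq_len, num_boxes, feat_dim=2048, hidden=768):
    """Return the shapes discussed in this issue as plain tuples."""
    return {
        # Inputs, as reported in the question.
        "input_ids": (batch, seq_len),
        "input_mask": (batch, seq_len),
        "segment_ids": (batch, seq_len),
        "features": (batch, num_boxes, feat_dim),
        "boxes": (batch, num_boxes, 4),
        # Pooled cross-modal output, as reported in the question.
        "pooled_output": (batch, hidden),
        # ASSUMPTION: un-pooled outputs keep one hidden vector per token / per box.
        "lang_output": (batch, seq_len, hidden),
        "visn_output": (batch, num_boxes, hidden),
    }


def check_shapes(actual, expected):
    """Return the sorted names whose actual shape differs from the expected one."""
    return sorted(
        name
        for name, shape in expected.items()
        if name in actual and tuple(actual[name]) != shape
    )


shapes = expected_shapes(batch=1, seq_len=30, num_boxes=36)
observed = {
    "input_ids": (1, 30),
    "features": (1, 36, 2048),
    "boxes": (1, 36, 4),
    "pooled_output": (1, 768),
}
mismatches = check_shapes(observed, shapes)
print(mismatches)  # prints [] when every observed shape matches
```

With real tensors, `tuple(tensor.shape)` could be passed in place of the literal tuples in `observed`.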

Best regards

airsplay commented 4 years ago

You could take lang_output / visn_outputs.

ecekt commented 4 years ago

Thank you for the response. I have just found that I can use the 'mode' argument in the init of LXRTFeatureExtraction.
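For readers who land here, a rough sketch of how a 'mode' string might choose between pooled and un-pooled outputs. Everything in it is an assumption for illustration: `select_outputs` is a hypothetical function, and the letters 'l' (language), 'r' (vision) and 'x' (cross-modal pooled) are assumed values for the mode string. Check the LXRTFeatureExtraction source for the behaviour it actually implements.

```python
# Hypothetical sketch; the function name and the mode letters are assumptions,
# not the repository's confirmed API.
def select_outputs(mode, lang_feats, visn_feats, pooled):
    """Pick which outputs to return based on the letters in `mode`."""
    want_seq = "l" in mode or "r" in mode   # un-pooled, per-token / per-box features
    want_pooled = "x" in mode               # pooled cross-modal vector
    if not (want_seq or want_pooled):
        raise ValueError(f"unsupported mode: {mode!r}")

    feat_seq = (lang_feats, visn_feats)
    if want_seq and want_pooled:
        return feat_seq, pooled
    if want_seq:
        return feat_seq
    return pooled


# Placeholder strings stand in for tensors so the sketch runs without PyTorch.
pooled_only = select_outputs("x", "LANG", "VISN", "POOLED")
unpooled_only = select_outputs("lr", "LANG", "VISN", "POOLED")
everything = select_outputs("lxr", "LANG", "VISN", "POOLED")
```

Under these assumptions, a mode containing 'l' or 'r' would expose the un-pooled language and vision features, while 'x' alone would return only the pooled vector of shape (1, 768) mentioned above.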