airsplay / lxmert

PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
MIT License

visual_attention_mask is set to None during pre-training? #91

Closed: zhmd closed this issue 3 years ago

zhmd commented 3 years ago

Thanks for this great repo!

I'm curious why, during pre-training, visual_attention_mask never seems to be passed to LXRTModel, as shown in the lines linked below: https://github.com/airsplay/lxmert/blob/0db1182b9030da3ce41f17717cc628e1cd0a95d5/src/lxrt/modeling.py#L924-927
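Schematically, the call at those lines looks something like the following (a paraphrase for discussion, not the verbatim source; names follow the linked file). Since visual_attention_mask is a keyword argument that is never supplied, it falls back to its None default:

```python
# Paraphrased sketch of the pre-training forward call in src/lxrt/modeling.py.
# visual_attention_mask is never passed here, so LXRTModel.forward receives None.
output = self.bert(
    input_ids, token_type_ids, attention_mask,
    visual_feats=(visual_feats, pos),
)
```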

The signature of LXRTModel is defined here:

https://github.com/airsplay/lxmert/blob/0db1182b9030da3ce41f17717cc628e1cd0a95d5/src/lxrt/modeling.py#L845-L846
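For reference, the signature at those lines is roughly the following (defaults assumed from context; check the source for the exact declaration), so the visual mask is optional and simply unused when omitted:

```python
# Sketch of LXRTModel.forward as declared at the linked lines.
def forward(self, input_ids, token_type_ids=None, attention_mask=None,
            visual_feats=None, visual_attention_mask=None):
    ...
```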

If I'm understanding correctly, visual_attention_mask should be the feat_mask built at this line, which is saved in object_labels['feat'][1]: https://github.com/airsplay/lxmert/blob/0db1182b9030da3ce41f17717cc628e1cd0a95d5/src/pretrain/lxmert_pretrain.py#L178
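If that reading is right, the change in question would look roughly like the sketch below. This is purely hypothetical (not a patch from the repo) and assumes feat_mask has shape (batch, num_objects) with 1 for regions to attend to:

```python
# Hypothetical: forward the feature mask as the visual attention mask.
feat, feat_mask = object_labels['feat']    # as packed in lxmert_pretrain.py
output = self.bert(
    input_ids, token_type_ids, attention_mask,
    visual_feats=(visual_feats, pos),
    visual_attention_mask=feat_mask,       # currently this kwarg is left as None
)
```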

@airsplay, would you mind clarifying why visual_attention_mask is not used in the LXRT encoder?
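For context on what the argument would do if supplied: modeling.py appears to follow the standard BERT convention of turning a 0/1 mask into an additive bias on the attention logits, so a None mask means every visual region is attended to. A minimal, self-contained illustration of that convention (my own sketch, not repo code):

```python
import torch

def extend_mask(mask: torch.Tensor) -> torch.Tensor:
    """Convert a (batch, num_objects) 0/1 keep-mask into an additive
    attention bias, following the standard BERT convention."""
    ext = mask[:, None, None, :].float()   # (batch, 1, 1, num_objects)
    return (1.0 - ext) * -10000.0          # masked positions get a large negative bias

mask = torch.tensor([[1, 1, 0]])           # third region would be ignored
print(extend_mask(mask))                   # bias is -10000.0 only at the masked slot
```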