airsplay / lxmert

PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
MIT License

masked_lm_loss being optimized even when label is not matched? #51

Closed singh-hrituraj closed 4 years ago

singh-hrituraj commented 4 years ago

In this code snippet from src/lxrt/modeling.py https://github.com/airsplay/lxmert/blob/9b8f0ffd56dba5490f37af9c14f3c112cc10358c/src/lxrt/modeling.py#L940-L953 the masked_lm_loss is still optimized even when the matched label is negative, i.e. even when the image is not related to the caption/text, since there is no check or mask multiplication on that loss.

This might not be exactly a bug, since the masked word can still be recovered from the language stream alone, but was it intended? Or am I missing some other part of the code?
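For illustration, here is a minimal sketch of the kind of gating the question is asking about, i.e. multiplying the per-example MLM loss by the match indicator. The function and tensor names (`gated_masked_lm_loss`, `matched_label`, the `-1` ignore index) are assumptions for this sketch, not code from the repo:

```python
import torch
import torch.nn.functional as F

def gated_masked_lm_loss(lm_logits, lm_labels, matched_label):
    """Hypothetical variant: zero out the MLM loss for examples whose
    image-text pair does not match (matched_label == 0).

    lm_logits:     (batch, seq_len, vocab_size)
    lm_labels:     (batch, seq_len), -1 marks unmasked positions
    matched_label: (batch,), 1 if image and text match, else 0
    """
    batch, seq_len, vocab = lm_logits.shape
    # Per-token loss, keeping the batch dimension so it can be gated.
    per_token = F.cross_entropy(
        lm_logits.view(-1, vocab), lm_labels.view(-1),
        ignore_index=-1, reduction="none",
    ).view(batch, seq_len)
    # Multiply by the match indicator so unmatched pairs contribute nothing.
    gated = per_token * matched_label.unsqueeze(1).float()
    # Average over masked tokens of matched examples only.
    n_tokens = ((lm_labels != -1) & (matched_label.unsqueeze(1) == 1)).sum().clamp(min=1)
    return gated.sum() / n_tokens
```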

airsplay commented 4 years ago

Yep. It is intended, and it helps maximize GPU utilization.

The same thing happens in BERT pre-training with the NSP (next sentence prediction) task: even when the two sentences do not match, the masked language model loss is still optimized. The idea in BERT is that the masked word can be recovered from the single sentence alone, as you noted.
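For comparison, a rough sketch of the BERT-style joint objective described here, where the MLM loss and the matching (NSP-style) loss are computed on every example and simply summed, with no gating. Variable names are illustrative only and do not come from the repo:

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(lm_logits, lm_labels, match_logits, matched_label):
    """Illustrative BERT/LXMERT-style joint objective: the MLM loss and the
    matching loss are both computed on every example and added together,
    regardless of whether the image-text (or sentence) pair actually matches."""
    vocab_size = lm_logits.size(-1)
    masked_lm_loss = F.cross_entropy(
        lm_logits.view(-1, vocab_size), lm_labels.view(-1), ignore_index=-1
    )
    matched_loss = F.cross_entropy(match_logits.view(-1, 2), matched_label.view(-1))
    return masked_lm_loss + matched_loss
```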

singh-hrituraj commented 4 years ago

Thanks for the clarification. And thanks a lot for the amazing code!