Closed: singh-hrituraj closed this issue 4 years ago.
In this code snippet from src/lxrt/modeling.py https://github.com/airsplay/lxmert/blob/9b8f0ffd56dba5490f37af9c14f3c112cc10358c/src/lxrt/modeling.py#L940-L953, masked_lm_loss is still being optimized when the label does not match, i.e., even when the image is unrelated to the caption/text, since there is no check or mask multiplication on that loss term.

This might not be exactly a bug, since the masked word can still be recovered using the language stream alone, but was it intended? Or am I missing some other part of the code?
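For reference, here is a minimal sketch of what those lines appear to compute, assuming a standard CrossEntropyLoss with ignore_index=-1 for unmasked positions; the variable names are illustrative, not copied from the repo:

```python
import torch
from torch.nn import CrossEntropyLoss

# Illustrative shapes: batch of 2 image-text pairs, sequence length 4,
# vocabulary of 10 tokens.
vocab_size = 10
lang_prediction_scores = torch.randn(2, 4, vocab_size)  # MLM head output
masked_lm_labels = torch.tensor([[-1,  3, -1,  7],      # -1 = not masked
                                 [ 5, -1, -1, -1]])
matched_label = torch.tensor([1, 0])  # 1 = image matches text, 0 = mismatched

loss_fct = CrossEntropyLoss(ignore_index=-1)  # unmasked positions are skipped
masked_lm_loss = loss_fct(
    lang_prediction_scores.view(-1, vocab_size),
    masked_lm_labels.view(-1),
)
# Note: matched_label is never consulted here, so the MLM loss is also
# optimized for the mismatched pair (second row of the batch).
total_loss = masked_lm_loss
```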
Yep. It is intended, and it is used to maximize GPU utilization. The same thing happens in BERT pre-training, where the NSP (next sentence prediction) task is applied: when the two sentences do not match, the masked language model loss is still optimized. The idea in BERT is the same as what you described, that the masked word can still be recovered using the single sentence alone.
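For contrast, a sketch of what gating the MLM loss on the match label could look like (again with illustrative names, not the repo's code). The forward pass over the mismatched pair has already been paid for, so zeroing its loss terms would discard training signal the GPU already produced, which is the utilization argument above:

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10
lang_prediction_scores = torch.randn(2, 4, vocab_size)
masked_lm_labels = torch.tensor([[-1,  3, -1,  7],
                                 [ 5, -1, -1, -1]])
matched_label = torch.tensor([1, 0])

# Per-token losses instead of a single mean, so they can be gated per example.
loss_fct = CrossEntropyLoss(ignore_index=-1, reduction='none')
per_token = loss_fct(
    lang_prediction_scores.view(-1, vocab_size),
    masked_lm_labels.view(-1),
).view(2, 4)

# Zero out the MLM loss on mismatched pairs before averaging.
gate = matched_label.float().unsqueeze(1)           # (batch, 1)
valid = (masked_lm_labels != -1).float() * gate     # masked tokens on matched pairs
masked_lm_loss = (per_token * gate).sum() / valid.sum().clamp(min=1)
```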
Thanks for the clarification. And thanks a lot for the amazing code!