🐛 Bug
I think there is a bug in how the Cross-Attention Regularization (CAR) is applied in the joint speech-to-text task.
According to equation (3) in the paper "Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task", the teacher states in the Cross-Attention Regularization (CAR) come from the text representations.
But it seems that in the code for `s2t_dualinputtransformer.py`, the default is the opposite, meaning that the teacher states come from the speech representations. The correct behavior (teacher=text) can be enabled by setting the argument `cross_attentive_loss_reverse`, but this is always set to False. To induce the behavior described in the paper, swapping lines 529-534 with 536-541 should work.
To Reproduce
I did not try to reproduce an error; I think both ways of applying CAR would work, but the one described in the paper is not the default.
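For context, here is a minimal sketch of how I understand CAR from the paper: the teacher states are detached so they act as a fixed target, and the student states are pulled toward an attention-weighted reconstruction built from them. This is a hypothetical re-implementation, not the fairseq code; the function name, tensor shapes, and the dot-product similarity are my assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attentive_loss(teacher_states, student_states):
    """Sketch of CAR: pull student states toward an attention-weighted
    mixture of (detached) teacher states.

    teacher_states: (B, T_teacher, D) -- e.g. text representations
    student_states: (B, T_student, D) -- e.g. speech representations
    """
    x = teacher_states.detach()                        # teacher is a fixed target
    sim = torch.bmm(student_states, x.transpose(1, 2)) # (B, T_student, T_teacher)
    attn = F.softmax(sim, dim=-1)                      # student-to-teacher attention
    recon = torch.bmm(attn, x)                         # teacher mixture per student step
    return F.mse_loss(student_states, recon)
```

The point of the issue is only which side plays the teacher role: with teacher=text (as in the paper), the `.detach()` keeps gradients out of the text branch, so only the speech (student) branch is adapted; the current default does the reverse.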