🐛 Bug
I think there is a bug in how the Cross-Attention Regularization (CAR) is applied in the joint speech-to-text task.
According to equation (3) in the paper "Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task", the teacher states in the Cross-Attention Regularization (CAR) come from the text representations.
But it seems that in the code for `s2t_dualinputtransformer.py`, the default is the opposite, meaning that the teacher states come from the speech representations. The correct behavior (teacher=text) can be enabled by setting the argument `cross_attentive_loss_reverse`, but this is always set to False. To induce the behavior described in the paper, swapping lines 529-534 with 536-541 should work.
To Reproduce
I did not try to reproduce an error; I think both ways of applying CAR would work, but the one described in the paper is not the default.
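For context, here is a minimal sketch of how I understand CAR from the paper: the teacher states are detached so they act as a fixed target, and the student states are pulled toward an attention-weighted reconstruction built from them. This is a hypothetical re-implementation, not the fairseq code; the function name, tensor shapes, and the dot-product similarity are my assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attentive_loss(teacher_states, student_states):
    """Sketch of CAR: pull student states toward an attention-weighted
    mixture of (detached) teacher states.

    teacher_states: (B, T_teacher, D) -- e.g. text representations
    student_states: (B, T_student, D) -- e.g. speech representations
    """
    x = teacher_states.detach()                        # teacher is a fixed target
    sim = torch.bmm(student_states, x.transpose(1, 2)) # (B, T_student, T_teacher)
    attn = F.softmax(sim, dim=-1)                      # student-to-teacher attention
    recon = torch.bmm(attn, x)                         # teacher mixture per student step
    return F.mse_loss(student_states, recon)
```

The point of the issue is only which side plays the teacher role: with teacher=text (as in the paper), the `.detach()` keeps gradients out of the text branch, so only the speech (student) branch is adapted; the current default does the reverse.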