ereday opened this issue 2 years ago
I would greatly appreciate it if you could let me know what needs to be done to perform a proper training, @xutaima. Thanks in advance.
Looking at old issues related to SiMT, I see that you (@jmp84) may also be able to help with this issue. I would greatly appreciate it if you could let me know what needs to be done to perform a proper training.
@ereday ... I am trying to reproduce the results for "infinite_lookback" attention. Did you get any results with this? My BLEU scores are very low. I have fixed the bug you mentioned above, but there is still no improvement. Thanks in advance.
Hello @sathishreddy, have you found a solution?
🐛 Bug
@xutaima Following other active/closed issues related to this model, I understand that I should contact you regarding this issue. I am unable to train Multihead Monotonic Attention models (I tried both IL and Hard). There are two main issues here. First, the training command given in the official readme seems wrong: it does not match the paper. Searching through the old active/closed issues, I ended up using the following command to train the MMA-Hard variant:
Note that there are several important differences between the command above and the readme:

- `--update-freq` should be set based on the number of GPUs used during training. In my case, I use 8 of them, so I set `--update-freq` to 8.
- `--max-source-positions`, `--max-target-positions`, and `--max-update` also differ.

So, my first question is: could you share the complete and correct training command, please?
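As a rough sanity check on the `--update-freq` choice (a sketch with hypothetical numbers, not values from the paper): in fairseq the effective batch size per update scales as max-tokens per GPU times the number of GPUs times the update frequency, so the product should stay constant when the GPU count changes.

```python
# Rough sketch (hypothetical numbers): fairseq's effective batch size
# per optimizer update is max-tokens * num_gpus * update-freq, so
# --update-freq must compensate when the number of GPUs changes.
def effective_tokens_per_update(max_tokens, num_gpus, update_freq):
    return max_tokens * num_gpus * update_freq

# An 8-GPU run with --update-freq 1 matches a 1-GPU run with --update-freq 8:
print(effective_tokens_per_update(3584, 8, 1) ==
      effective_tokens_per_update(3584, 1, 8))
```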
More importantly, I'd like to report a very critical bug in the `label_smoothed_cross_entropy_latency_augmented` loss function. In the current implementation [LINK], both the weighted-average latency loss and the head divergence loss are weighted by the same coefficient, `latency_avg_weight`, which is set to 0.0 for the MMA-Hard model. This means no latency regularisation term is used during training. I trained two models locally using the above command: one before and one after fixing the bug in the loss function.
The model trained with the buggy loss function ended up with an acceptable BLEU score on WMT15 de-en (still 1.5 points lower than what you report in the paper for some reason, but that is OK for now). However, because it was trained without a latency regularization term, the model has terrible latency scores:
The model trained after the bug-fix* has much better latency scores. However, its BLEU score dropped a lot:
As you can see, I tried many different lambda values, and none of them worked well.
According to Table 6 in your paper, with lambda set to 0.1 I should be able to get a BLEU score of 28.5 and a DAL score of 10.83. My second question is: could you please tell me what I need to do to get similar results?
*bug-fix: The only thing I did here was to replace `var_loss = self.latency_avg_weight * expected_delays_var` with `var_loss = self.latency_var_weight * expected_delays_var`.
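To make the effect of the fix concrete, here is a minimal standalone sketch (my own hypothetical re-implementation, not the actual fairseq code) of how the two regularization terms should be weighted by independent coefficients:

```python
# Hypothetical standalone sketch of the intended weighting: the average
# latency term and the variance (head divergence) term each get their
# own coefficient, instead of both reusing latency_avg_weight.
def latency_regularizer(expected_delays_avg, expected_delays_var,
                        latency_avg_weight, latency_var_weight):
    avg_loss = latency_avg_weight * expected_delays_avg
    # Before the fix this line used latency_avg_weight, so the variance
    # term vanished whenever latency_avg_weight was 0.0 (as for MMA-Hard).
    var_loss = latency_var_weight * expected_delays_var
    return avg_loss + var_loss

# With MMA-Hard's latency_avg_weight = 0.0, the variance term still applies:
print(latency_regularizer(2.0, 3.0, latency_avg_weight=0.0,
                          latency_var_weight=0.1) > 0.0)
```

With the buggy weighting, the same call would return exactly 0.0, i.e. no regularization at all.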