ereday opened this issue 2 years ago
I would greatly appreciate it if you could let me know what needs to be done to perform a proper training, @xutaima. Thanks in advance.
Looking at old issues related to SiMT, I see that you (@jmp84) may also be able to help with this issue. I would greatly appreciate it if you could let me know what needs to be done to perform a proper training.
@ereday ... I am trying to reproduce the results for "infinite_lookback" attention. Did you get any results with this? My BLEU scores are very low. I have fixed the bug you mentioned above, but there is still no improvement. Thanks in advance.
Hello @sathishreddy, have you found a solution?
🐛 Bug
@xutaima Following other active/closed issues related to this model, I understand that I should contact you regarding this issue. I am unable to train Multihead Monotonic Attention models (I tried both IL and Hard). There are two main issues here. First, the training command given in the official readme seems wrong: it does not match the paper. Searching through the old active/closed issues, I ended up using the following command to train the MMA-Hard variant:
Note that there are several important differences between the command above and the readme:

- `--update-freq` should be set based on the number of GPUs used during training. In my case, I use 8 of them, so I set `--update-freq` to 8.
- `--max-source-positions`, `--max-target-positions`, and `--max-update` also differ.

So, my first question is: could you share the complete and correct training command, please?
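As a rough sanity check on the `--update-freq` choice (a sketch with hypothetical numbers, not values from the paper): in fairseq the effective batch size per update scales as max-tokens per GPU times the number of GPUs times the update frequency, so the product should stay constant when the GPU count changes.

```python
# Rough sketch (hypothetical numbers): fairseq's effective batch size
# per optimizer update is max-tokens * num_gpus * update-freq, so
# --update-freq must compensate when the number of GPUs changes.
def effective_tokens_per_update(max_tokens, num_gpus, update_freq):
    return max_tokens * num_gpus * update_freq

# An 8-GPU run with --update-freq 1 matches a 1-GPU run with --update-freq 8:
print(effective_tokens_per_update(3584, 8, 1) ==
      effective_tokens_per_update(3584, 1, 8))
```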
More importantly, I'd like to report a very critical bug in the `label_smoothed_cross_entropy_latency_augmented` loss function. In the current implementation [LINK], both the weighted-average latency loss and the head divergence loss are weighted by the same coefficient, `latency_avg_weight`, which is set to 0.0 for the MMA-Hard model. This means no latency regularisation term is used during training. I trained two models locally using the above command: one before and one after fixing the bug in the loss function.
The model trained with the buggy loss function ended up with an acceptable BLEU score on WMT15 de-en (still 1.5 points lower than what you report in the paper for some reason, but that is OK for now). However, because it was trained without a latency regularization term, the model has terrible latency scores:
The model trained after the bug-fix* has much better latency scores. However, its BLEU score dropped a lot:
As you can see, I tried many different lambda values, and none of them worked well.
According to Table 6 in your paper, with lambda set to 0.1 I should be able to get a BLEU score of 28.5 and a DAL score of 10.83. My second question is: could you please tell me what I need to do to get similar results?
*bug-fix: The only thing I did here was to replace `var_loss = self.latency_avg_weight * expected_delays_var` with `var_loss = self.latency_var_weight * expected_delays_var`.
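To make the effect of the fix concrete, here is a minimal standalone sketch (my own hypothetical re-implementation, not the actual fairseq code) of how the two regularization terms should be weighted by independent coefficients:

```python
# Hypothetical standalone sketch of the intended weighting: the average
# latency term and the variance (head divergence) term each get their
# own coefficient, instead of both reusing latency_avg_weight.
def latency_regularizer(expected_delays_avg, expected_delays_var,
                        latency_avg_weight, latency_var_weight):
    avg_loss = latency_avg_weight * expected_delays_avg
    # Before the fix this line used latency_avg_weight, so the variance
    # term vanished whenever latency_avg_weight was 0.0 (as for MMA-Hard).
    var_loss = latency_var_weight * expected_delays_var
    return avg_loss + var_loss

# With MMA-Hard's latency_avg_weight = 0.0, the variance term still applies:
print(latency_regularizer(2.0, 3.0, latency_avg_weight=0.0,
                          latency_var_weight=0.1) > 0.0)
```

With the buggy weighting, the same call would return exactly 0.0, i.e. no regularization at all.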