Closed: Zrachel closed this issue 6 years ago.
Yep, I'm seeing the same thing. I'll investigate a bit and get back to you shortly.
I dug into it a bit. It seems lr=2.5 is fragile and sensitive to the random seed. We also changed the way we seed the RNG (e.g., 104cead16ef010465228635158ae02b44b2e8210; 5ef59abd1fb2cde1615d316ecc5185ee7b9ccfc7), which means that the default seed of 1 no longer produces the same results as before, and instead produces the "exploding" behavior you observed.
Usually we try to find hyper-parameters that work well across several random seeds, but in this case, either because of changes in the code or luck when originally tuning the lr, this configuration seems to be quite unstable. You can try lr=1.25, which should be more stable and give comparable results, although I'm not sure why your lr=1.25 run seems to plateau around BLEU=34. You can also try several seeds with lr=2.5 until one makes it past ~10k updates (e.g., I tried seed=10 and it worked).
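The seed sensitivity described above can be illustrated with a minimal sketch. This uses Python's stdlib random module as a stand-in for the framework RNG (the actual fairseq seeding also touches torch and numpy):

```python
import random

def run_with_seed(seed, n=3):
    """Stand-in for a training run: the same seed always yields the
    same stream of random numbers, a different seed a different one."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

# Identical seeds reproduce the run exactly...
assert run_with_seed(1) == run_with_seed(1)
# ...while a different seed diverges. Likewise, a change to the
# seeding scheme itself means the same nominal seed (here, 1) no
# longer reproduces results from before the change.
assert run_with_seed(1) != run_with_seed(10)
```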
Hi @Zrachel
I still don't understand why the dictionary files you shared with us are encoded in ISO-8859-1; that might be a clue to why you have such a low BLEU score. So, let me ask a few questions:
1) How do you measure the BLEU score? Do you use our code or something else?
2) What does echo $LANG report on your system? If it is not "en_US.UTF-8", can you try export LANG="en_US.UTF-8" and re-do everything, including data preparation?
3) Did you move your data between linux and windows?
4) Did you mix data with the Lua Torch version of fairseq?
5) What is apply_bpe_fix.py? How is it different from apply_bpe.py?
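One way to check whether a dictionary file is UTF-8 or ISO-8859-1 is to attempt a strict UTF-8 decode. This is a sketch; the helper name detect_encoding is hypothetical and not part of fairseq, and note that a pure-ASCII file is valid in both encodings:

```python
def detect_encoding(path):
    """Return 'utf-8' if the file decodes cleanly as UTF-8, otherwise
    assume 'iso-8859-1' (Latin-1 accepts any byte sequence, so it can
    never be ruled out by decoding alone)."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"
```

For example, a French word like "détourné" saved in ISO-8859-1 contains the raw byte 0xE9, which is not valid UTF-8, so the helper flags the file.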
In theory, you don't even need to train for that many epochs to see that something is wrong. After epoch 3, you should have a solid 37-38 BLEU, regardless of the learning rate (unless it's really small or too big). 1.25 is a pretty good learning rate, so there must be some other problem.
Can you also try training on IWSLT and see if you can reach BLEU > 31 with 8 GPUs and max-tokens 1000? (Using the prepare script we provided in data.)
Another thing: maybe try WMT14 En2De and see if you can achieve BLEU > 25?
The warning you see, "Warning! 1 samples are either too short or too long and will be ignored, sample ids=[28743556]", is also very suspicious, but I'm running out of ideas regarding this one.
Thank you @myleott and @edunov .
On IWSLT: I got BLEU=30.41 with 4 GPUs.
python PyFairseq/train.py data/output --save-dir local/train/model -s de -t en --arch fconv_iwslt_de_en --max-tokens 1000 --dropout 0.2 --lr 0.25 --clip-norm 0.1 --momentum 0.99
Namespace(arch='fconv_iwslt_de_en', clip_norm=0.1, data='data/output', decoder_attention='True', decoder_embed_dim=256, decoder_layers='[(256, 3)] * 3', decoder_out_embed_dim=256, dropout=0.2, encoder_embed_dim=256, encoder_layers='[(256, 3)] * 4', force_anneal=0, label_smoothing=0, log_interval=1000, lr=0.25, lrshrink=0.1, max_epoch=0, max_positions=1024, max_tokens=1000, min_lr=1e-05, model='fconv', momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='local/train/model', save_interval=-1, seed=1, source_lang='de', target_lang='en', test_subset='test', train_subset='train', valid_subset='valid', weight_decay=0.0, workers=1)
| [de] dictionary: 21577 types
| [en] dictionary: 16051 types
| data/output train 160215 examples
| data/output valid 7282 examples
| data/output test 6750 examples
| using 4 GPUs (with max tokens per GPU = 1000)
On WMT14 en-de: yes, I got BLEU=25.35 on the test set with 8 GPUs.
Great, it seems like you have reasonably good results on the two other datasets. There are ways to push these numbers up, but for the default setup, this is what we expect to see.
For En2Fr, on the contrary, the results are bad... Once you have your first 3 epochs trained with LANG set to en_US.UTF-8, can you please try to generate, and if the BLEU score comes in below 37, paste a sample output of generate.py here? Also, please paste your train and valid losses after each epoch. Hopefully we can deduce why it is not working.
Thank you. I'm working on it.
Hello @edunov, please take a look at my last reply in issue https://github.com/facebookresearch/fairseq-py/issues/41 before reading the results below.
I have changed the encoding to en_US.UTF-8, and use with open(f, 'r', encoding='utf-8') as fd: in dictionary.py. After training one epoch, I get the following result:
(result of generate.py:)
Generate test with beam=5: BLEU4 = 33.19, 60.1/38.7/27.1/19.2 (BP=1.000, ratio=0.987, syslen=95426, reflen=94221)
(result of score.py:)
checkpoint1.pt/test.bleu:BLEU4 = 30.20, 59.5/36.0/23.9/16.2 (BP=1.000, ratio=0.975, syslen=83253, reflen=81194)
These results still look worse than expected. Here is a sample generated by checkpoint1.pt:
S-39 cr@@ anes arrived on the site just after 10@@ am , and traffic on the main road was diver@@ ted after@@ wards .
T-39 des gr@@ ues sont arriv@@ ées sur place peu après 10 heures , et la circulation sur la nationale a été détour@@ née dans la fou@@ lée .
H-39 -0.34878697991371155 Les gr@@ ues arriv@@ èrent sur le site juste après 10@@ h , et le trafic sur la route principale a été détour@@ né par la suite .
@Zrachel I checked the same sentence on my side after the first epoch (remember, I'm using --remove-bpe in generate.py, so my sentences have the BPE encoding removed):
S-355 Cranes arrived on the site just after 10am , and traffic on the main road was diverted afterwards .
T-355 Des grues sont arrivées sur place peu après 10 heures , et la circulation sur la nationale a été détournée dans la foulée .
H-355 -0.4027450680732727 Les grues sont arrivées sur le site juste après 10h , et le trafic sur la route principale a été détourné par la suite .
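For reference, the effect of --remove-bpe can be sketched as a simple string operation. This is a minimal approximation, assuming the default "@@ " continuation marker used in these samples:

```python
def remove_bpe(sentence, bpe_symbol="@@ "):
    """Join BPE subword units by deleting the continuation marker,
    e.g. 'cr@@ anes' -> 'cranes'. The trailing space is appended and
    stripped again so a marker at the end of the sentence is handled."""
    return (sentence + " ").replace(bpe_symbol, "").rstrip()

# remove_bpe("cr@@ anes arrived just after 10@@ am")
# -> "cranes arrived just after 10am"
```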
One thing that stands out is that everything in your test set seems to be lower-cased, while your training set clearly has capital letters (e.g., your hypothesis starts with "Les"). We do not use lower-casing in our training at all, so that might be the reason for the difference you observe.
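A quick way to spot this kind of train/test casing mismatch before training is to compare how much of each corpus is fully lower-cased. The helper below is hypothetical (not part of fairseq), just a sanity-check sketch:

```python
def fraction_lowercased(path, max_lines=1000):
    """Fraction of the first max_lines lines containing no upper-case
    characters. A value near 1.0 on the test side but not on the train
    side indicates a preprocessing mismatch like the one above."""
    lower = total = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            total += 1
            if not any(c.isupper() for c in line):
                lower += 1
    return lower / max(total, 1)
```

Running it on train.fr and test.fr (file names assumed) and comparing the two numbers would have surfaced the mismatch immediately.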
Also, can you please report the training and validation loss that you observe after the first epoch? (You can find them in the training log.)
My fault. At some point I removed the lowercase operation for the training data, but forgot to remove it from the test data. Thank you very much.
Results on the corrected test set:
checkpoint1.pt/test.bleu:BLEU4 = 35.73, 64.2/41.8/29.2/20.8 (BP=1.000, ratio=0.991, syslen=81952, reflen=81194)
checkpoint2.pt/test.bleu:BLEU4 = 37.20, 65.2/43.3/30.7/22.2 (BP=0.999, ratio=1.001, syslen=81142, reflen=81194)
Training and validation loss:
...
| epoch 001 | train loss 2.24 | train ppl 4.73 | s/checkpoint 77473 | words/s 16713 | words/batch 31228 | bsz 856 | lr 1.25 | clip 18% | gnorm 0.0936811
| epoch 001 | valid on 'valid' subset | valid loss 1.75 | valid ppl 3.37
...
| epoch 002 | train loss 1.76 | train ppl 3.38 | s/checkpoint 78280 | words/s 16541 | words/batch 31228 | bsz 856 | lr 1.25 | clip 0% | gnorm 0.0558167
| epoch 002 | valid on 'valid' subset | valid loss 1.63 | valid ppl 3.11
Hi @Zrachel,
If you could please upload your fixed dataset that would help me a lot. We are currently using the dataset you uploaded earlier and we are running into the same problem.
Hi @dagarcia-nvidia , here: https://drive.google.com/open?id=1bFMhfhhMhhedPAPo0TDWBfVga8dFuTE1
Thank you @Zrachel! That seems to solve the problem. Much appreciated!! :)
Following the latest code, with the training parameters specified by @edunov in https://github.com/facebookresearch/fairseq-py/issues/41 and the Readme.md of Pretrained-models, I got exploding updates on WMT14 en-fr. Changing only the learning rate to 1.25 does not trigger the exploding problem, but BLEU increases very slowly.
My question is: are the results I got within expectation? Should I wait for the result of lr=1.25, or is there something wrong with my data/config?