Why The Test Result of Transformer NMT Task with 4 GPUs Is Worse Than What Is Reported in Readme

yaoyiran commented 5 years ago

In the readme file, 4 GPUs can achieve a BLEU of 28.35 and even 28.67 when training more epochs.

GPU count	Mixed precision BLEU	fp32 BLEU	Mixed precision training time	fp32 training time
8	28.69	28.43	446 min	1896 min
4	28.35	28.31	834 min	3733 min

GPU count	Precision	BLEU score	Epochs to train	Training time
4	fp16	28.67	74	1925 min
4	fp32	28.40	47	5478 min

However, I have run the code with 4 GPUs and I did not modify the code at all but The Best Result I got is 27.63 on my "checkpoint_best.pt" which is epoch 19 in my case. I have run totally 80 epochs and the best BLEU over all those epochs is 28.13 which is not considered as the "checkpoint_best.pt" in the validation process.

I used the following command line to train the model:

nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/translation/train.py /workspace/data-bin/wmt14_en_de_joined_dict \ --arch transformer_wmt_en_de_big_t2t \ --share-all-embeddings \ --optimizer adam \ --adam-betas '(0.9, 0.997)' \ --adam-eps "1e-9" \ --clip-norm 0.0 \ --lr-scheduler inverse_sqrt \ --warmup-init-lr 0.0 \ --update-freq 2 \ --warmup-updates 8000 \ --lr 0.0006 \ --min-lr 0.0 \ --dropout 0.1 \ --weight-decay 0.0 \ --criterion label_smoothed_cross_entropy \ --label-smoothing 0.1 \ --max-tokens 5120 \ --seed 1 \ --max-epoch 80 \ --ignore-case \ --fp16 \ --save-dir /workspace/checkpoints \ --distributed-init-method env:// > train.nohup.out &

I also tried different warmup-updates and lr, and the results are similar. The result I got is like:

Test Checkpoint1 | Translated 3003 sentences (84994 tokens) in 25.2s (119.35 sentences/s, 3377.84 tokens/s) | Generate test with beam=4: BLEU4 = 18.11, 50.2/23.5/12.7/7.2 (BP=1.000, ratio=1.041, syslen=67147, reflen=64512) Test Checkpoint2 | Translated 3003 sentences (87704 tokens) in 27.5s (109.17 sentences/s, 3188.43 tokens/s) | Generate test with beam=4: BLEU4 = 21.26, 52.5/26.7/15.5/9.4 (BP=1.000, ratio=1.061, syslen=68450, reflen=64512) Test Checkpoint3 | Translated 3003 sentences (86611 tokens) in 25.8s (116.61 sentences/s, 3363.17 tokens/s) | Generate test with beam=4: BLEU4 = 23.91, 55.5/29.5/17.8/11.2 (BP=1.000, ratio=1.040, syslen=67079, reflen=64512) Test Checkpoint4 | Translated 3003 sentences (86518 tokens) in 25.8s (116.61 sentences/s, 3359.54 tokens/s) | Generate test with beam=4: BLEU4 = 25.26, 56.7/30.9/19.0/12.3 (BP=1.000, ratio=1.035, syslen=66758, reflen=64512) Test Checkpoint5 | Translated 3003 sentences (86768 tokens) in 25.7s (116.96 sentences/s, 3379.47 tokens/s) | Generate test with beam=4: BLEU4 = 25.63, 56.8/31.2/19.4/12.5 (BP=1.000, ratio=1.034, syslen=66698, reflen=64512) Test Checkpoint6 | Translated 3003 sentences (87220 tokens) in 25.8s (116.21 sentences/s, 3375.30 tokens/s) | Generate test with beam=4: BLEU4 = 25.98, 56.9/31.5/19.8/12.9 (BP=1.000, ratio=1.042, syslen=67205, reflen=64512) Test Checkpoint7 | Translated 3003 sentences (87715 tokens) in 25.9s (115.80 sentences/s, 3382.54 tokens/s) | Generate test with beam=4: BLEU4 = 26.24, 57.2/31.8/20.0/13.0 (BP=1.000, ratio=1.045, syslen=67413, reflen=64512) Test Checkpoint8 | Translated 3003 sentences (87808 tokens) in 26.8s (111.88 sentences/s, 3271.39 tokens/s) | Generate test with beam=4: BLEU4 = 26.82, 57.6/32.3/20.5/13.6 (BP=1.000, ratio=1.045, syslen=67444, reflen=64512) Test Checkpoint9 | Translated 3003 sentences (87394 tokens) in 25.6s (117.26 sentences/s, 3412.38 tokens/s) | Generate test with beam=4: BLEU4 = 26.63, 57.8/32.2/20.3/13.3 (BP=1.000, ratio=1.039, syslen=67033, reflen=64512) Test Checkpoint10 | Translated 3003 sentences (86825 tokens) in 25.8s (116.31 sentences/s, 3362.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.10, 58.1/32.7/20.7/13.7 (BP=1.000, ratio=1.031, syslen=66541, reflen=64512) Test Checkpoint11 | Translated 3003 sentences (86850 tokens) in 25.9s (116.11 sentences/s, 3358.03 tokens/s) | Generate test with beam=4: BLEU4 = 27.29, 58.1/32.8/20.9/13.9 (BP=1.000, ratio=1.032, syslen=66563, reflen=64512) Test Checkpoint12 | Translated 3003 sentences (87137 tokens) in 26.2s (114.74 sentences/s, 3329.31 tokens/s) | Generate test with beam=4: BLEU4 = 27.28, 58.2/32.9/20.9/13.8 (BP=1.000, ratio=1.035, syslen=66787, reflen=64512) Test Checkpoint13 | Translated 3003 sentences (86810 tokens) in 25.6s (117.41 sentences/s, 3393.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.26, 58.3/32.9/20.9/13.8 (BP=1.000, ratio=1.031, syslen=66500, reflen=64512) Test Checkpoint14 | Translated 3003 sentences (87359 tokens) in 25.8s (116.30 sentences/s, 3383.15 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.3/33.2/21.3/14.3 (BP=1.000, ratio=1.036, syslen=66830, reflen=64512) Test Checkpoint15 | Translated 3003 sentences (87415 tokens) in 26.3s (114.33 sentences/s, 3327.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.37, 58.1/32.9/21.0/14.0 (BP=1.000, ratio=1.038, syslen=66951, reflen=64512) Test Checkpoint16 | Translated 3003 sentences (87332 tokens) in 26.7s (112.51 sentences/s, 3272.10 tokens/s) | Generate test with beam=4: BLEU4 = 27.33, 58.1/32.9/21.0/13.9 (BP=1.000, ratio=1.039, syslen=66998, reflen=64512) Test Checkpoint17 | Translated 3003 sentences (86721 tokens) in 25.9s (116.06 sentences/s, 3351.62 tokens/s) | Generate test with beam=4: BLEU4 = 27.32, 58.4/33.0/20.9/13.8 (BP=1.000, ratio=1.029, syslen=66385, reflen=64512) Test Checkpoint18 | Translated 3003 sentences (87388 tokens) in 26.2s (114.71 sentences/s, 3338.08 tokens/s) | Generate test with beam=4: BLEU4 = 27.57, 58.3/33.1/21.2/14.2 (BP=1.000, ratio=1.038, syslen=66956, reflen=64512) Test Checkpoint19 | Translated 3003 sentences (86919 tokens) in 25.8s (116.28 sentences/s, 3365.50 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.6/33.3/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66642, reflen=64512) Test Checkpoint20 | Translated 3003 sentences (87485 tokens) in 26.1s (115.24 sentences/s, 3357.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.48, 58.1/33.0/21.1/14.1 (BP=1.000, ratio=1.037, syslen=66924, reflen=64512) Test Checkpoint21 | Translated 3003 sentences (86993 tokens) in 26.3s (114.07 sentences/s, 3304.46 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.5/33.3/21.4/14.3 (BP=1.000, ratio=1.032, syslen=66564, reflen=64512) Test Checkpoint22 | Translated 3003 sentences (87084 tokens) in 25.4s (118.07 sentences/s, 3424.04 tokens/s) | Generate test with beam=4: BLEU4 = 27.87, 58.6/33.3/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66595, reflen=64512) Test Checkpoint23 | Translated 3003 sentences (87013 tokens) in 26.4s (113.92 sentences/s, 3300.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.59, 58.4/33.2/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66626, reflen=64512) Test Checkpoint24 | Translated 3003 sentences (86741 tokens) in 26.0s (115.49 sentences/s, 3335.84 tokens/s) | Generate test with beam=4: BLEU4 = 27.98, 58.7/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66379, reflen=64512) Test Checkpoint25 | Translated 3003 sentences (86884 tokens) in 25.4s (118.05 sentences/s, 3415.42 tokens/s) | Generate test with beam=4: BLEU4 = 27.94, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66392, reflen=64512) Test Checkpoint26 | Translated 3003 sentences (86840 tokens) in 26.4s (113.68 sentences/s, 3287.46 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.7/33.5/21.5/14.4 (BP=1.000, ratio=1.028, syslen=66344, reflen=64512) Test Checkpoint27 | Translated 3003 sentences (87050 tokens) in 26.2s (114.45 sentences/s, 3317.73 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.030, syslen=66451, reflen=64512) Test Checkpoint28 | Translated 3003 sentences (86981 tokens) in 25.8s (116.40 sentences/s, 3371.53 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.7/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66488, reflen=64512) Test Checkpoint29 | Translated 3003 sentences (86219 tokens) in 25.6s (117.33 sentences/s, 3368.59 tokens/s) | Generate test with beam=4: BLEU4 = 27.82, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.022, syslen=65941, reflen=64512) Test Checkpoint30 | Translated 3003 sentences (86879 tokens) in 26.9s (111.61 sentences/s, 3229.04 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512) Test Checkpoint31 | Translated 3003 sentences (87082 tokens) in 26.6s (112.83 sentences/s, 3271.95 tokens/s) | Generate test with beam=4: BLEU4 = 28.00, 58.8/33.6/21.6/14.4 (BP=1.000, ratio=1.032, syslen=66570, reflen=64512) Test Checkpoint32 | Translated 3003 sentences (86677 tokens) in 26.6s (112.93 sentences/s, 3259.43 tokens/s) | Generate test with beam=4: BLEU4 = 27.98, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.028, syslen=66289, reflen=64512) Test Checkpoint33 | Translated 3003 sentences (87034 tokens) in 26.2s (114.54 sentences/s, 3319.61 tokens/s) | Generate test with beam=4: BLEU4 = 28.10, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.032, syslen=66553, reflen=64512) Test Checkpoint34 | Translated 3003 sentences (87064 tokens) in 26.3s (114.28 sentences/s, 3313.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.4/33.3/21.6/14.4 (BP=1.000, ratio=1.031, syslen=66534, reflen=64512) Test Checkpoint35 | Translated 3003 sentences (86818 tokens) in 26.6s (112.86 sentences/s, 3262.78 tokens/s) | Generate test with beam=4: BLEU4 = 28.11, 58.9/33.7/21.7/14.5 (BP=1.000, ratio=1.028, syslen=66336, reflen=64512) Test Checkpoint36 | Translated 3003 sentences (87037 tokens) in 25.9s (115.89 sentences/s, 3358.98 tokens/s) | Generate test with beam=4: BLEU4 = 28.18, 58.8/33.6/21.8/14.6 (BP=1.000, ratio=1.031, syslen=66483, reflen=64512) Test Checkpoint37 | Translated 3003 sentences (86740 tokens) in 25.7s (116.91 sentences/s, 3376.92 tokens/s) | Generate test with beam=4: BLEU4 = 28.19, 58.9/33.7/21.8/14.6 (BP=1.000, ratio=1.026, syslen=66197, reflen=64512) Test Checkpoint38 | Translated 3003 sentences (87084 tokens) in 26.1s (115.05 sentences/s, 3336.24 tokens/s) | Generate test with beam=4: BLEU4 = 28.01, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.032, syslen=66551, reflen=64512) Test Checkpoint39 | Translated 3003 sentences (86972 tokens) in 27.7s (108.47 sentences/s, 3141.58 tokens/s) | Generate test with beam=4: BLEU4 = 28.10, 58.7/33.5/21.7/14.6 (BP=1.000, ratio=1.030, syslen=66456, reflen=64512) Test Checkpoint40 | Translated 3003 sentences (86717 tokens) in 25.7s (116.94 sentences/s, 3376.78 tokens/s) | Generate test with beam=4: BLEU4 = 27.81, 58.7/33.4/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512) Test Checkpoint41 | Translated 3003 sentences (86542 tokens) in 26.0s (115.52 sentences/s, 3329.06 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.9/33.3/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66127, reflen=64512) Test Checkpoint42 | Translated 3003 sentences (86841 tokens) in 27.1s (110.96 sentences/s, 3208.64 tokens/s) | Generate test with beam=4: BLEU4 = 27.99, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66329, reflen=64512) Test Checkpoint43 | Translated 3003 sentences (86986 tokens) in 26.8s (111.92 sentences/s, 3241.95 tokens/s) | Generate test with beam=4: BLEU4 = 27.81, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512) Test Checkpoint44 | Translated 3003 sentences (86691 tokens) in 25.6s (117.24 sentences/s, 3384.53 tokens/s) | Generate test with beam=4: BLEU4 = 28.09, 58.8/33.6/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66162, reflen=64512) Test Checkpoint45 | Translated 3003 sentences (86845 tokens) in 26.5s (113.44 sentences/s, 3280.52 tokens/s) | Generate test with beam=4: BLEU4 = 28.00, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512) Test Checkpoint46 | Translated 3003 sentences (86280 tokens) in 25.7s (116.75 sentences/s, 3354.46 tokens/s) | Generate test with beam=4: BLEU4 = 28.13, 59.0/33.6/21.7/14.6 (BP=1.000, ratio=1.021, syslen=65860, reflen=64512) Test Checkpoint47 | Translated 3003 sentences (86857 tokens) in 26.4s (113.64 sentences/s, 3286.92 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.029, syslen=66402, reflen=64512) Test Checkpoint48 | Translated 3003 sentences (87087 tokens) in 26.0s (115.65 sentences/s, 3353.93 tokens/s) | Generate test with beam=4: BLEU4 = 27.68, 58.4/33.2/21.3/14.2 (BP=1.000, ratio=1.032, syslen=66576, reflen=64512) Test Checkpoint49 | Translated 3003 sentences (86627 tokens) in 25.5s (117.97 sentences/s, 3402.95 tokens/s) | Generate test with beam=4: BLEU4 = 28.02, 59.0/33.6/21.6/14.4 (BP=1.000, ratio=1.026, syslen=66208, reflen=64512) Test Checkpoint50 | Translated 3003 sentences (86529 tokens) in 25.9s (116.09 sentences/s, 3345.07 tokens/s) | Generate test with beam=4: BLEU4 = 27.96, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66049, reflen=64512) Test Checkpoint51 | Translated 3003 sentences (87095 tokens) in 26.2s (114.50 sentences/s, 3320.73 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.6/33.4/21.4/14.3 (BP=1.000, ratio=1.030, syslen=66471, reflen=64512) Test Checkpoint52 | Translated 3003 sentences (87160 tokens) in 27.2s (110.54 sentences/s, 3208.27 tokens/s) | Generate test with beam=4: BLEU4 = 27.89, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66559, reflen=64512) Test Checkpoint53 | Translated 3003 sentences (86909 tokens) in 26.1s (114.96 sentences/s, 3326.93 tokens/s) | Generate test with beam=4: BLEU4 = 27.90, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512) Test Checkpoint54 | Translated 3003 sentences (86785 tokens) in 26.1s (114.94 sentences/s, 3321.61 tokens/s) | Generate test with beam=4: BLEU4 = 28.05, 58.8/33.6/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66308, reflen=64512) Test Checkpoint55 | Translated 3003 sentences (86914 tokens) in 25.9s (115.95 sentences/s, 3355.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.76, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.029, syslen=66376, reflen=64512) Test Checkpoint56 | Translated 3003 sentences (86775 tokens) in 26.5s (113.27 sentences/s, 3273.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.75, 58.5/33.2/21.4/14.3 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512) Test Checkpoint57 | Translated 3003 sentences (86522 tokens) in 26.3s (114.39 sentences/s, 3295.88 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.9/33.4/21.5/14.3 (BP=1.000, ratio=1.024, syslen=66052, reflen=64512) Test Checkpoint58 | Translated 3003 sentences (86269 tokens) in 26.1s (114.94 sentences/s, 3301.85 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.7/33.3/21.4/14.2 (BP=1.000, ratio=1.021, syslen=65893, reflen=64512) Test Checkpoint59 | Translated 3003 sentences (86738 tokens) in 25.9s (115.78 sentences/s, 3344.27 tokens/s) | Generate test with beam=4: BLEU4 = 27.96, 58.5/33.4/21.6/14.5 (BP=1.000, ratio=1.029, syslen=66378, reflen=64512) Test Checkpoint60 | Translated 3003 sentences (86566 tokens) in 25.7s (116.92 sentences/s, 3370.48 tokens/s) | Generate test with beam=4: BLEU4 = 27.85, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.025, syslen=66151, reflen=64512) Test Checkpoint61 | Translated 3003 sentences (86785 tokens) in 25.3s (118.91 sentences/s, 3436.47 tokens/s) | Generate test with beam=4: BLEU4 = 27.74, 58.7/33.3/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66291, reflen=64512) Test Checkpoint62 | Translated 3003 sentences (86261 tokens) in 25.7s (116.79 sentences/s, 3354.79 tokens/s) | Generate test with beam=4: BLEU4 = 27.86, 58.8/33.4/21.5/14.3 (BP=1.000, ratio=1.021, syslen=65898, reflen=64512) Test Checkpoint63 | Translated 3003 sentences (86569 tokens) in 25.1s (119.58 sentences/s, 3447.32 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.025, syslen=66155, reflen=64512) Test Checkpoint64 | Translated 3003 sentences (86583 tokens) in 25.8s (116.47 sentences/s, 3357.96 tokens/s) | Generate test with beam=4: BLEU4 = 27.59, 58.5/33.2/21.2/14.1 (BP=1.000, ratio=1.025, syslen=66146, reflen=64512) Test Checkpoint65 | Translated 3003 sentences (86707 tokens) in 26.2s (114.76 sentences/s, 3313.64 tokens/s) | Generate test with beam=4: BLEU4 = 27.78, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66294, reflen=64512) Test Checkpoint66 | Translated 3003 sentences (86478 tokens) in 26.0s (115.55 sentences/s, 3327.54 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66114, reflen=64512) Test Checkpoint67 | Translated 3003 sentences (86564 tokens) in 25.8s (116.40 sentences/s, 3355.20 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.026, syslen=66200, reflen=64512) Test Checkpoint68 | Translated 3003 sentences (86548 tokens) in 26.2s (114.58 sentences/s, 3302.20 tokens/s) | Generate test with beam=4: BLEU4 = 28.08, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.024, syslen=66041, reflen=64512) Test Checkpoint69 | Translated 3003 sentences (86580 tokens) in 25.9s (116.08 sentences/s, 3346.72 tokens/s) | Generate test with beam=4: BLEU4 = 28.13, 58.8/33.7/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66178, reflen=64512) Test Checkpoint70 | Translated 3003 sentences (86448 tokens) in 26.1s (115.01 sentences/s, 3310.94 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.023, syslen=65998, reflen=64512) Test Checkpoint71 | Translated 3003 sentences (86832 tokens) in 26.0s (115.69 sentences/s, 3345.26 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66355, reflen=64512) Test Checkpoint72 | Translated 3003 sentences (86550 tokens) in 25.6s (117.18 sentences/s, 3377.25 tokens/s) | Generate test with beam=4: BLEU4 = 27.95, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66092, reflen=64512) Test Checkpoint73 | Translated 3003 sentences (86415 tokens) in 25.4s (118.17 sentences/s, 3400.41 tokens/s) | Generate test with beam=4: BLEU4 = 27.84, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.023, syslen=65990, reflen=64512) Test Checkpoint74 | Translated 3003 sentences (86251 tokens) in 26.2s (114.65 sentences/s, 3292.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.97, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.021, syslen=65889, reflen=64512) Test Checkpoint75 | Translated 3003 sentences (86418 tokens) in 26.1s (115.03 sentences/s, 3310.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.72, 58.6/33.2/21.3/14.2 (BP=1.000, ratio=1.023, syslen=65971, reflen=64512) Test Checkpoint76 | Translated 3003 sentences (86474 tokens) in 25.9s (116.04 sentences/s, 3341.50 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.6/33.2/21.2/14.1 (BP=1.000, ratio=1.023, syslen=66025, reflen=64512) Test Checkpoint77 | Translated 3003 sentences (86100 tokens) in 25.6s (117.20 sentences/s, 3360.35 tokens/s) | Generate test with beam=4: BLEU4 = 28.11, 59.1/33.7/21.7/14.5 (BP=1.000, ratio=1.018, syslen=65695, reflen=64512) Test Checkpoint78 | Translated 3003 sentences (86497 tokens) in 26.2s (114.53 sentences/s, 3298.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.7/33.4/21.4/14.3 (BP=1.000, ratio=1.024, syslen=66073, reflen=64512) Test Checkpoint79 | Translated 3003 sentences (86905 tokens) in 26.3s (114.22 sentences/s, 3305.35 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.5/33.2/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66327, reflen=64512) Test Checkpoint80 | Translated 3003 sentences (86654 tokens) in 26.3s (114.36 sentences/s, 3300.06 tokens/s) | Generate test with beam=4: BLEU4 = 27.65, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.026, syslen=66219, reflen=64512)

So, why I am not able to achieve the results as reported in the readme file? Could you tell me the command line that you use to run transformer on 4 GPUs?

Another question is that the "Attention is all you need" paper uses 0.1 as the initial learning rate whereas 0.0006 is used here. Why there is such a large difference on learning rate?

jbaczek commented 5 years ago

Regarding large learning rate difference, it's just because we have run extensive hyper parameter search for this particular implementation and it turned out that this LR works best. As it comes to results, they are reported for older version of the container, though it shouldn't be that much of a difference. I'll double check that.

yaoyiran commented 5 years ago

Thanks for your reply! The test results I pasted above is derived using "generate.py". I also tried 8 GPUs and got similar results where the highest score on the test set is around 28.1. On the validation set (384 sentences) sometimes I got 28.5 when using online evaluation, but it is the BLEU on validation set (384 sentences) rather than the test set (3003 sentences) at all.

I also want to point out another issue that the validation part seems not able to choose the best model. For example, in the case I posted above, the validation chooses epoch 19 (BLEU4 = 27.63) which is a very ordinary one.

jbaczek commented 5 years ago

This is happening because we choose the model with the lowest validation loss as the best model, not the one with the highest BLEU score. We can of course choose the model with the highest score, but then our test set becomes just another dev set. Online evaluation distributes test set between all GPUs for faster computation. To validate that everything is ok use print(something, args.distributed_rank, force=True). That line will print something from all the processes, not only the first one.

yaoyiran commented 5 years ago

I have found another problem on the model and pre-processing: since scaling18 data is adopted "bash prepare-wmt14en2de.sh --scaling18", why not use "transformer_vaswani_wmt_en_de_big" but use "transformer_wmt_en_de_big_t2t" to train the model as specified in the readme file in DeepLearningExamples/PyTorch/Translation/Transformer/examples/translation/? Do you know how is the difference if we do not use scaling18 data?

jbaczek commented 5 years ago

"transformer_wmt_en_de_big_t2t" is very similar model to "transformer_vaswani_wmt_en_de_big". It just uses hyperparameters from tensor2tensor implementation. You can find all differences in fairseq/models/transformer.py file at the bottom. Regarding --scaling18 option it only alters vocabulary size. This value makes a dataset compliant to the one used in the https://arxiv.org/abs/1806.00187 paper.

yaoyiran commented 5 years ago

For the dataset, it seems that the training set is a little bit different from the attention is all you need paper: On "http://statmt.org/wmt14/translation-task.html#Download", the newscommentary data is training-parallel-nc-v9.tgz; however, training-parallel-nc-v12.tgz is adopted here. I think this may not influence the result because newscommentary data is only a tiny small part of the whole training set.

Another thing is that when --scaling18 is not used, /workspace/data-bin/wmt14_en_de_joined_dict train 3961179 examples, when --scaling18 is used, /workspace/data-bin/wmt14_en_de_joined_dict train 4575637 examples which means that without scaling18, 600000 samples are discarded. I am not sure how many samples were used in the attention is all you need implementation and if that will influence the result.

yaoyiran commented 5 years ago

Today I found that the issue above may be casued by the fact that the BLEU scores calculated by generate.py is lower than that by score() fuction in train.py. So, I tried to evaluate BLEU of both valid and test data in train.py.

My question is, for line 65 in train.py: load_dataset_splits(task, ['train', 'valid']), if we rewrite it as load_dataset_splits(task, ['train', 'valid', 'test']), will that influence the training and the test procedure? I just want to see the BLEU of the valid data and the loss of the test data.

Another question is that in the new repo, train.py uses sacrebleu.corpus_bleu() to calculate BLEU, whereas the old repo (in Jan.) uses fairseq.bleu.Scorer.score(). I think they provide similar results right? But even in the old repo where fairseq.bleu.Scorer.score() is adopted both in train.py and generate.py, the online evaluation and generate.py still produce different BLEU scores.

My last question is that when doing online evaluation, the output of BLEU results for the test set is like: Translated 768 sentences (xxx tokens) in xx.xs (xxx.xx sentences/s, xxxx.xx tokens/s) when I use 4 GPUs and Translated 384 sentences (xxx tokens) in xx.xs (xxx.xx sentences/s, xxxx.xx tokens/s) when I use 8 GPUs. So, since therer are 3003 samples in the test set, the number 768 or 384 is the number of translations generated by a single GPU (3848 = 7684 = 3072 which is near 3003). Could you verify that and also verify that printed BLEU score is the BLEU on all 3003 samples rather than only 768 or 384 samples generated by a single GPU.

Thx a lot!

jbaczek commented 5 years ago

load_dataset_splits iterates over list of splits and uses task.load_dataset which keeps intenal representation of datasets as a dictionary. The reason why test split is loaded somewhere else is that, maybe you don't want to run online evaluation, not to get biased for example. You can load it in the beginning together with the remaining splits. New scoring (Sacrebleu) is added to the scoring function for informative purposes. Stopping of a training is still performed based on fairseq.bleu.Scorer.score. The reason why we have chosen this solution is our doubts that different models implement their own scoring functions which are not equivalent. More elaborate reasoning can be found in Sacrebleu paper ( https://aclweb.org/anthology/W18-6319 ). The sacrebleu score is expected to be lower than fairseq's score because it calculates detokenized score (on whole words) whereas fairseq calculates it on tokens. (you can observe this here: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Translation/Transformer/train.py#L357 ) I've checked distribution of test examples between GPUs with this line print('prediction len rank {}'.format(args.distributed_rank), len(predictions), force=True) and I got this

prediction len rank 7 315
prediction len rank 0 384
prediction len rank 1 384
prediction len rank 2 384
prediction len rank 3 384
prediction len rank 4 384
prediction len rank 5 384
prediction len rank 6 384

As you can see it sums up to 3003. The issue that generate.py gives different results than online evaluation needs a more attention. I'll address that in couple days

jbaczek commented 5 years ago

As it comes to results of 4 GPU training: On DGX1 16G (which was underclocked for power reasons), container 19.06 and current version of the code model reached 28.6 BLEU in 45 epochs, which have taken 1481 minutes. It converged to 28.31 BLEU in 26 epochs (851.5 min). If you need log from this run I can share it with you. If you need exact timings on full-speed DGX I can rerun this test.

yaoyiran commented 5 years ago

Thx a lot! I have got similar results with online evaluation but it is still helpful if you could share your log file. Could you leave me a drop box link or your email, etc?

jbaczek commented 5 years ago

https://drive.google.com/open?id=1Yzy3WV6UjJH8ZsxUnBmgzEUCrOgldIrz This link will be available for 12 hours. Then I'll remove it.

yaoyiran commented 5 years ago

Thx, I have downloaded it.

NVIDIA / DeepLearningExamples

Why The Test Result of Transformer NMT Task with 4 GPUs Is Worse Than What Is Reported in Readme #97