facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Can you give me training settings for the 2 million, 1 million, and 0.4 million example ende, enfr, and enro datasets, respectively? #168

XiaoqingNLP closed 5 years ago

XiaoqingNLP commented 5 years ago

I am using an XLM encoder and a decoder, with a 6-layer big transformer. I have tried a larger batch size and a lower learning rate, but I did not get good results. Are there any training settings that would give me good results?

glample commented 5 years ago

The parameters given in the README should be fine. How many GPUs are you using? This can make quite a difference.

XiaoqingNLP commented 5 years ago

@glample I am using the settings in the README, and I have 8 GPUs with 16 GB of memory. I have implemented XLM in fairseq, so I can set update-freq, but I am unsure what value to use. Can you give me some advice?

I am currently using the XLM settings on 8 GPUs, and I set update-freq to 16, following the settings in Scaling NMT.
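For context on why the GPU count and update-freq matter together: in fairseq, `--update-freq` accumulates gradients over several batches before each optimizer step, so the number of tokens behind one update grows with max-tokens, the number of GPUs, and update-freq. The sketch below is only a back-of-the-envelope helper (the function name is made up for illustration). The `max_tokens=1536` and `update_freq=28` values are the ones that appear in the train.log posted later in this thread; the result is an upper bound, since batches are rarely completely full.

```python
def tokens_per_update(max_tokens: int, num_gpus: int, update_freq: int) -> int:
    """Upper bound on the tokens that contribute to one optimizer step.

    Hypothetical helper for illustration; it is not part of fairseq or XLM.
    """
    return max_tokens * num_gpus * update_freq


if __name__ == "__main__":
    # update-freq 16 on 8 GPUs, as described in the comment above
    print(tokens_per_update(1536, 8, 16))  # 196608
    # update-freq 28 on 8 GPUs, as in the train.log posted later in the thread
    print(tokens_per_update(1536, 8, 28))  # 344064
```

The `wpb` value of about 262k reported in the ende training log sits below the 344064 bound, which is what one would expect when batches are not completely filled.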

glample commented 5 years ago

Can you provide your training log? I can try to have a look. I don't have better recommendations than the parameters in the README.

XiaoqingNLP commented 5 years ago

Below are the logs for the ende and enro datasets. On ende, BLEU is close to Transformer base, while on enro BLEU is only 4%. Compared with the Transformer-base model, the training loss is lower and the validation loss is higher.

This is the loss on the valid subset:

0.log.trans.080818.deen.r7.g0,1,2,3.xlm_translation.lll
| epoch 001 | valid on 'valid' subset | loss 8.716 | nll_loss 7.510 | ppl 182.30 | num_updates 226
| epoch 002 | valid on 'valid' subset | loss 6.526 | nll_loss 4.965 | ppl 31.24 | num_updates 452 | best_loss 6.52618
| epoch 003 | valid on 'valid' subset | loss 5.383 | nll_loss 3.646 | ppl 12.52 | num_updates 678 | best_loss 5.3826
| epoch 004 | valid on 'valid' subset | loss 4.849 | nll_loss 3.046 | ppl 8.26 | num_updates 904 | best_loss 4.84942
| epoch 005 | valid on 'valid' subset | loss 4.581 | nll_loss 2.758 | ppl 6.76 | num_updates 1130 | best_loss 4.58094
| epoch 006 | valid on 'valid' subset | loss 4.449 | nll_loss 2.621 | ppl 6.15 | num_updates 1356 | best_loss 4.44871
| epoch 007 | valid on 'valid' subset | loss 4.309 | nll_loss 2.493 | ppl 5.63 | num_updates 1582 | best_loss 4.30944
| epoch 008 | valid on 'valid' subset | loss 4.223 | nll_loss 2.409 | ppl 5.31 | num_updates 1808 | best_loss 4.22342
| epoch 009 | valid on 'valid' subset | loss 4.195 | nll_loss 2.377 | ppl 5.20 | num_updates 2034 | best_loss 4.19518
| epoch 010 | valid on 'valid' subset | loss 4.123 | nll_loss 2.315 | ppl 4.97 | num_updates 2260 | best_loss 4.12328
| epoch 011 | valid on 'valid' subset | loss 4.080 | nll_loss 2.258 | ppl 4.78 | num_updates 2486 | best_loss 4.08045
| epoch 012 | valid on 'valid' subset | loss 4.059 | nll_loss 2.243 | ppl 4.74 | num_updates 2712 | best_loss 4.05907
| epoch 013 | valid on 'valid' subset | loss 4.042 | nll_loss 2.233 | ppl 4.70 | num_updates 2938 | best_loss 4.04195
| epoch 014 | valid on 'valid' subset | loss 4.002 | nll_loss 2.182 | ppl 4.54 | num_updates 3000 | best_loss 4.00215
| epoch 014 | valid on 'valid' subset | loss 4.044 | nll_loss 2.238 | ppl 4.72 | num_updates 3164 | best_loss 4.00215
| epoch 015 | valid on 'valid' subset | loss 4.011 | nll_loss 2.202 | ppl 4.60 | num_updates 3390 | best_loss 4.00215
| epoch 016 | valid on 'valid' subset | loss 4.051 | nll_loss 2.267 | ppl 4.81 | num_updates 3616 | best_loss 4.00215
| epoch 017 | valid on 'valid' subset | loss 3.947 | nll_loss 2.146 | ppl 4.42 | num_updates 3842 | best_loss 3.94704
| epoch 018 | valid on 'valid' subset | loss 4.156 | nll_loss 2.359 | ppl 5.13 | num_updates 4068 | best_loss 3.94704
| epoch 019 | valid on 'valid' subset | loss 4.057 | nll_loss 2.265 | ppl 4.81 | num_updates 4294 | best_loss 3.94704
| epoch 020 | valid on 'valid' subset | loss 3.946 | nll_loss 2.145 | ppl 4.42 | num_updates 4520 | best_loss 3.94561
| epoch 021 | valid on 'valid' subset | loss 3.953 | nll_loss 2.162 | ppl 4.48 | num_updates 4746 | best_loss 3.94561
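A note for reading these numbers: the `ppl` column in the log above is consistent with `2 ** nll_loss`, i.e. `nll_loss` appears to be reported in base 2 (bits per token), while `loss` also includes the label-smoothing term. The small check below is a sketch that only uses values copied from the log; the function name is chosen here for illustration.

```python
def perplexity_from_nll(nll_loss_bits: float) -> float:
    """Convert a base-2 negative log-likelihood per token into perplexity."""
    return 2.0 ** nll_loss_bits


if __name__ == "__main__":
    # (nll_loss, logged ppl) pairs copied from the valid log above
    for nll, logged_ppl in [(7.510, 182.30), (2.146, 4.42)]:
        # Small differences are expected because the log rounds nll_loss.
        print(f"nll={nll:.3f} computed={perplexity_from_nll(nll):.2f} logged={logged_ppl}")
```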

This is the training loss log:

| epoch 001 | loss 11.408 | nll_loss 10.730 | ppl 1698.45 | wps 14018 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 226 | lr 5.65943e-05 | gnorm 1.783 | clip 0.000 | oom 0.000 | wall 4255 | train_wall 4119
| epoch 002 | loss 7.592 | nll_loss 6.312 | ppl 79.43 | wps 14037 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 452 | lr 0.000113089 | gnorm 1.601 | clip 0.000 | oom 0.000 | wall 8737 | train_wall 8234
| epoch 003 | loss 5.635 | nll_loss 4.086 | ppl 16.98 | wps 14056 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 678 | lr 0.000169583 | gnorm 1.625 | clip 0.000 | oom 0.000 | wall 13214 | train_wall 12344
| epoch 004 | loss 4.731 | nll_loss 3.064 | ppl 8.36 | wps 14047 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 904 | lr 0.000226077 | gnorm 1.061 | clip 0.000 | oom 0.000 | wall 17846 | train_wall 16458
| epoch 005 | loss 4.308 | nll_loss 2.591 | ppl 6.02 | wps 14037 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 1130 | lr 0.000282572 | gnorm 0.672 | clip 0.000 | oom 0.000 | wall 22455 | train_wall 20575
| epoch 006 | loss 4.096 | nll_loss 2.357 | ppl 5.12 | wps 14038 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 1356 | lr 0.000339066 | gnorm 0.667 | clip 0.000 | oom 0.000 | wall 27147 | train_wall 24689
| epoch 007 | loss 3.905 | nll_loss 2.146 | ppl 4.43 | wps 14046 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 1582 | lr 0.00039556 | gnorm 0.388 | clip 0.000 | oom 0.000 | wall 31740 | train_wall 28807
| epoch 008 | loss 3.788 | nll_loss 2.019 | ppl 4.05 | wps 14046 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 1808 | lr 0.000452055 | gnorm 0.458 | clip 0.000 | oom 0.000 | wall 36268 | train_wall 32922
| epoch 009 | loss 3.698 | nll_loss 1.922 | ppl 3.79 | wps 14049 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 2034 | lr 0.000508549 | gnorm 0.368 | clip 0.000 | oom 0.000 | wall 40853 | train_wall 37037
| epoch 010 | loss 3.631 | nll_loss 1.850 | ppl 3.60 | wps 14050 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 2260 | lr 0.000565043 | gnorm 0.340 | clip 0.000 | oom 0.000 | wall 45497 | train_wall 41152
| epoch 011 | loss 3.582 | nll_loss 1.797 | ppl 3.47 | wps 14044 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 2486 | lr 0.000621538 | gnorm 0.332 | clip 0.000 | oom 0.000 | wall 50095 | train_wall 45265
| epoch 012 | loss 3.540 | nll_loss 1.752 | ppl 3.37 | wps 14033 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 2712 | lr 0.000678032 | gnorm 0.327 | clip 0.000 | oom 0.000 | wall 54719 | train_wall 49383
| epoch 013 | loss 3.507 | nll_loss 1.716 | ppl 3.28 | wps 14062 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 2938 | lr 0.000734527 | gnorm 0.317 | clip 0.000 | oom 0.000 | wall 58996 | train_wall 53493
| epoch 014 | loss 3.480 | nll_loss 1.687 | ppl 3.22 | wps 13787 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 3164 | lr 0.000791021 | gnorm 0.317 | clip 0.000 | oom 0.000 | wall 63615 | train_wall 57603
| epoch 015 | loss 3.460 | nll_loss 1.665 | ppl 3.17 | wps 14079 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 3390 | lr 0.000847515 | gnorm 0.334 | clip 0.000 | oom 0.000 | wall 67869 | train_wall 61707
| epoch 016 | loss 3.442 | nll_loss 1.646 | ppl 3.13 | wps 14065 | ups 0 | wpb 262316.708 | bsz 8438.088 | num_updates 3616 | lr 0.00090401 | gnorm 0.331 | clip 0.000 | oom 0.000 | wall 72125 | train_wall 65818
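The `lr` column in the training log above can be reproduced from the scheduler flags that appear in the train.log later in this thread (`--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.001`). The function below is a standalone sketch of an inverse-square-root schedule with linear warmup as I understand it, not fairseq's own code; the function and argument names are chosen here for illustration. With these flags, every epoch shown above (up to update 3616) is still inside the 4000-update warmup.

```python
def inverse_sqrt_lr(step: int,
                    peak_lr: float = 1e-3,
                    warmup_init_lr: float = 1e-7,
                    warmup_updates: int = 4000) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_updates:
        # Linear interpolation from warmup_init_lr up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: peak_lr * sqrt(warmup_updates / step).
    return peak_lr * (warmup_updates ** 0.5) * (step ** -0.5)


if __name__ == "__main__":
    # (num_updates, logged lr) pairs copied from the ende training log above
    for step, logged in [(226, 5.65943e-05), (3616, 0.00090401)]:
        print(step, f"{inverse_sqrt_lr(step):.6g}", logged)
```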

This is the enro valid loss:

| epoch 001 | valid on 'valid' subset | loss 13.840 | nll_loss 13.485 | ppl 11467.09 | num_updates 67
| epoch 002 | valid on 'valid' subset | loss 12.018 | nll_loss 11.384 | ppl 2672.36 | num_updates 134 | best_loss 12.0175
| epoch 003 | valid on 'valid' subset | loss 11.343 | nll_loss 10.529 | ppl 1477.64 | num_updates 201 | best_loss 11.3429
| epoch 004 | valid on 'valid' subset | loss 10.768 | nll_loss 9.852 | ppl 923.88 | num_updates 268 | best_loss 10.768
| epoch 005 | valid on 'valid' subset | loss 10.326 | nll_loss 9.329 | ppl 643.00 | num_updates 335 | best_loss 10.3257
| epoch 006 | valid on 'valid' subset | loss 9.688 | nll_loss 8.631 | ppl 396.33 | num_updates 402 | best_loss 9.68773
| epoch 007 | valid on 'valid' subset | loss 9.182 | nll_loss 8.029 | ppl 261.20 | num_updates 469 | best_loss 9.1815
| epoch 008 | valid on 'valid' subset | loss 8.564 | nll_loss 7.327 | ppl 160.60 | num_updates 536 | best_loss 8.56389
| epoch 009 | valid on 'valid' subset | loss 8.035 | nll_loss 6.709 | ppl 104.64 | num_updates 603 | best_loss 8.03461
| epoch 010 | valid on 'valid' subset | loss 7.797 | nll_loss 6.407 | ppl 84.84 | num_updates 670 | best_loss 7.79707
| epoch 011 | valid on 'valid' subset | loss 7.360 | nll_loss 5.904 | ppl 59.89 | num_updates 737 | best_loss 7.36045
| epoch 012 | valid on 'valid' subset | loss 6.850 | nll_loss 5.339 | ppl 40.47 | num_updates 804 | best_loss 6.85043
| epoch 013 | valid on 'valid' subset | loss 6.639 | nll_loss 5.075 | ppl 33.70 | num_updates 871 | best_loss 6.63925
| epoch 014 | valid on 'valid' subset | loss 6.446 | nll_loss 4.852 | ppl 28.89 | num_updates 938 | best_loss 6.44619
| epoch 015 | valid on 'valid' subset | loss 6.283 | nll_loss 4.674 | ppl 25.53 | num_updates 1005 | best_loss 6.28338
| epoch 016 | valid on 'valid' subset | loss 9.088 | nll_loss 7.699 | ppl 207.85 | num_updates 1072 | best_loss 6.28338
| epoch 017 | valid on 'valid' subset | loss 6.134 | nll_loss 4.477 | ppl 22.27 | num_updates 1139 | best_loss 6.13374
| epoch 018 | valid on 'valid' subset | loss 6.065 | nll_loss 4.396 | ppl 21.06 | num_updates 1206 | best_loss 6.06534
| epoch 019 | valid on 'valid' subset | loss 6.681 | nll_loss 5.061 | ppl 33.38 | num_updates 1273 | best_loss 6.06534
| epoch 020 | valid on 'valid' subset | loss 5.781 | nll_loss 4.073 | ppl 16.83 | num_updates 1340 | best_loss 5.78135
| epoch 021 | valid on 'valid' subset | loss 6.543 | nll_loss 4.936 | ppl 30.61 | num_updates 1407 | best_loss 5.78135
| epoch 022 | valid on 'valid' subset | loss 5.632 | nll_loss 3.910 | ppl 15.03 | num_updates 1474 | best_loss 5.63181
| epoch 023 | valid on 'valid' subset | loss 5.874 | nll_loss 4.173 | ppl 18.04 | num_updates 1541 | best_loss 5.63181
| epoch 024 | valid on 'valid' subset | loss 5.733 | nll_loss 4.014 | ppl 16.15 | num_updates 1608 | best_loss 5.63181
| epoch 025 | valid on 'valid' subset | loss 5.544 | nll_loss 3.803 | ppl 13.96 | num_updates 1675 | best_loss 5.54364
| epoch 026 | valid on 'valid' subset | loss 5.394 | nll_loss 3.631 | ppl 12.39 | num_updates 1742 | best_loss 5.39426
| epoch 027 | valid on 'valid' subset | loss 5.482 | nll_loss 3.773 | ppl 13.67 | num_updates 1809 | best_loss 5.39426
| epoch 028 | valid on 'valid' subset | loss 5.608 | nll_loss 3.896 | ppl 14.89 | num_updates 1876 | best_loss 5.39426
| epoch 029 | valid on 'valid' subset | loss 5.368 | nll_loss 3.633 | ppl 12.41 | num_updates 1943 | best_loss 5.36773
| epoch 030 | valid on 'valid' subset | loss 5.538 | nll_loss 3.793 | ppl 13.86 | num_updates 2010 | best_loss 5.36773
| epoch 031 | valid on 'valid' subset | loss 5.386 | nll_loss 3.634 | ppl 12.42 | num_updates 2077 | best_loss 5.36773
| epoch 032 | valid on 'valid' subset | loss 5.382 | nll_loss 3.632 | ppl 12.39 | num_updates 2144 | best_loss 5.36773
| epoch 033 | valid on 'valid' subset | loss 5.372 | nll_loss 3.632 | ppl 12.40 | num_updates 2211 | best_loss 5.36773
| epoch 034 | valid on 'valid' subset | loss 5.633 | nll_loss 3.930 | ppl 15.24 | num_updates 2278 | best_loss 5.36773
| epoch 035 | valid on 'valid' subset | loss 5.390 | nll_loss 3.631 | ppl 12.39 | num_updates 2345 | best_loss 5.36773
| epoch 036 | valid on 'valid' subset | loss 5.580 | nll_loss 3.866 | ppl 14.58 | num_updates 2412 | best_loss 5.36773

This is the enro training loss:

| epoch 001 | loss 14.220 | nll_loss 13.935 | ppl 15664.53 | wps 16185 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 67 | lr 1.68483e-05 | gnorm 3.306 | clip 0.000 | oom 0.000 | wall 1148 | train_wall 1117
| epoch 002 | loss 11.040 | nll_loss 10.356 | ppl 1310.37 | wps 16220 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 134 | lr 3.35966e-05 | gnorm 1.137 | clip 0.000 | oom 0.000 | wall 2440 | train_wall 2232
| epoch 003 | loss 8.994 | nll_loss 7.973 | ppl 251.31 | wps 16189 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 201 | lr 5.0345e-05 | gnorm 1.297 | clip 0.000 | oom 0.000 | wall 3703 | train_wall 3348
| epoch 004 | loss 8.107 | nll_loss 6.916 | ppl 120.76 | wps 16185 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 268 | lr 6.70933e-05 | gnorm 0.682 | clip 0.000 | oom 0.000 | wall 5089 | train_wall 4465
| epoch 005 | loss 7.443 | nll_loss 6.148 | ppl 70.92 | wps 16207 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 335 | lr 8.38416e-05 | gnorm 1.165 | clip 0.000 | oom 0.000 | wall 6477 | train_wall 5581
| epoch 006 | loss 6.824 | nll_loss 5.439 | ppl 43.39 | wps 16182 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 402 | lr 0.00010059 | gnorm 1.270 | clip 0.000 | oom 0.000 | wall 7868 | train_wall 6699
| epoch 007 | loss 6.166 | nll_loss 4.692 | ppl 25.84 | wps 16150 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 469 | lr 0.000117338 | gnorm 1.976 | clip 0.000 | oom 0.000 | wall 9361 | train_wall 7818
| epoch 008 | loss 5.564 | nll_loss 4.006 | ppl 16.07 | wps 16213 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 536 | lr 0.000134087 | gnorm 1.932 | clip 0.000 | oom 0.000 | wall 10884 | train_wall 8934
| epoch 009 | loss 5.048 | nll_loss 3.418 | ppl 10.69 | wps 16201 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 603 | lr 0.000150835 | gnorm 1.926 | clip 0.000 | oom 0.000 | wall 12428 | train_wall 10050
| epoch 010 | loss 4.573 | nll_loss 2.879 | ppl 7.36 | wps 16184 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 670 | lr 0.000167583 | gnorm 1.721 | clip 0.000 | oom 0.000 | wall 14057 | train_wall 11167
| epoch 011 | loss 4.289 | nll_loss 2.555 | ppl 5.88 | wps 16218 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 737 | lr 0.000184332 | gnorm 1.491 | clip 0.000 | oom 0.000 | wall 15592 | train_wall 12282
| epoch 012 | loss 3.989 | nll_loss 2.215 | ppl 4.64 | wps 16213 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 804 | lr 0.00020108 | gnorm 1.106 | clip 0.000 | oom 0.000 | wall 17279 | train_wall 13398
| epoch 013 | loss 3.907 | nll_loss 2.121 | ppl 4.35 | wps 16182 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 871 | lr 0.000217828 | gnorm 1.186 | clip 0.000 | oom 0.000 | wall 19025 | train_wall 14516
| epoch 014 | loss 3.684 | nll_loss 1.871 | ppl 3.66 | wps 16163 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 938 | lr 0.000234577 | gnorm 0.943 | clip 0.000 | oom 0.000 | wall 20591 | train_wall 15635
| epoch 015 | loss 3.636 | nll_loss 1.816 | ppl 3.52 | wps 16189 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1005 | lr 0.000251325 | gnorm 0.916 | clip 0.000 | oom 0.000 | wall 22239 | train_wall 16752
| epoch 016 | loss 3.495 | nll_loss 1.662 | ppl 3.16 | wps 16231 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1072 | lr 0.000268073 | gnorm 0.905 | clip 0.000 | oom 0.000 | wall 23966 | train_wall 17866
| epoch 017 | loss 3.795 | nll_loss 1.989 | ppl 3.97 | wps 16211 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1139 | lr 0.000284822 | gnorm 1.127 | clip 0.000 | oom 0.000 | wall 25400 | train_wall 18982
| epoch 018 | loss 3.372 | nll_loss 1.522 | ppl 2.87 | wps 16190 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1206 | lr 0.00030157 | gnorm 0.418 | clip 0.000 | oom 0.000 | wall 27064 | train_wall 20099
| epoch 019 | loss 3.345 | nll_loss 1.495 | ppl 2.82 | wps 16231 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1273 | lr 0.000318318 | gnorm 0.815 | clip 0.000 | oom 0.000 | wall 28628 | train_wall 21214
| epoch 020 | loss 3.362 | nll_loss 1.510 | ppl 2.85 | wps 16166 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1340 | lr 0.000335066 | gnorm 0.516 | clip 0.000 | oom 0.000 | wall 30133 | train_wall 22333
| epoch 021 | loss 3.220 | nll_loss 1.359 | ppl 2.57 | wps 16200 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1407 | lr 0.000351815 | gnorm 0.598 | clip 0.000 | oom 0.000 | wall 31770 | train_wall 23449
| epoch 022 | loss 3.304 | nll_loss 1.447 | ppl 2.73 | wps 16172 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1474 | lr 0.000368563 | gnorm 0.606 | clip 0.000 | oom 0.000 | wall 33260 | train_wall 24567
| epoch 023 | loss 3.265 | nll_loss 1.410 | ppl 2.66 | wps 16185 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1541 | lr 0.000385311 | gnorm 0.748 | clip 0.000 | oom 0.000 | wall 34811 | train_wall 25684
| epoch 024 | loss 3.141 | nll_loss 1.270 | ppl 2.41 | wps 16162 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1608 | lr 0.00040206 | gnorm 0.237 | clip 0.000 | oom 0.000 | wall 36313 | train_wall 26804
| epoch 025 | loss 3.081 | nll_loss 1.207 | ppl 2.31 | wps 16170 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1675 | lr 0.000418808 | gnorm 0.432 | clip 0.000 | oom 0.000 | wall 37763 | train_wall 27922
| epoch 026 | loss 3.011 | nll_loss 1.132 | ppl 2.19 | wps 16231 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1742 | lr 0.000435556 | gnorm 0.253 | clip 0.000 | oom 0.000 | wall 39370 | train_wall 29036
| epoch 027 | loss 2.973 | nll_loss 1.091 | ppl 2.13 | wps 16190 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1809 | lr 0.000452305 | gnorm 0.336 | clip 0.000 | oom 0.000 | wall 40982 | train_wall 30152
| epoch 028 | loss 2.936 | nll_loss 1.051 | ppl 2.07 | wps 16181 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1876 | lr 0.000469053 | gnorm 0.338 | clip 0.000 | oom 0.000 | wall 42437 | train_wall 31270
| epoch 029 | loss 2.897 | nll_loss 1.008 | ppl 2.01 | wps 16242 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 1943 | lr 0.000485801 | gnorm 0.291 | clip 0.000 | oom 0.000 | wall 43880 | train_wall 32383
| epoch 030 | loss 3.320 | nll_loss 1.465 | ppl 2.76 | wps 16208 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 2010 | lr 0.00050255 | gnorm 1.302 | clip 0.000 | oom 0.000 | wall 45434 | train_wall 33498
| epoch 031 | loss 2.885 | nll_loss 0.992 | ppl 1.99 | wps 16169 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 2077 | lr 0.000519298 | gnorm 0.168 | clip 0.000 | oom 0.000 | wall 46887 | train_wall 34616
| epoch 032 | loss 2.836 | nll_loss 0.941 | ppl 1.92 | wps 16203 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 2144 | lr 0.000536046 | gnorm 0.205 | clip 0.000 | oom 0.000 | wall 48312 | train_wall 35733
| epoch 033 | loss 2.816 | nll_loss 0.919 | ppl 1.89 | wps 16215 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 2211 | lr 0.000552795 | gnorm 0.282 | clip 0.000 | oom 0.000 | wall 49751 | train_wall 36848
| epoch 034 | loss 2.800 | nll_loss 0.902 | ppl 1.87 | wps 16187 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 2278 | lr 0.000569543 | gnorm 0.295 | clip 0.000 | oom 0.000 | wall 51147 | train_wall 37965
| epoch 035 | loss 2.781 | nll_loss 0.882 | ppl 1.84 | wps 16237 | ups 0 | wpb 275435.075 | bsz 6681.955 | num_updates 2345 | lr 0.000586291 | gnorm 0.301 | clip 0.000 | oom 0.000 | wall 52633 | train_wall 39079
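To make the train/valid comparison described earlier easier to reproduce, the sketch below pulls `nll_loss` per epoch out of log text in the format shown above and computes the gap between valid and train. The regular expression and helper names are assumptions written for this log format only. The two sample strings are shortened copies of the enro records for epochs 034 and 035.

```python
import re
from typing import Dict

# Matches both train records ("epoch N | loss ... | nll_loss ...") and
# valid records ("epoch N | valid on 'valid' subset | loss ... | nll_loss ...").
RECORD = re.compile(
    r"epoch (\d+) \|(?: valid on '\w+' subset \|)? loss [\d.]+ \| nll_loss ([\d.]+)"
)


def nll_by_epoch(log_text: str) -> Dict[int, float]:
    """Map epoch number to nll_loss; a repeated epoch keeps its last record."""
    return {int(epoch): float(nll) for epoch, nll in RECORD.findall(log_text)}


train_log = (
    "| epoch 034 | loss 2.800 | nll_loss 0.902 | ppl 1.87 "
    "| epoch 035 | loss 2.781 | nll_loss 0.882 | ppl 1.84 |"
)
valid_log = (
    "| epoch 034 | valid on 'valid' subset | loss 5.633 | nll_loss 3.930 | ppl 15.24 "
    "| epoch 035 | valid on 'valid' subset | loss 5.390 | nll_loss 3.631 | ppl 12.39 |"
)

train = nll_by_epoch(train_log)
valid = nll_by_epoch(valid_log)
# Gap between valid and train nll_loss for epochs present in both logs.
gap = {epoch: valid[epoch] - train[epoch] for epoch in train if epoch in valid}

if __name__ == "__main__":
    for epoch in sorted(gap):
        print(f"epoch {epoch:03d}: train={train[epoch]:.3f} "
              f"valid={valid[epoch]:.3f} gap={gap[epoch]:.3f}")
```

On the enro logs above, the gap at epoch 035 is roughly 2.7 in `nll_loss`, which matches the description of a lower training loss and a higher validation loss.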

glample commented 5 years ago

Can you provide the full train.log file, so I can see the exact parameters used?

XiaoqingNLP commented 5 years ago

This is the full train.log file for the ende dataset. It uses 8 GPUs, so the model structure is printed 8 times.

+ '[' 4 == 0 ']'
+ '[' 4 '!=' 4 -a 4 '!=' 3 ']'
+ '[' 4 = 4 ']'
+ GPU=0,1,2,3,4,5,6,7
+ src=en
+ tgt=de
+ model=xlm_translation
+ '[' en '!=' en ']'
+ ln_pair=ende
+ data=data-bin/xlm_pre_NN.ende
+ save_dir=ckpts/ende_xlm_translation
+ update_freq=28
+ max_tokens=1536
+ save_interval_updates=3000
+ xlm_translation
+ xlm=ckpts/mlm_ende_1024.pth
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ python train.py data-bin/xlm_pre_NN.ende --save-dir ckpts/ende_xlm_translation -s en -t de --arch big_xlm_translation --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.001 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-source-positions 512 --max-target-positions 512 --max-tokens 1536 --update-freq 28 --save-interval-updates 3000 --reload_xlm_ckpt ckpts/mlm_ende_1024.pth --encoder_trainable True
| NOTE: you may get better performance with: --ddp-backend=no_c10d
| NOTE: auto set: --ddp-backend=no_c10d
FAISS library was not found.
FAISS not available. Switching to standard nearest neighbors search implementation.
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=5, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=5, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 64699 types
| [de] dictionary: 64699 types
| data-bin/xlm_pre_NN.ende train 1907028 examples
| data-bin/xlm_pre_NN.ende valid 508 examples
TransformerNNmentLayerModel(
  (encoder): XLM_Encoder(
    (model): TransformerModel(
      (position_embeddings): Embedding(512, 1024)
      (lang_embeddings): Embedding(2, 1024)
      (embeddings): Embedding(64699, 1024, padding_idx=2)
      (layer_norm_emb): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      (attentions): ModuleList(
        (0): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (1): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (2): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (3): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (4): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (5): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
      )
      (layer_norm1): ModuleList(
        (0): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (1): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (2): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (3): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (4): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (5): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      )
      (ffns): ModuleList(
        (0): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (1): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (2): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (3): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (4): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (5): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (layer_norm2): ModuleList(
        (0): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (1): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (2): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (3): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (4): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (5): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      )
      (memories): ModuleDict()
      (pred_layer): PredLayer(
        (proj): Linear(in_features=1024, out_features=64699, bias=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (embed_tokens): Embedding(64699, 1024, padding_idx=2)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)
| model big_xlm_translation, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 375705787 (num. trained: 375705787)
| Trainable params.encoder.model.position_embeddings.weight encoder.model.lang_embeddings.weight    encoder.model.embeddings.weight encoder.model.layer_norm_emb.weight encoder.model.layer_norm_emb.bias   encoder.model.attentions.0.q_lin.weight encoder.model.attentions.0.q_lin.bias   encoder.model.attentions.0.k_lin.weight encoder.model.attentions.0.k_lin.bias   encoder.model.attentions.0.v_lin.weight encoder.model.attentions.0.v_lin.bias   encoder.model.attentions.0.out_lin.weight   encoder.model.attentions.0.out_lin.bias encoder.model.attentions.1.q_lin.weight encoder.model.attentions.1.q_lin.bias   encoder.model.attentions.1.k_lin.weight encoder.model.attentions.1.k_lin.bias   encoder.model.attentions.1.v_lin.weight encoder.model.attentions.1.v_lin.bias   encoder.model.attentions.1.out_lin.weight   encoder.model.attentions.1.out_lin.bias encoder.model.attentions.2.q_lin.weight encoder.model.attentions.2.q_lin.bias   encoder.model.attentions.2.k_lin.weight encoder.model.attentions.2.k_lin.bias   encoder.model.attentions.2.v_lin.weight encoder.model.attentions.2.v_lin.bias   encoder.model.attentions.2.out_lin.weight   encoder.model.attentions.2.out_lin.bias encoder.model.attentions.3.q_lin.weight encoder.model.attentions.3.q_lin.bias   encoder.model.attentions.3.k_lin.weight encoder.model.attentions.3.k_lin.bias   encoder.model.attentions.3.v_lin.weight encoder.model.attentions.3.v_lin.bias   encoder.model.attentions.3.out_lin.weight   encoder.model.attentions.3.out_lin.bias encoder.model.attentions.4.q_lin.weight encoder.model.attentions.4.q_lin.bias   encoder.model.attentions.4.k_lin.weight encoder.model.attentions.4.k_lin.bias   encoder.model.attentions.4.v_lin.weight encoder.model.attentions.4.v_lin.bias   encoder.model.attentions.4.out_lin.weight   encoder.model.attentions.4.out_lin.bias encoder.model.attentions.5.q_lin.weight encoder.model.attentions.5.q_lin.bias   encoder.model.attentions.5.k_lin.weight encoder.model.attentions.5.k_lin.bias   
encoder.model.attentions.5.v_lin.weight encoder.model.attentions.5.v_lin.bias   encoder.model.attentions.5.out_lin.weight   encoder.model.attentions.5.out_lin.bias encoder.model.layer_norm1.0.weight  encoder.model.layer_norm1.0.bias    encoder.model.layer_norm1.1.weight  encoder.model.layer_norm1.1.bias    encoder.model.layer_norm1.2.weight  encoder.model.layer_norm1.2.bias    encoder.model.layer_norm1.3.weight  encoder.model.layer_norm1.3.bias    encoder.model.layer_norm1.4.weight  encoder.model.layer_norm1.4.bias    encoder.model.layer_norm1.5.weight  encoder.model.layer_norm1.5.bias    encoder.model.ffns.0.lin1.weight    encoder.model.ffns.0.lin1.bias  encoder.model.ffns.0.lin2.weight    encoder.model.ffns.0.lin2.bias  encoder.model.ffns.1.lin1.weight    encoder.model.ffns.1.lin1.bias  encoder.model.ffns.1.lin2.weight    encoder.model.ffns.1.lin2.bias  encoder.model.ffns.2.lin1.weight    encoder.model.ffns.2.lin1.bias  encoder.model.ffns.2.lin2.weight    encoder.model.ffns.2.lin2.bias  encoder.model.ffns.3.lin1.weight    encoder.model.ffns.3.lin1.bias  encoder.model.ffns.3.lin2.weight    encoder.model.ffns.3.lin2.bias  encoder.model.ffns.4.lin1.weight    encoder.model.ffns.4.lin1.bias  encoder.model.ffns.4.lin2.weight    encoder.model.ffns.4.lin2.bias  encoder.model.ffns.5.lin1.weight    encoder.model.ffns.5.lin1.bias  encoder.model.ffns.5.lin2.weight    encoder.model.ffns.5.lin2.bias  encoder.model.layer_norm2.0.weight  encoder.model.layer_norm2.0.bias    encoder.model.layer_norm2.1.weight  encoder.model.layer_norm2.1.bias    encoder.model.layer_norm2.2.weight  encoder.model.layer_norm2.2.bias    encoder.model.layer_norm2.3.weight  encoder.model.layer_norm2.3.bias    encoder.model.layer_norm2.4.weight  encoder.model.layer_norm2.4.bias    encoder.model.layer_norm2.5.weight  encoder.model.layer_norm2.5.bias    encoder.model.pred_layer.proj.bias  decoder.embed_out   decoder.embed_tokens.weight decoder.layers.0.self_attn.in_proj_weight   
decoder.layers.0.self_attn.in_proj_bias decoder.layers.0.self_attn.out_proj.weight  decoder.layers.0.self_attn.out_proj.bias    decoder.layers.0.self_attn_layer_norm.weight    decoder.layers.0.self_attn_layer_norm.bias  decoder.layers.0.encoder_attn.in_proj_weight    decoder.layers.0.encoder_attn.in_proj_bias  decoder.layers.0.encoder_attn.out_proj.weight   decoder.layers.0.encoder_attn.out_proj.bias decoder.layers.0.encoder_attn_layer_norm.weight decoder.layers.0.encoder_attn_layer_norm.bias   decoder.layers.0.fc1.weight decoder.layers.0.fc1.bias   decoder.layers.0.fc2.weight decoder.layers.0.fc2.bias   decoder.layers.0.final_layer_norm.weight    decoder.layers.0.final_layer_norm.bias  decoder.layers.1.self_attn.in_proj_weight   decoder.layers.1.self_attn.in_proj_bias decoder.layers.1.self_attn.out_proj.weight  decoder.layers.1.self_attn.out_proj.bias    decoder.layers.1.self_attn_layer_norm.weight    decoder.layers.1.self_attn_layer_norm.bias  decoder.layers.1.encoder_attn.in_proj_weight    decoder.layers.1.encoder_attn.in_proj_bias  decoder.layers.1.encoder_attn.out_proj.weight   decoder.layers.1.encoder_attn.out_proj.bias decoder.layers.1.encoder_attn_layer_norm.weight decoder.layers.1.encoder_attn_layer_norm.bias   decoder.layers.1.fc1.weight decoder.layers.1.fc1.bias   decoder.layers.1.fc2.weight decoder.layers.1.fc2.bias   decoder.layers.1.final_layer_norm.weight    decoder.layers.1.final_layer_norm.bias  decoder.layers.2.self_attn.in_proj_weight   decoder.layers.2.self_attn.in_proj_bias decoder.layers.2.self_attn.out_proj.weight  decoder.layers.2.self_attn.out_proj.bias    decoder.layers.2.self_attn_layer_norm.weight    decoder.layers.2.self_attn_layer_norm.bias  decoder.layers.2.encoder_attn.in_proj_weight    decoder.layers.2.encoder_attn.in_proj_bias  decoder.layers.2.encoder_attn.out_proj.weight   decoder.layers.2.encoder_attn.out_proj.bias decoder.layers.2.encoder_attn_layer_norm.weight decoder.layers.2.encoder_attn_layer_norm.bias   
decoder.layers.2.fc1.weight decoder.layers.2.fc1.bias   decoder.layers.2.fc2.weight decoder.layers.2.fc2.bias   decoder.layers.2.final_layer_norm.weight    decoder.layers.2.final_layer_norm.bias  decoder.layers.3.self_attn.in_proj_weight   decoder.layers.3.self_attn.in_proj_bias decoder.layers.3.self_attn.out_proj.weight  decoder.layers.3.self_attn.out_proj.bias    decoder.layers.3.self_attn_layer_norm.weight    decoder.layers.3.self_attn_layer_norm.bias  decoder.layers.3.encoder_attn.in_proj_weight    decoder.layers.3.encoder_attn.in_proj_bias  decoder.layers.3.encoder_attn.out_proj.weight   decoder.layers.3.encoder_attn.out_proj.bias decoder.layers.3.encoder_attn_layer_norm.weight decoder.layers.3.encoder_attn_layer_norm.bias   decoder.layers.3.fc1.weight decoder.layers.3.fc1.bias   decoder.layers.3.fc2.weight decoder.layers.3.fc2.bias   decoder.layers.3.final_layer_norm.weight    decoder.layers.3.final_layer_norm.bias  decoder.layers.4.self_attn.in_proj_weight   decoder.layers.4.self_attn.in_proj_bias decoder.layers.4.self_attn.out_proj.weight  decoder.layers.4.self_attn.out_proj.bias    decoder.layers.4.self_attn_layer_norm.weight    decoder.layers.4.self_attn_layer_norm.bias  decoder.layers.4.encoder_attn.in_proj_weight    decoder.layers.4.encoder_attn.in_proj_bias  decoder.layers.4.encoder_attn.out_proj.weight   decoder.layers.4.encoder_attn.out_proj.bias decoder.layers.4.encoder_attn_layer_norm.weight decoder.layers.4.encoder_attn_layer_norm.bias   decoder.layers.4.fc1.weight decoder.layers.4.fc1.bias   decoder.layers.4.fc2.weight decoder.layers.4.fc2.bias   decoder.layers.4.final_layer_norm.weight    decoder.layers.4.final_layer_norm.bias  decoder.layers.5.self_attn.in_proj_weight   decoder.layers.5.self_attn.in_proj_bias decoder.layers.5.self_attn.out_proj.weight  decoder.layers.5.self_attn.out_proj.bias    decoder.layers.5.self_attn_layer_norm.weight    decoder.layers.5.self_attn_layer_norm.bias  decoder.layers.5.encoder_attn.in_proj_weight    
decoder.layers.5.encoder_attn.in_proj_bias  decoder.layers.5.encoder_attn.out_proj.weight   decoder.layers.5.encoder_attn.out_proj.bias decoder.layers.5.encoder_attn_layer_norm.weight decoder.layers.5.encoder_attn_layer_norm.bias   decoder.layers.5.fc1.weight decoder.layers.5.fc1.bias   decoder.layers.5.fc2.weight decoder.layers.5.fc2.bias   decoder.layers.5.final_layer_norm.weight    decoder.layers.5.final_layer_norm.bias
| training on 8 GPUs
| max tokens per GPU = 1536 and max sentences per GPU = None
| WARNING: 20 samples have invalid sizes and will be skipped, max_positions=(512, 512), first few sample ids=[1464114, 1797546, 1513624, 1365715, 1841665, 1553144, 1795797, 1643331, 1583565, 1704988]
| distributed init (rank 5): tcp://localhost:17056
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=1, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=1, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| distributed init (rank 1): tcp://localhost:17056
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=7, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=7, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
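As a side note on the `Namespace` above: it sets `lr_scheduler='inverse_sqrt'` with `lr=[0.001]`, `warmup_init_lr=1e-07` and `warmup_updates=4000`. The sketch below shows what that schedule looks like, assuming the usual definition of an inverse-square-root schedule (linear warmup to the peak rate, then decay proportional to the inverse square root of the update count). The function name `inverse_sqrt_lr` is hypothetical and this is not the trainer's own code.

```python
import math


def inverse_sqrt_lr(num_updates, lr=1e-3, warmup_init_lr=1e-7, warmup_updates=4000):
    """Learning rate at a given update (hypothetical helper, values taken from the log).

    Assumed behaviour: linear warmup from warmup_init_lr to lr over
    warmup_updates steps, then lr * sqrt(warmup_updates / num_updates).
    """
    if num_updates < warmup_updates:
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    return lr * math.sqrt(warmup_updates) / math.sqrt(num_updates)


# Print the rate at a few update counts to see the warmup and decay phases.
for n in (0, 1000, 4000, 16000):
    print(n, f"{inverse_sqrt_lr(n):.2e}")
```

Under this assumption the peak rate of 0.001 is only reached at update 4000, so a run that stops earlier never trains at the full learning rate.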
| [en] dictionary: 64699 types
| [de] dictionary: 64699 types
| data-bin/xlm_pre_NN.ende train 1907028 examples
| data-bin/xlm_pre_NN.ende valid 508 examples
TransformerNNmentLayerModel(
  (encoder): XLM_Encoder(
    (model): TransformerModel(
      (position_embeddings): Embedding(512, 1024)
      (lang_embeddings): Embedding(2, 1024)
      (embeddings): Embedding(64699, 1024, padding_idx=2)
      (layer_norm_emb): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      (attentions): ModuleList(
        (0): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (1): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (2): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (3): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (4): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (5): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
      )
      (layer_norm1): ModuleList(
        (0): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (1): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (2): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (3): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (4): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (5): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      )
      (ffns): ModuleList(
        (0): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (1): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (2): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (3): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (4): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (5): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (layer_norm2): ModuleList(
        (0): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (1): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (2): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (3): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (4): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (5): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      )
      (memories): ModuleDict()
      (pred_layer): PredLayer(
        (proj): Linear(in_features=1024, out_features=64699, bias=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (embed_tokens): Embedding(64699, 1024, padding_idx=2)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)
| model big_xlm_translation, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 375705787 (num. trained: 375705787)
| Trainable params.encoder.model.position_embeddings.weight encoder.model.lang_embeddings.weight    encoder.model.embeddings.weight encoder.model.layer_norm_emb.weight encoder.model.layer_norm_emb.bias   encoder.model.attentions.0.q_lin.weight encoder.model.attentions.0.q_lin.bias   encoder.model.attentions.0.k_lin.weight encoder.model.attentions.0.k_lin.bias   encoder.model.attentions.0.v_lin.weight encoder.model.attentions.0.v_lin.bias   encoder.model.attentions.0.out_lin.weight   encoder.model.attentions.0.out_lin.bias encoder.model.attentions.1.q_lin.weight encoder.model.attentions.1.q_lin.bias   encoder.model.attentions.1.k_lin.weight encoder.model.attentions.1.k_lin.bias   encoder.model.attentions.1.v_lin.weight encoder.model.attentions.1.v_lin.bias   encoder.model.attentions.1.out_lin.weight   encoder.model.attentions.1.out_lin.bias encoder.model.attentions.2.q_lin.weight encoder.model.attentions.2.q_lin.bias   encoder.model.attentions.2.k_lin.weight encoder.model.attentions.2.k_lin.bias   encoder.model.attentions.2.v_lin.weight encoder.model.attentions.2.v_lin.bias   encoder.model.attentions.2.out_lin.weight   encoder.model.attentions.2.out_lin.bias encoder.model.attentions.3.q_lin.weight encoder.model.attentions.3.q_lin.bias   encoder.model.attentions.3.k_lin.weight encoder.model.attentions.3.k_lin.bias   encoder.model.attentions.3.v_lin.weight encoder.model.attentions.3.v_lin.bias   encoder.model.attentions.3.out_lin.weight   encoder.model.attentions.3.out_lin.bias encoder.model.attentions.4.q_lin.weight encoder.model.attentions.4.q_lin.bias   encoder.model.attentions.4.k_lin.weight encoder.model.attentions.4.k_lin.bias   encoder.model.attentions.4.v_lin.weight encoder.model.attentions.4.v_lin.bias   encoder.model.attentions.4.out_lin.weight   encoder.model.attentions.4.out_lin.bias encoder.model.attentions.5.q_lin.weight encoder.model.attentions.5.q_lin.bias   encoder.model.attentions.5.k_lin.weight encoder.model.attentions.5.k_lin.bias   
encoder.model.attentions.5.v_lin.weight encoder.model.attentions.5.v_lin.bias   encoder.model.attentions.5.out_lin.weight   encoder.model.attentions.5.out_lin.bias encoder.model.layer_norm1.0.weight  encoder.model.layer_norm1.0.bias    encoder.model.layer_norm1.1.weight  encoder.model.layer_norm1.1.bias    encoder.model.layer_norm1.2.weight  encoder.model.layer_norm1.2.bias    encoder.model.layer_norm1.3.weight  encoder.model.layer_norm1.3.bias    encoder.model.layer_norm1.4.weight  encoder.model.layer_norm1.4.bias    encoder.model.layer_norm1.5.weight  encoder.model.layer_norm1.5.bias    encoder.model.ffns.0.lin1.weight    encoder.model.ffns.0.lin1.bias  encoder.model.ffns.0.lin2.weight    encoder.model.ffns.0.lin2.bias  encoder.model.ffns.1.lin1.weight    encoder.model.ffns.1.lin1.bias  encoder.model.ffns.1.lin2.weight    encoder.model.ffns.1.lin2.bias  encoder.model.ffns.2.lin1.weight    encoder.model.ffns.2.lin1.bias  encoder.model.ffns.2.lin2.weight    encoder.model.ffns.2.lin2.bias  encoder.model.ffns.3.lin1.weight    encoder.model.ffns.3.lin1.bias  encoder.model.ffns.3.lin2.weight    encoder.model.ffns.3.lin2.bias  encoder.model.ffns.4.lin1.weight    encoder.model.ffns.4.lin1.bias  encoder.model.ffns.4.lin2.weight    encoder.model.ffns.4.lin2.bias  encoder.model.ffns.5.lin1.weight    encoder.model.ffns.5.lin1.bias  encoder.model.ffns.5.lin2.weight    encoder.model.ffns.5.lin2.bias  encoder.model.layer_norm2.0.weight  encoder.model.layer_norm2.0.bias    encoder.model.layer_norm2.1.weight  encoder.model.layer_norm2.1.bias    encoder.model.layer_norm2.2.weight  encoder.model.layer_norm2.2.bias    encoder.model.layer_norm2.3.weight  encoder.model.layer_norm2.3.bias    encoder.model.layer_norm2.4.weight  encoder.model.layer_norm2.4.bias    encoder.model.layer_norm2.5.weight  encoder.model.layer_norm2.5.bias    encoder.model.pred_layer.proj.bias  decoder.embed_out   decoder.embed_tokens.weight decoder.layers.0.self_attn.in_proj_weight   
decoder.layers.0.self_attn.in_proj_bias decoder.layers.0.self_attn.out_proj.weight  decoder.layers.0.self_attn.out_proj.bias    decoder.layers.0.self_attn_layer_norm.weight    decoder.layers.0.self_attn_layer_norm.bias  decoder.layers.0.encoder_attn.in_proj_weight    decoder.layers.0.encoder_attn.in_proj_bias  decoder.layers.0.encoder_attn.out_proj.weight   decoder.layers.0.encoder_attn.out_proj.bias decoder.layers.0.encoder_attn_layer_norm.weight decoder.layers.0.encoder_attn_layer_norm.bias   decoder.layers.0.fc1.weight decoder.layers.0.fc1.bias   decoder.layers.0.fc2.weight decoder.layers.0.fc2.bias   decoder.layers.0.final_layer_norm.weight    decoder.layers.0.final_layer_norm.bias  decoder.layers.1.self_attn.in_proj_weight   decoder.layers.1.self_attn.in_proj_bias decoder.layers.1.self_attn.out_proj.weight  decoder.layers.1.self_attn.out_proj.bias    decoder.layers.1.self_attn_layer_norm.weight    decoder.layers.1.self_attn_layer_norm.bias  decoder.layers.1.encoder_attn.in_proj_weight    decoder.layers.1.encoder_attn.in_proj_bias  decoder.layers.1.encoder_attn.out_proj.weight   decoder.layers.1.encoder_attn.out_proj.bias decoder.layers.1.encoder_attn_layer_norm.weight decoder.layers.1.encoder_attn_layer_norm.bias   decoder.layers.1.fc1.weight decoder.layers.1.fc1.bias   decoder.layers.1.fc2.weight decoder.layers.1.fc2.bias   decoder.layers.1.final_layer_norm.weight    decoder.layers.1.final_layer_norm.bias  decoder.layers.2.self_attn.in_proj_weight   decoder.layers.2.self_attn.in_proj_bias decoder.layers.2.self_attn.out_proj.weight  decoder.layers.2.self_attn.out_proj.bias    decoder.layers.2.self_attn_layer_norm.weight    decoder.layers.2.self_attn_layer_norm.bias  decoder.layers.2.encoder_attn.in_proj_weight    decoder.layers.2.encoder_attn.in_proj_bias  decoder.layers.2.encoder_attn.out_proj.weight   decoder.layers.2.encoder_attn.out_proj.bias decoder.layers.2.encoder_attn_layer_norm.weight decoder.layers.2.encoder_attn_layer_norm.bias   
decoder.layers.2.fc1.weight decoder.layers.2.fc1.bias   decoder.layers.2.fc2.weight decoder.layers.2.fc2.bias   decoder.layers.2.final_layer_norm.weight    decoder.layers.2.final_layer_norm.bias  decoder.layers.3.self_attn.in_proj_weight   decoder.layers.3.self_attn.in_proj_bias decoder.layers.3.self_attn.out_proj.weight  decoder.layers.3.self_attn.out_proj.bias    decoder.layers.3.self_attn_layer_norm.weight    decoder.layers.3.self_attn_layer_norm.bias  decoder.layers.3.encoder_attn.in_proj_weight    decoder.layers.3.encoder_attn.in_proj_bias  decoder.layers.3.encoder_attn.out_proj.weight   decoder.layers.3.encoder_attn.out_proj.bias decoder.layers.3.encoder_attn_layer_norm.weight decoder.layers.3.encoder_attn_layer_norm.bias   decoder.layers.3.fc1.weight decoder.layers.3.fc1.bias   decoder.layers.3.fc2.weight decoder.layers.3.fc2.bias   decoder.layers.3.final_layer_norm.weight    decoder.layers.3.final_layer_norm.bias  decoder.layers.4.self_attn.in_proj_weight   decoder.layers.4.self_attn.in_proj_bias decoder.layers.4.self_attn.out_proj.weight  decoder.layers.4.self_attn.out_proj.bias    decoder.layers.4.self_attn_layer_norm.weight    decoder.layers.4.self_attn_layer_norm.bias  decoder.layers.4.encoder_attn.in_proj_weight    decoder.layers.4.encoder_attn.in_proj_bias  decoder.layers.4.encoder_attn.out_proj.weight   decoder.layers.4.encoder_attn.out_proj.bias decoder.layers.4.encoder_attn_layer_norm.weight decoder.layers.4.encoder_attn_layer_norm.bias   decoder.layers.4.fc1.weight decoder.layers.4.fc1.bias   decoder.layers.4.fc2.weight decoder.layers.4.fc2.bias   decoder.layers.4.final_layer_norm.weight    decoder.layers.4.final_layer_norm.bias  decoder.layers.5.self_attn.in_proj_weight   decoder.layers.5.self_attn.in_proj_bias decoder.layers.5.self_attn.out_proj.weight  decoder.layers.5.self_attn.out_proj.bias    decoder.layers.5.self_attn_layer_norm.weight    decoder.layers.5.self_attn_layer_norm.bias  decoder.layers.5.encoder_attn.in_proj_weight    
decoder.layers.5.encoder_attn.in_proj_bias  decoder.layers.5.encoder_attn.out_proj.weight   decoder.layers.5.encoder_attn.out_proj.bias decoder.layers.5.encoder_attn_layer_norm.weight decoder.layers.5.encoder_attn_layer_norm.bias   decoder.layers.5.fc1.weight decoder.layers.5.fc1.bias   decoder.layers.5.fc2.weight decoder.layers.5.fc2.bias   decoder.layers.5.final_layer_norm.weight    decoder.layers.5.final_layer_norm.bias
| training on 8 GPUs
| max tokens per GPU = 1536 and max sentences per GPU = None
| WARNING: 20 samples have invalid sizes and will be skipped, max_positions=(512, 512), first few sample ids=[1464114, 1797546, 1513624, 1365715, 1841665, 1553144, 1795797, 1643331, 1583565, 1704988]
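For reference, the effective batch size implied by these settings can be worked out from three values in the log: `max_tokens=1536` per GPU, 8 GPUs, and `update_freq=[28]` (gradient accumulation over 28 batches). A minimal sketch, where `tokens_per_update` is a hypothetical helper name; the result is an upper bound, since real batches are not always full:

```python
def tokens_per_update(max_tokens, num_gpus, update_freq):
    """Upper bound on tokens contributing to one optimizer step.

    Each GPU processes at most max_tokens per batch, and gradients are
    accumulated over update_freq batches before the parameters are updated.
    """
    return max_tokens * num_gpus * update_freq


# Values copied from the log above.
upper_bound = tokens_per_update(max_tokens=1536, num_gpus=8, update_freq=28)
print(upper_bound)  # 344064
```

So each parameter update sees up to roughly 344k tokens, which is worth keeping in mind when comparing the learning rate and warmup against a baseline trained with a different batch size.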
| distributed init (rank 7): tcp://localhost:17056
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 64699 types
| [de] dictionary: 64699 types
| data-bin/xlm_pre_NN.ende train 1907028 examples
| data-bin/xlm_pre_NN.ende valid 508 examples
| model big_xlm_translation, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 375705787 (num. trained: 375705787)
| Trainable params.encoder.model.position_embeddings.weight encoder.model.lang_embeddings.weight    encoder.model.embeddings.weight encoder.model.layer_norm_emb.weight encoder.model.layer_norm_emb.bias   encoder.model.attentions.0.q_lin.weight encoder.model.attentions.0.q_lin.bias   encoder.model.attentions.0.k_lin.weight encoder.model.attentions.0.k_lin.bias   encoder.model.attentions.0.v_lin.weight encoder.model.attentions.0.v_lin.bias   encoder.model.attentions.0.out_lin.weight   encoder.model.attentions.0.out_lin.bias encoder.model.attentions.1.q_lin.weight encoder.model.attentions.1.q_lin.bias   encoder.model.attentions.1.k_lin.weight encoder.model.attentions.1.k_lin.bias   encoder.model.attentions.1.v_lin.weight encoder.model.attentions.1.v_lin.bias   encoder.model.attentions.1.out_lin.weight   encoder.model.attentions.1.out_lin.bias encoder.model.attentions.2.q_lin.weight encoder.model.attentions.2.q_lin.bias   encoder.model.attentions.2.k_lin.weight encoder.model.attentions.2.k_lin.bias   encoder.model.attentions.2.v_lin.weight encoder.model.attentions.2.v_lin.bias   encoder.model.attentions.2.out_lin.weight   encoder.model.attentions.2.out_lin.bias encoder.model.attentions.3.q_lin.weight encoder.model.attentions.3.q_lin.bias   encoder.model.attentions.3.k_lin.weight encoder.model.attentions.3.k_lin.bias   encoder.model.attentions.3.v_lin.weight encoder.model.attentions.3.v_lin.bias   encoder.model.attentions.3.out_lin.weight   encoder.model.attentions.3.out_lin.bias encoder.model.attentions.4.q_lin.weight encoder.model.attentions.4.q_lin.bias   encoder.model.attentions.4.k_lin.weight encoder.model.attentions.4.k_lin.bias   encoder.model.attentions.4.v_lin.weight encoder.model.attentions.4.v_lin.bias   encoder.model.attentions.4.out_lin.weight   encoder.model.attentions.4.out_lin.bias encoder.model.attentions.5.q_lin.weight encoder.model.attentions.5.q_lin.bias   encoder.model.attentions.5.k_lin.weight encoder.model.attentions.5.k_lin.bias   
encoder.model.attentions.5.v_lin.weight encoder.model.attentions.5.v_lin.bias   encoder.model.attentions.5.out_lin.weight   encoder.model.attentions.5.out_lin.bias encoder.model.layer_norm1.0.weight  encoder.model.layer_norm1.0.bias    encoder.model.layer_norm1.1.weight  encoder.model.layer_norm1.1.bias    encoder.model.layer_norm1.2.weight  encoder.model.layer_norm1.2.bias    encoder.model.layer_norm1.3.weight  encoder.model.layer_norm1.3.bias    encoder.model.layer_norm1.4.weight  encoder.model.layer_norm1.4.bias    encoder.model.layer_norm1.5.weight  encoder.model.layer_norm1.5.bias    encoder.model.ffns.0.lin1.weight    encoder.model.ffns.0.lin1.bias  encoder.model.ffns.0.lin2.weight    encoder.model.ffns.0.lin2.bias  encoder.model.ffns.1.lin1.weight    encoder.model.ffns.1.lin1.bias  encoder.model.ffns.1.lin2.weight    encoder.model.ffns.1.lin2.bias  encoder.model.ffns.2.lin1.weight    encoder.model.ffns.2.lin1.bias  encoder.model.ffns.2.lin2.weight    encoder.model.ffns.2.lin2.bias  encoder.model.ffns.3.lin1.weight    encoder.model.ffns.3.lin1.bias  encoder.model.ffns.3.lin2.weight    encoder.model.ffns.3.lin2.bias  encoder.model.ffns.4.lin1.weight    encoder.model.ffns.4.lin1.bias  encoder.model.ffns.4.lin2.weight    encoder.model.ffns.4.lin2.bias  encoder.model.ffns.5.lin1.weight    encoder.model.ffns.5.lin1.bias  encoder.model.ffns.5.lin2.weight    encoder.model.ffns.5.lin2.bias  encoder.model.layer_norm2.0.weight  encoder.model.layer_norm2.0.bias    encoder.model.layer_norm2.1.weight  encoder.model.layer_norm2.1.bias    encoder.model.layer_norm2.2.weight  encoder.model.layer_norm2.2.bias    encoder.model.layer_norm2.3.weight  encoder.model.layer_norm2.3.bias    encoder.model.layer_norm2.4.weight  encoder.model.layer_norm2.4.bias    encoder.model.layer_norm2.5.weight  encoder.model.layer_norm2.5.bias    encoder.model.pred_layer.proj.bias  decoder.embed_out   decoder.embed_tokens.weight decoder.layers.0.self_attn.in_proj_weight   
decoder.layers.0.self_attn.in_proj_bias decoder.layers.0.self_attn.out_proj.weight  decoder.layers.0.self_attn.out_proj.bias    decoder.layers.0.self_attn_layer_norm.weight    decoder.layers.0.self_attn_layer_norm.bias  decoder.layers.0.encoder_attn.in_proj_weight    decoder.layers.0.encoder_attn.in_proj_bias  decoder.layers.0.encoder_attn.out_proj.weight   decoder.layers.0.encoder_attn.out_proj.bias decoder.layers.0.encoder_attn_layer_norm.weight decoder.layers.0.encoder_attn_layer_norm.bias   decoder.layers.0.fc1.weight decoder.layers.0.fc1.bias   decoder.layers.0.fc2.weight decoder.layers.0.fc2.bias   decoder.layers.0.final_layer_norm.weight    decoder.layers.0.final_layer_norm.bias  decoder.layers.1.self_attn.in_proj_weight   decoder.layers.1.self_attn.in_proj_bias decoder.layers.1.self_attn.out_proj.weight  decoder.layers.1.self_attn.out_proj.bias    decoder.layers.1.self_attn_layer_norm.weight    decoder.layers.1.self_attn_layer_norm.bias  decoder.layers.1.encoder_attn.in_proj_weight    decoder.layers.1.encoder_attn.in_proj_bias  decoder.layers.1.encoder_attn.out_proj.weight   decoder.layers.1.encoder_attn.out_proj.bias decoder.layers.1.encoder_attn_layer_norm.weight decoder.layers.1.encoder_attn_layer_norm.bias   decoder.layers.1.fc1.weight decoder.layers.1.fc1.bias   decoder.layers.1.fc2.weight decoder.layers.1.fc2.bias   decoder.layers.1.final_layer_norm.weight    decoder.layers.1.final_layer_norm.bias  decoder.layers.2.self_attn.in_proj_weight   decoder.layers.2.self_attn.in_proj_bias decoder.layers.2.self_attn.out_proj.weight  decoder.layers.2.self_attn.out_proj.bias    decoder.layers.2.self_attn_layer_norm.weight    decoder.layers.2.self_attn_layer_norm.bias  decoder.layers.2.encoder_attn.in_proj_weight    decoder.layers.2.encoder_attn.in_proj_bias  decoder.layers.2.encoder_attn.out_proj.weight   decoder.layers.2.encoder_attn.out_proj.bias decoder.layers.2.encoder_attn_layer_norm.weight decoder.layers.2.encoder_attn_layer_norm.bias   
decoder.layers.2.fc1.weight decoder.layers.2.fc1.bias   decoder.layers.2.fc2.weight decoder.layers.2.fc2.bias   decoder.layers.2.final_layer_norm.weight    decoder.layers.2.final_layer_norm.bias  decoder.layers.3.self_attn.in_proj_weight   decoder.layers.3.self_attn.in_proj_bias decoder.layers.3.self_attn.out_proj.weight  decoder.layers.3.self_attn.out_proj.bias    decoder.layers.3.self_attn_layer_norm.weight    decoder.layers.3.self_attn_layer_norm.bias  decoder.layers.3.encoder_attn.in_proj_weight    decoder.layers.3.encoder_attn.in_proj_bias  decoder.layers.3.encoder_attn.out_proj.weight   decoder.layers.3.encoder_attn.out_proj.bias decoder.layers.3.encoder_attn_layer_norm.weight decoder.layers.3.encoder_attn_layer_norm.bias   decoder.layers.3.fc1.weight decoder.layers.3.fc1.bias   decoder.layers.3.fc2.weight decoder.layers.3.fc2.bias   decoder.layers.3.final_layer_norm.weight    decoder.layers.3.final_layer_norm.bias  decoder.layers.4.self_attn.in_proj_weight   decoder.layers.4.self_attn.in_proj_bias decoder.layers.4.self_attn.out_proj.weight  decoder.layers.4.self_attn.out_proj.bias    decoder.layers.4.self_attn_layer_norm.weight    decoder.layers.4.self_attn_layer_norm.bias  decoder.layers.4.encoder_attn.in_proj_weight    decoder.layers.4.encoder_attn.in_proj_bias  decoder.layers.4.encoder_attn.out_proj.weight   decoder.layers.4.encoder_attn.out_proj.bias decoder.layers.4.encoder_attn_layer_norm.weight decoder.layers.4.encoder_attn_layer_norm.bias   decoder.layers.4.fc1.weight decoder.layers.4.fc1.bias   decoder.layers.4.fc2.weight decoder.layers.4.fc2.bias   decoder.layers.4.final_layer_norm.weight    decoder.layers.4.final_layer_norm.bias  decoder.layers.5.self_attn.in_proj_weight   decoder.layers.5.self_attn.in_proj_bias decoder.layers.5.self_attn.out_proj.weight  decoder.layers.5.self_attn.out_proj.bias    decoder.layers.5.self_attn_layer_norm.weight    decoder.layers.5.self_attn_layer_norm.bias  decoder.layers.5.encoder_attn.in_proj_weight    
decoder.layers.5.encoder_attn.in_proj_bias  decoder.layers.5.encoder_attn.out_proj.weight   decoder.layers.5.encoder_attn.out_proj.bias decoder.layers.5.encoder_attn_layer_norm.weight decoder.layers.5.encoder_attn_layer_norm.bias   decoder.layers.5.fc1.weight decoder.layers.5.fc1.bias   decoder.layers.5.fc2.weight decoder.layers.5.fc2.bias   decoder.layers.5.final_layer_norm.weight    decoder.layers.5.final_layer_norm.bias
| training on 8 GPUs
| max tokens per GPU = 1536 and max sentences per GPU = None
| WARNING: 20 samples have invalid sizes and will be skipped, max_positions=(512, 512), first few sample ids=[1464114, 1797546, 1513624, 1365715, 1841665, 1553144, 1795797, 1643331, 1583565, 1704988]
| distributed init (rank 0): tcp://localhost:17056
| distributed init (rank 6): tcp://localhost:17056
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=2, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=2, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 64699 types
| [de] dictionary: 64699 types
| data-bin/xlm_pre_NN.ende train 1907028 examples
| data-bin/xlm_pre_NN.ende valid 508 examples
TransformerNNmentLayerModel(
  (encoder): XLM_Encoder(
    (model): TransformerModel(
      (position_embeddings): Embedding(512, 1024)
      (lang_embeddings): Embedding(2, 1024)
      (embeddings): Embedding(64699, 1024, padding_idx=2)
      (layer_norm_emb): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      (attentions): ModuleList(
        (0): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (1): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (2): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (3): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (4): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (5): MultiHeadAttention(
          (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
          (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
        )
      )
      (layer_norm1): ModuleList(
        (0): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (1): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (2): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (3): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (4): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (5): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      )
      (ffns): ModuleList(
        (0): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (1): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (2): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (3): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (4): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (5): TransformerFFN(
          (lin1): Linear(in_features=1024, out_features=4096, bias=True)
          (lin2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (layer_norm2): ModuleList(
        (0): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (1): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (2): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (3): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (4): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
        (5): LayerNorm(torch.Size([1024]), eps=1e-12, elementwise_affine=True)
      )
      (memories): ModuleDict()
      (pred_layer): PredLayer(
        (proj): Linear(in_features=1024, out_features=64699, bias=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (embed_tokens): Embedding(64699, 1024, padding_idx=2)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)
| model big_xlm_translation, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 375705787 (num. trained: 375705787)
| Trainable params.encoder.model.position_embeddings.weight encoder.model.lang_embeddings.weight    encoder.model.embeddings.weight encoder.model.layer_norm_emb.weight encoder.model.layer_norm_emb.bias   encoder.model.attentions.0.q_lin.weight encoder.model.attentions.0.q_lin.bias   encoder.model.attentions.0.k_lin.weight encoder.model.attentions.0.k_lin.bias   encoder.model.attentions.0.v_lin.weight encoder.model.attentions.0.v_lin.bias   encoder.model.attentions.0.out_lin.weight   encoder.model.attentions.0.out_lin.bias encoder.model.attentions.1.q_lin.weight encoder.model.attentions.1.q_lin.bias   encoder.model.attentions.1.k_lin.weight encoder.model.attentions.1.k_lin.bias   encoder.model.attentions.1.v_lin.weight encoder.model.attentions.1.v_lin.bias   encoder.model.attentions.1.out_lin.weight   encoder.model.attentions.1.out_lin.bias encoder.model.attentions.2.q_lin.weight encoder.model.attentions.2.q_lin.bias   encoder.model.attentions.2.k_lin.weight encoder.model.attentions.2.k_lin.bias   encoder.model.attentions.2.v_lin.weight encoder.model.attentions.2.v_lin.bias   encoder.model.attentions.2.out_lin.weight   encoder.model.attentions.2.out_lin.bias encoder.model.attentions.3.q_lin.weight encoder.model.attentions.3.q_lin.bias   encoder.model.attentions.3.k_lin.weight encoder.model.attentions.3.k_lin.bias   encoder.model.attentions.3.v_lin.weight encoder.model.attentions.3.v_lin.bias   encoder.model.attentions.3.out_lin.weight   encoder.model.attentions.3.out_lin.bias encoder.model.attentions.4.q_lin.weight encoder.model.attentions.4.q_lin.bias   encoder.model.attentions.4.k_lin.weight encoder.model.attentions.4.k_lin.bias   encoder.model.attentions.4.v_lin.weight encoder.model.attentions.4.v_lin.bias   encoder.model.attentions.4.out_lin.weight   encoder.model.attentions.4.out_lin.bias encoder.model.attentions.5.q_lin.weight encoder.model.attentions.5.q_lin.bias   encoder.model.attentions.5.k_lin.weight encoder.model.attentions.5.k_lin.bias   
encoder.model.attentions.5.v_lin.weight encoder.model.attentions.5.v_lin.bias   encoder.model.attentions.5.out_lin.weight   encoder.model.attentions.5.out_lin.bias encoder.model.layer_norm1.0.weight  encoder.model.layer_norm1.0.bias    encoder.model.layer_norm1.1.weight  encoder.model.layer_norm1.1.bias    encoder.model.layer_norm1.2.weight  encoder.model.layer_norm1.2.bias    encoder.model.layer_norm1.3.weight  encoder.model.layer_norm1.3.bias    encoder.model.layer_norm1.4.weight  encoder.model.layer_norm1.4.bias    encoder.model.layer_norm1.5.weight  encoder.model.layer_norm1.5.bias    encoder.model.ffns.0.lin1.weight    encoder.model.ffns.0.lin1.bias  encoder.model.ffns.0.lin2.weight    encoder.model.ffns.0.lin2.bias  encoder.model.ffns.1.lin1.weight    encoder.model.ffns.1.lin1.bias  encoder.model.ffns.1.lin2.weight    encoder.model.ffns.1.lin2.bias  encoder.model.ffns.2.lin1.weight    encoder.model.ffns.2.lin1.bias  encoder.model.ffns.2.lin2.weight    encoder.model.ffns.2.lin2.bias  encoder.model.ffns.3.lin1.weight    encoder.model.ffns.3.lin1.bias  encoder.model.ffns.3.lin2.weight    encoder.model.ffns.3.lin2.bias  encoder.model.ffns.4.lin1.weight    encoder.model.ffns.4.lin1.bias  encoder.model.ffns.4.lin2.weight    encoder.model.ffns.4.lin2.bias  encoder.model.ffns.5.lin1.weight    encoder.model.ffns.5.lin1.bias  encoder.model.ffns.5.lin2.weight    encoder.model.ffns.5.lin2.bias  encoder.model.layer_norm2.0.weight  encoder.model.layer_norm2.0.bias    encoder.model.layer_norm2.1.weight  encoder.model.layer_norm2.1.bias    encoder.model.layer_norm2.2.weight  encoder.model.layer_norm2.2.bias    encoder.model.layer_norm2.3.weight  encoder.model.layer_norm2.3.bias    encoder.model.layer_norm2.4.weight  encoder.model.layer_norm2.4.bias    encoder.model.layer_norm2.5.weight  encoder.model.layer_norm2.5.bias    encoder.model.pred_layer.proj.bias  decoder.embed_out   decoder.embed_tokens.weight decoder.layers.0.self_attn.in_proj_weight   
decoder.layers.0.self_attn.in_proj_bias decoder.layers.0.self_attn.out_proj.weight  decoder.layers.0.self_attn.out_proj.bias    decoder.layers.0.self_attn_layer_norm.weight    decoder.layers.0.self_attn_layer_norm.bias  decoder.layers.0.encoder_attn.in_proj_weight    decoder.layers.0.encoder_attn.in_proj_bias  decoder.layers.0.encoder_attn.out_proj.weight   decoder.layers.0.encoder_attn.out_proj.bias decoder.layers.0.encoder_attn_layer_norm.weight decoder.layers.0.encoder_attn_layer_norm.bias   decoder.layers.0.fc1.weight decoder.layers.0.fc1.bias   decoder.layers.0.fc2.weight decoder.layers.0.fc2.bias   decoder.layers.0.final_layer_norm.weight    decoder.layers.0.final_layer_norm.bias  decoder.layers.1.self_attn.in_proj_weight   decoder.layers.1.self_attn.in_proj_bias decoder.layers.1.self_attn.out_proj.weight  decoder.layers.1.self_attn.out_proj.bias    decoder.layers.1.self_attn_layer_norm.weight    decoder.layers.1.self_attn_layer_norm.bias  decoder.layers.1.encoder_attn.in_proj_weight    decoder.layers.1.encoder_attn.in_proj_bias  decoder.layers.1.encoder_attn.out_proj.weight   decoder.layers.1.encoder_attn.out_proj.bias decoder.layers.1.encoder_attn_layer_norm.weight decoder.layers.1.encoder_attn_layer_norm.bias   decoder.layers.1.fc1.weight decoder.layers.1.fc1.bias   decoder.layers.1.fc2.weight decoder.layers.1.fc2.bias   decoder.layers.1.final_layer_norm.weight    decoder.layers.1.final_layer_norm.bias  decoder.layers.2.self_attn.in_proj_weight   decoder.layers.2.self_attn.in_proj_bias decoder.layers.2.self_attn.out_proj.weight  decoder.layers.2.self_attn.out_proj.bias    decoder.layers.2.self_attn_layer_norm.weight    decoder.layers.2.self_attn_layer_norm.bias  decoder.layers.2.encoder_attn.in_proj_weight    decoder.layers.2.encoder_attn.in_proj_bias  decoder.layers.2.encoder_attn.out_proj.weight   decoder.layers.2.encoder_attn.out_proj.bias decoder.layers.2.encoder_attn_layer_norm.weight decoder.layers.2.encoder_attn_layer_norm.bias   
decoder.layers.2.fc1.weight decoder.layers.2.fc1.bias   decoder.layers.2.fc2.weight decoder.layers.2.fc2.bias   decoder.layers.2.final_layer_norm.weight    decoder.layers.2.final_layer_norm.bias  decoder.layers.3.self_attn.in_proj_weight   decoder.layers.3.self_attn.in_proj_bias decoder.layers.3.self_attn.out_proj.weight  decoder.layers.3.self_attn.out_proj.bias    decoder.layers.3.self_attn_layer_norm.weight    decoder.layers.3.self_attn_layer_norm.bias  decoder.layers.3.encoder_attn.in_proj_weight    decoder.layers.3.encoder_attn.in_proj_bias  decoder.layers.3.encoder_attn.out_proj.weight   decoder.layers.3.encoder_attn.out_proj.bias decoder.layers.3.encoder_attn_layer_norm.weight decoder.layers.3.encoder_attn_layer_norm.bias   decoder.layers.3.fc1.weight decoder.layers.3.fc1.bias   decoder.layers.3.fc2.weight decoder.layers.3.fc2.bias   decoder.layers.3.final_layer_norm.weight    decoder.layers.3.final_layer_norm.bias  decoder.layers.4.self_attn.in_proj_weight   decoder.layers.4.self_attn.in_proj_bias decoder.layers.4.self_attn.out_proj.weight  decoder.layers.4.self_attn.out_proj.bias    decoder.layers.4.self_attn_layer_norm.weight    decoder.layers.4.self_attn_layer_norm.bias  decoder.layers.4.encoder_attn.in_proj_weight    decoder.layers.4.encoder_attn.in_proj_bias  decoder.layers.4.encoder_attn.out_proj.weight   decoder.layers.4.encoder_attn.out_proj.bias decoder.layers.4.encoder_attn_layer_norm.weight decoder.layers.4.encoder_attn_layer_norm.bias   decoder.layers.4.fc1.weight decoder.layers.4.fc1.bias   decoder.layers.4.fc2.weight decoder.layers.4.fc2.bias   decoder.layers.4.final_layer_norm.weight    decoder.layers.4.final_layer_norm.bias  decoder.layers.5.self_attn.in_proj_weight   decoder.layers.5.self_attn.in_proj_bias decoder.layers.5.self_attn.out_proj.weight  decoder.layers.5.self_attn.out_proj.bias    decoder.layers.5.self_attn_layer_norm.weight    decoder.layers.5.self_attn_layer_norm.bias  decoder.layers.5.encoder_attn.in_proj_weight    
decoder.layers.5.encoder_attn.in_proj_bias  decoder.layers.5.encoder_attn.out_proj.weight   decoder.layers.5.encoder_attn.out_proj.bias decoder.layers.5.encoder_attn_layer_norm.weight decoder.layers.5.encoder_attn_layer_norm.bias   decoder.layers.5.fc1.weight decoder.layers.5.fc1.bias   decoder.layers.5.fc2.weight decoder.layers.5.fc2.bias   decoder.layers.5.final_layer_norm.weight    decoder.layers.5.final_layer_norm.bias
| training on 8 GPUs
| max tokens per GPU = 1536 and max sentences per GPU = None
| WARNING: 20 samples have invalid sizes and will be skipped, max_positions=(512, 512), first few sample ids=[1464114, 1797546, 1513624, 1365715, 1841665, 1553144, 1795797, 1643331, 1583565, 1704988]
| distributed init (rank 2): tcp://localhost:17056
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=4, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=4, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| Trainable params.encoder.model.position_embeddings.weight encoder.model.lang_embeddings.weight    encoder.model.embeddings.weight encoder.model.layer_norm_emb.weight encoder.model.layer_norm_emb.bias   encoder.model.attentions.0.q_lin.weight encoder.model.attentions.0.q_lin.bias   encoder.model.attentions.0.k_lin.weight encoder.model.attentions.0.k_lin.bias   encoder.model.attentions.0.v_lin.weight encoder.model.attentions.0.v_lin.bias   encoder.model.attentions.0.out_lin.weight   encoder.model.attentions.0.out_lin.bias encoder.model.attentions.1.q_lin.weight encoder.model.attentions.1.q_lin.bias   encoder.model.attentions.1.k_lin.weight encoder.model.attentions.1.k_lin.bias   encoder.model.attentions.1.v_lin.weight encoder.model.attentions.1.v_lin.bias   encoder.model.attentions.1.out_lin.weight   encoder.model.attentions.1.out_lin.bias encoder.model.attentions.2.q_lin.weight encoder.model.attentions.2.q_lin.bias   encoder.model.attentions.2.k_lin.weight encoder.model.attentions.2.k_lin.bias   encoder.model.attentions.2.v_lin.weight encoder.model.attentions.2.v_lin.bias   encoder.model.attentions.2.out_lin.weight   encoder.model.attentions.2.out_lin.bias encoder.model.attentions.3.q_lin.weight encoder.model.attentions.3.q_lin.bias   encoder.model.attentions.3.k_lin.weight encoder.model.attentions.3.k_lin.bias   encoder.model.attentions.3.v_lin.weight encoder.model.attentions.3.v_lin.bias   encoder.model.attentions.3.out_lin.weight   encoder.model.attentions.3.out_lin.bias encoder.model.attentions.4.q_lin.weight encoder.model.attentions.4.q_lin.bias   encoder.model.attentions.4.k_lin.weight encoder.model.attentions.4.k_lin.bias   encoder.model.attentions.4.v_lin.weight encoder.model.attentions.4.v_lin.bias   encoder.model.attentions.4.out_lin.weight   encoder.model.attentions.4.out_lin.bias encoder.model.attentions.5.q_lin.weight encoder.model.attentions.5.q_lin.bias   encoder.model.attentions.5.k_lin.weight encoder.model.attentions.5.k_lin.bias   
encoder.model.attentions.5.v_lin.weight encoder.model.attentions.5.v_lin.bias   encoder.model.attentions.5.out_lin.weight   encoder.model.attentions.5.out_lin.bias encoder.model.layer_norm1.0.weight  encoder.model.layer_norm1.0.bias    encoder.model.layer_norm1.1.weight  encoder.model.layer_norm1.1.bias    encoder.model.layer_norm1.2.weight  encoder.model.layer_norm1.2.bias    encoder.model.layer_norm1.3.weight  encoder.model.layer_norm1.3.bias    encoder.model.layer_norm1.4.weight  encoder.model.layer_norm1.4.bias    encoder.model.layer_norm1.5.weight  encoder.model.layer_norm1.5.bias    encoder.model.ffns.0.lin1.weight    encoder.model.ffns.0.lin1.bias  encoder.model.ffns.0.lin2.weight    encoder.model.ffns.0.lin2.bias  encoder.model.ffns.1.lin1.weight    encoder.model.ffns.1.lin1.bias  encoder.model.ffns.1.lin2.weight    encoder.model.ffns.1.lin2.bias  encoder.model.ffns.2.lin1.weight    encoder.model.ffns.2.lin1.bias  encoder.model.ffns.2.lin2.weight    encoder.model.ffns.2.lin2.bias  encoder.model.ffns.3.lin1.weight    encoder.model.ffns.3.lin1.bias  encoder.model.ffns.3.lin2.weight    encoder.model.ffns.3.lin2.bias  encoder.model.ffns.4.lin1.weight    encoder.model.ffns.4.lin1.bias  encoder.model.ffns.4.lin2.weight    encoder.model.ffns.4.lin2.bias  encoder.model.ffns.5.lin1.weight    encoder.model.ffns.5.lin1.bias  encoder.model.ffns.5.lin2.weight    encoder.model.ffns.5.lin2.bias  encoder.model.layer_norm2.0.weight  encoder.model.layer_norm2.0.bias    encoder.model.layer_norm2.1.weight  encoder.model.layer_norm2.1.bias    encoder.model.layer_norm2.2.weight  encoder.model.layer_norm2.2.bias    encoder.model.layer_norm2.3.weight  encoder.model.layer_norm2.3.bias    encoder.model.layer_norm2.4.weight  encoder.model.layer_norm2.4.bias    encoder.model.layer_norm2.5.weight  encoder.model.layer_norm2.5.bias    encoder.model.pred_layer.proj.bias  decoder.embed_out   decoder.embed_tokens.weight decoder.layers.0.self_attn.in_proj_weight   
decoder.layers.0.self_attn.in_proj_bias decoder.layers.0.self_attn.out_proj.weight  decoder.layers.0.self_attn.out_proj.bias    decoder.layers.0.self_attn_layer_norm.weight    decoder.layers.0.self_attn_layer_norm.bias  decoder.layers.0.encoder_attn.in_proj_weight    decoder.layers.0.encoder_attn.in_proj_bias  decoder.layers.0.encoder_attn.out_proj.weight   decoder.layers.0.encoder_attn.out_proj.bias decoder.layers.0.encoder_attn_layer_norm.weight decoder.layers.0.encoder_attn_layer_norm.bias   decoder.layers.0.fc1.weight decoder.layers.0.fc1.bias   decoder.layers.0.fc2.weight decoder.layers.0.fc2.bias   decoder.layers.0.final_layer_norm.weight    decoder.layers.0.final_layer_norm.bias  decoder.layers.1.self_attn.in_proj_weight   decoder.layers.1.self_attn.in_proj_bias decoder.layers.1.self_attn.out_proj.weight  decoder.layers.1.self_attn.out_proj.bias    decoder.layers.1.self_attn_layer_norm.weight    decoder.layers.1.self_attn_layer_norm.bias  decoder.layers.1.encoder_attn.in_proj_weight    decoder.layers.1.encoder_attn.in_proj_bias  decoder.layers.1.encoder_attn.out_proj.weight   decoder.layers.1.encoder_attn.out_proj.bias decoder.layers.1.encoder_attn_layer_norm.weight decoder.layers.1.encoder_attn_layer_norm.bias   decoder.layers.1.fc1.weight decoder.layers.1.fc1.bias   decoder.layers.1.fc2.weight decoder.layers.1.fc2.bias   decoder.layers.1.final_layer_norm.weight    decoder.layers.1.final_layer_norm.bias  decoder.layers.2.self_attn.in_proj_weight   decoder.layers.2.self_attn.in_proj_bias decoder.layers.2.self_attn.out_proj.weight  decoder.layers.2.self_attn.out_proj.bias    decoder.layers.2.self_attn_layer_norm.weight    decoder.layers.2.self_attn_layer_norm.bias  decoder.layers.2.encoder_attn.in_proj_weight    decoder.layers.2.encoder_attn.in_proj_bias  decoder.layers.2.encoder_attn.out_proj.weight   decoder.layers.2.encoder_attn.out_proj.bias decoder.layers.2.encoder_attn_layer_norm.weight decoder.layers.2.encoder_attn_layer_norm.bias   
decoder.layers.2.fc1.weight decoder.layers.2.fc1.bias   decoder.layers.2.fc2.weight decoder.layers.2.fc2.bias   decoder.layers.2.final_layer_norm.weight    decoder.layers.2.final_layer_norm.bias  decoder.layers.3.self_attn.in_proj_weight   decoder.layers.3.self_attn.in_proj_bias decoder.layers.3.self_attn.out_proj.weight  decoder.layers.3.self_attn.out_proj.bias    decoder.layers.3.self_attn_layer_norm.weight    decoder.layers.3.self_attn_layer_norm.bias  decoder.layers.3.encoder_attn.in_proj_weight    decoder.layers.3.encoder_attn.in_proj_bias  decoder.layers.3.encoder_attn.out_proj.weight   decoder.layers.3.encoder_attn.out_proj.bias decoder.layers.3.encoder_attn_layer_norm.weight decoder.layers.3.encoder_attn_layer_norm.bias   decoder.layers.3.fc1.weight decoder.layers.3.fc1.bias   decoder.layers.3.fc2.weight decoder.layers.3.fc2.bias   decoder.layers.3.final_layer_norm.weight    decoder.layers.3.final_layer_norm.bias  decoder.layers.4.self_attn.in_proj_weight   decoder.layers.4.self_attn.in_proj_bias decoder.layers.4.self_attn.out_proj.weight  decoder.layers.4.self_attn.out_proj.bias    decoder.layers.4.self_attn_layer_norm.weight    decoder.layers.4.self_attn_layer_norm.bias  decoder.layers.4.encoder_attn.in_proj_weight    decoder.layers.4.encoder_attn.in_proj_bias  decoder.layers.4.encoder_attn.out_proj.weight   decoder.layers.4.encoder_attn.out_proj.bias decoder.layers.4.encoder_attn_layer_norm.weight decoder.layers.4.encoder_attn_layer_norm.bias   decoder.layers.4.fc1.weight decoder.layers.4.fc1.bias   decoder.layers.4.fc2.weight decoder.layers.4.fc2.bias   decoder.layers.4.final_layer_norm.weight    decoder.layers.4.final_layer_norm.bias  decoder.layers.5.self_attn.in_proj_weight   decoder.layers.5.self_attn.in_proj_bias decoder.layers.5.self_attn.out_proj.weight  decoder.layers.5.self_attn.out_proj.bias    decoder.layers.5.self_attn_layer_norm.weight    decoder.layers.5.self_attn_layer_norm.bias  decoder.layers.5.encoder_attn.in_proj_weight    
decoder.layers.5.encoder_attn.in_proj_bias  decoder.layers.5.encoder_attn.out_proj.weight   decoder.layers.5.encoder_attn.out_proj.bias decoder.layers.5.encoder_attn_layer_norm.weight decoder.layers.5.encoder_attn_layer_norm.bias   decoder.layers.5.fc1.weight decoder.layers.5.fc1.bias   decoder.layers.5.fc2.weight decoder.layers.5.fc2.bias   decoder.layers.5.final_layer_norm.weight    decoder.layers.5.final_layer_norm.bias
| training on 8 GPUs
| max tokens per GPU = 1536 and max sentences per GPU = None
| WARNING: 20 samples have invalid sizes and will be skipped, max_positions=(512, 512), first few sample ids=[1464114, 1797546, 1513624, 1365715, 1841665, 1553144, 1795797, 1643331, 1583565, 1704988]
| distributed init (rank 4): tcp://localhost:17056
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='big_xlm_translation', article_transformer_decoder_layer=False, attention_dropout=0.0, attn_fertility_method=None, bucket_cap_mb=25, clip_norm=0.0, continue_n_epoch=100000000, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/xlm_pre_NN.ende'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, dependency_nn_layers=None, device_id=3, distributed_backend='nccl', distributed_init_method='tcp://localhost:17056', distributed_port=-1, distributed_rank=3, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_trainable='True', eval_gpu=99, extract_from_encoder_out_plus_word_embedding='False', extract_selfatt_layers=None, extracta_nodrop=False, extracta_path=None, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gold_data=None, gumbel_softmax_warm=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lambda_attention_weight_dependency=None, lambda_attn_entropy=None, lambda_attn_fertility_loss=None, lambda_enc_weight=None, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_a_method='', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=512, max_target_positions=512, max_tokens=1536, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, 
no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, no_update_n_model=5, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_extra_loss_ratio=False, raw_text=False, reload_xlm_ckpt='ckpts/mlm_ende_1024.pth', relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_a_only_init=False, restore_NN_layer='', restore_file='checkpoint_last.pt', restore_max_a=False, restore_transformer='', save_dir='ckpts/ende_xlm_translation', save_interval=1, save_interval_updates=3000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, sol3_lambda_dynamic_window_size=None, sol3_lambda_window_loss=None, sol3_window_size=None, sol4_dynamic2d=None, sol4_lambda_one2d=None, sol4_one2d=None, source_lang='en', src_act_path=None, store_attention_matrix='', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_NN_layer=False, train_subset='train', update_freq=[28], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 64699 types
| [de] dictionary: 64699 types
| data-bin/xlm_pre_NN.ende train 1907028 examples
| data-bin/xlm_pre_NN.ende valid 508 examples
| distributed init (rank 3): tcp://localhost:17056
| initialized host g48r6.tranx.nt12 as rank 0
| no existing checkpoint found ckpts/ende_xlm_translation/checkpoint_last.pt
| max_epoch inf
lr:True epoch:True  update:True
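Note on the settings printed above: the short sketch below works out the token budget per update and the learning rate during warmup from the logged values (8 GPUs, max_tokens=1536, update_freq=[28], lr=[0.001], warmup_init_lr=1e-07, warmup_updates=4000). The function names are illustrative only, and the schedule is written from my understanding of fairseq's inverse_sqrt scheduler (linear warmup, then decay proportional to the inverse square root of the update number), so treat it as an assumption rather than the library's code.

```python
# Minimal sketch, assuming fairseq-style inverse_sqrt scheduling.
# All default values are copied from the Namespace(...) line in the log;
# the function names are hypothetical helpers, not fairseq APIs.

def max_tokens_per_update(num_gpus=8, max_tokens=1536, update_freq=28):
    """Upper bound on tokens consumed by one optimizer step."""
    return num_gpus * max_tokens * update_freq


def inverse_sqrt_lr(num_updates, lr=1e-3, warmup_init_lr=1e-7, warmup_updates=4000):
    """Linear warmup to `lr`, then decay as 1/sqrt(num_updates)."""
    if num_updates < warmup_updates:
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    return lr * (warmup_updates ** 0.5) * (num_updates ** -0.5)


if __name__ == "__main__":
    print("token cap per update:", max_tokens_per_update())
    for n in (226, 452, 1808, 4000, 16000):
        print(n, inverse_sqrt_lr(n))
```

With these settings the cap is 8 × 1536 × 28 = 344064 tokens per update; the logged wpb of about 281029 sits below that cap, presumably because batches are not packed completely full. The warmup values this produces agree with the lr column of the epoch lines (for example 5.65943e-05 at update 226), so at 226 updates per epoch the run only reaches the peak lr of 0.001 at update 4000, i.e. during epoch 18.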
| epoch 001 | loss 12.148 | nll_loss 11.561 | ppl 3020.38 | wps 27427 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 226 | lr 5.65943e-05 | gnorm 1.582 | clip 0.000 | oom 0.000 | wall 2332 | train_wall 2269
| epoch 001 | valid on 'valid' subset | loss 9.421 | nll_loss 8.334 | ppl 322.71 | num_updates 226
| saved checkpoint ckpts/ende_xlm_translation/checkpoint1.pt (epoch 1 @ 226 updates) (writing took 183.28358149528503 seconds)
| epoch 002 | loss 8.246 | nll_loss 7.050 | ppl 132.53 | wps 27439 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 452 | lr 0.000113089 | gnorm 1.477 | clip 0.000 | oom 0.000 | wall 4834 | train_wall 4536
| epoch 002 | valid on 'valid' subset | loss 7.154 | nll_loss 5.639 | ppl 49.85 | num_updates 452 | best_loss 7.15413
| saved checkpoint ckpts/ende_xlm_translation/checkpoint2.pt (epoch 2 @ 452 updates) (writing took 206.17210292816162 seconds)
| epoch 003 | loss 6.145 | nll_loss 4.657 | ppl 25.23 | wps 27443 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 678 | lr 0.000169583 | gnorm 1.684 | clip 0.000 | oom 0.000 | wall 7357 | train_wall 6803
| epoch 003 | valid on 'valid' subset | loss 5.688 | nll_loss 3.971 | ppl 15.68 | num_updates 678 | best_loss 5.68826
| saved checkpoint ckpts/ende_xlm_translation/checkpoint3.pt (epoch 3 @ 678 updates) (writing took 222.79179120063782 seconds)
| epoch 004 | loss 5.093 | nll_loss 3.462 | ppl 11.02 | wps 27411 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 904 | lr 0.000226077 | gnorm 1.219 | clip 0.000 | oom 0.000 | wall 9900 | train_wall 9073
| epoch 004 | valid on 'valid' subset | loss 5.093 | nll_loss 3.290 | ppl 9.78 | num_updates 904 | best_loss 5.09263
| saved checkpoint ckpts/ende_xlm_translation/checkpoint4.pt (epoch 4 @ 904 updates) (writing took 195.94781613349915 seconds)
| epoch 005 | loss 4.547 | nll_loss 2.846 | ppl 7.19 | wps 27452 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 1130 | lr 0.000282572 | gnorm 0.836 | clip 0.000 | oom 0.000 | wall 12413 | train_wall 11339
| epoch 005 | valid on 'valid' subset | loss 5.123 | nll_loss 3.347 | ppl 10.18 | num_updates 1130 | best_loss 5.09263
| saved checkpoint ckpts/ende_xlm_translation/checkpoint5.pt (epoch 5 @ 1130 updates) (writing took 131.3478319644928 seconds)
| epoch 006 | loss 4.242 | nll_loss 2.506 | ppl 5.68 | wps 27419 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 1356 | lr 0.000339066 | gnorm 0.632 | clip 0.000 | oom 0.000 | wall 14864 | train_wall 13607
| epoch 006 | valid on 'valid' subset | loss 5.079 | nll_loss 3.326 | ppl 10.03 | num_updates 1356 | best_loss 5.07872
| saved checkpoint ckpts/ende_xlm_translation/checkpoint6.pt (epoch 6 @ 1356 updates) (writing took 197.691486120224 seconds)
| epoch 007 | loss 4.022 | nll_loss 2.263 | ppl 4.80 | wps 27447 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 1582 | lr 0.00039556 | gnorm 0.504 | clip 0.000 | oom 0.000 | wall 17380 | train_wall 15874
| epoch 007 | valid on 'valid' subset | loss 5.027 | nll_loss 3.273 | ppl 9.67 | num_updates 1582 | best_loss 5.02663
| saved checkpoint ckpts/ende_xlm_translation/checkpoint7.pt (epoch 7 @ 1582 updates) (writing took 205.56658792495728 seconds)
| epoch 008 | loss 4.066 | nll_loss 2.312 | ppl 4.97 | wps 27457 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 1808 | lr 0.000452055 | gnorm 0.750 | clip 0.000 | oom 0.000 | wall 19902 | train_wall 18139
| epoch 008 | valid on 'valid' subset | loss 4.774 | nll_loss 3.010 | ppl 8.06 | num_updates 1808 | best_loss 4.77382
| saved checkpoint ckpts/ende_xlm_translation/checkpoint8.pt (epoch 8 @ 1808 updates) (writing took 162.43307328224182 seconds)
| epoch 009 | loss 3.820 | nll_loss 2.043 | ppl 4.12 | wps 27428 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 2034 | lr 0.000508549 | gnorm 0.382 | clip 0.000 | oom 0.000 | wall 22383 | train_wall 20406
| epoch 009 | valid on 'valid' subset | loss 4.530 | nll_loss 2.728 | ppl 6.63 | num_updates 2034 | best_loss 4.53034
| saved checkpoint ckpts/ende_xlm_translation/checkpoint9.pt (epoch 9 @ 2034 updates) (writing took 364.2120928764343 seconds)
| epoch 010 | loss 3.736 | nll_loss 1.952 | ppl 3.87 | wps 27431 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 2260 | lr 0.000565043 | gnorm 0.373 | clip 0.000 | oom 0.000 | wall 25066 | train_wall 22673
| epoch 010 | valid on 'valid' subset | loss 4.207 | nll_loss 2.358 | ppl 5.13 | num_updates 2260 | best_loss 4.20735
| saved checkpoint ckpts/ende_xlm_translation/checkpoint10.pt (epoch 10 @ 2260 updates) (writing took 354.6733078956604 seconds)
| epoch 011 | loss 3.662 | nll_loss 1.872 | ppl 3.66 | wps 27425 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 2486 | lr 0.000621538 | gnorm 0.353 | clip 0.000 | oom 0.000 | wall 27740 | train_wall 24941
| epoch 011 | valid on 'valid' subset | loss 4.326 | nll_loss 2.507 | ppl 5.69 | num_updates 2486 | best_loss 4.20735
| saved checkpoint ckpts/ende_xlm_translation/checkpoint11.pt (epoch 11 @ 2486 updates) (writing took 231.27847623825073 seconds)
| epoch 012 | loss 3.607 | nll_loss 1.813 | ppl 3.51 | wps 27417 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 2712 | lr 0.000678032 | gnorm 0.339 | clip 0.000 | oom 0.000 | wall 30291 | train_wall 27209
| epoch 012 | valid on 'valid' subset | loss 4.105 | nll_loss 2.258 | ppl 4.78 | num_updates 2712 | best_loss 4.10489
| saved checkpoint ckpts/ende_xlm_translation/checkpoint12.pt (epoch 12 @ 2712 updates) (writing took 374.0683243274689 seconds)
| epoch 013 | loss 3.575 | nll_loss 1.779 | ppl 3.43 | wps 27420 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 2938 | lr 0.000734527 | gnorm 0.353 | clip 0.000 | oom 0.000 | wall 32984 | train_wall 29478
| epoch 013 | valid on 'valid' subset | loss 4.173 | nll_loss 2.356 | ppl 5.12 | num_updates 2938 | best_loss 4.10489
| saved checkpoint ckpts/ende_xlm_translation/checkpoint13.pt (epoch 13 @ 2938 updates) (writing took 305.80383133888245 seconds)
| epoch 014 | valid on 'valid' subset | loss 4.161 | nll_loss 2.296 | ppl 4.91 | num_updates 3000 | best_loss 4.10489
| saved checkpoint ckpts/ende_xlm_translation/checkpoint_14_3000.pt (epoch 14 @ 3000 updates) (writing took 290.9827673435211 seconds)
| epoch 014 | loss 3.535 | nll_loss 1.736 | ppl 3.33 | wps 24349 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 3164 | lr 0.000791021 | gnorm 0.340 | clip 0.000 | oom 0.000 | wall 35900 | train_wall 31745
| epoch 014 | valid on 'valid' subset | loss 4.077 | nll_loss 2.247 | ppl 4.75 | num_updates 3164 | best_loss 4.07664
| saved checkpoint ckpts/ende_xlm_translation/checkpoint14.pt (epoch 14 @ 3164 updates) (writing took 319.8115351200104 seconds)
| epoch 015 | loss 3.506 | nll_loss 1.705 | ppl 3.26 | wps 27459 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 3390 | lr 0.000847515 | gnorm 0.340 | clip 0.000 | oom 0.000 | wall 38536 | train_wall 34008
| epoch 015 | valid on 'valid' subset | loss 4.059 | nll_loss 2.236 | ppl 4.71 | num_updates 3390 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint15.pt (epoch 15 @ 3390 updates) (writing took 378.57691860198975 seconds)
| epoch 016 | loss 3.484 | nll_loss 1.681 | ppl 3.21 | wps 27443 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 3616 | lr 0.00090401 | gnorm 0.338 | clip 0.000 | oom 0.000 | wall 41233 | train_wall 36275
| epoch 016 | valid on 'valid' subset | loss 4.235 | nll_loss 2.443 | ppl 5.44 | num_updates 3616 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint16.pt (epoch 16 @ 3616 updates) (writing took 244.0232744216919 seconds)
| epoch 017 | loss 3.468 | nll_loss 1.664 | ppl 3.17 | wps 27425 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 3842 | lr 0.000960504 | gnorm 0.360 | clip 0.000 | oom 0.000 | wall 43796 | train_wall 38541
| epoch 017 | valid on 'valid' subset | loss 4.159 | nll_loss 2.373 | ppl 5.18 | num_updates 3842 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint17.pt (epoch 17 @ 3842 updates) (writing took 258.8501696586609 seconds)
| epoch 018 | loss 3.453 | nll_loss 1.648 | ppl 3.13 | wps 27413 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 4068 | lr 0.000991607 | gnorm 0.357 | clip 0.000 | oom 0.000 | wall 46375 | train_wall 40810
| epoch 018 | valid on 'valid' subset | loss 4.086 | nll_loss 2.281 | ppl 4.86 | num_updates 4068 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint18.pt (epoch 18 @ 4068 updates) (writing took 258.033367395401 seconds)
| epoch 019 | loss 3.432 | nll_loss 1.625 | ppl 3.08 | wps 27444 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 4294 | lr 0.000965159 | gnorm 0.345 | clip 0.000 | oom 0.000 | wall 48950 | train_wall 43075
| epoch 019 | valid on 'valid' subset | loss 4.092 | nll_loss 2.280 | ppl 4.86 | num_updates 4294 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint19.pt (epoch 19 @ 4294 updates) (writing took 261.3803300857544 seconds)
| epoch 020 | loss 3.406 | nll_loss 1.597 | ppl 3.03 | wps 27433 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 4520 | lr 0.000940721 | gnorm 0.338 | clip 0.000 | oom 0.000 | wall 51530 | train_wall 45340
| epoch 020 | valid on 'valid' subset | loss 4.098 | nll_loss 2.305 | ppl 4.94 | num_updates 4520 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint20.pt (epoch 20 @ 4520 updates) (writing took 265.834433555603 seconds)
| epoch 021 | loss 3.383 | nll_loss 1.572 | ppl 2.97 | wps 27460 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 4746 | lr 0.00091805 | gnorm 0.336 | clip 0.000 | oom 0.000 | wall 54112 | train_wall 47604
| epoch 021 | valid on 'valid' subset | loss 4.078 | nll_loss 2.269 | ppl 4.82 | num_updates 4746 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint21.pt (epoch 21 @ 4746 updates) (writing took 249.63947892189026 seconds)
| epoch 022 | loss 3.360 | nll_loss 1.547 | ppl 2.92 | wps 27438 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 4972 | lr 0.000896942 | gnorm 0.331 | clip 0.000 | oom 0.000 | wall 56679 | train_wall 49872
| epoch 022 | valid on 'valid' subset | loss 4.077 | nll_loss 2.286 | ppl 4.88 | num_updates 4972 | best_loss 4.05935
| saved checkpoint ckpts/ende_xlm_translation/checkpoint22.pt (epoch 22 @ 4972 updates) (writing took 265.4704740047455 seconds)
| epoch 023 | loss 3.339 | nll_loss 1.523 | ppl 2.87 | wps 27433 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 5198 | lr 0.000877227 | gnorm 0.328 | clip 0.000 | oom 0.000 | wall 59263 | train_wall 52138
| epoch 023 | valid on 'valid' subset | loss 4.054 | nll_loss 2.260 | ppl 4.79 | num_updates 5198 | best_loss 4.0541
| saved checkpoint ckpts/ende_xlm_translation/checkpoint23.pt (epoch 23 @ 5198 updates) (writing took 374.25131940841675 seconds)
| epoch 024 | loss 3.320 | nll_loss 1.502 | ppl 2.83 | wps 27451 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 5424 | lr 0.000858757 | gnorm 0.328 | clip 0.000 | oom 0.000 | wall 61955 | train_wall 54404
| epoch 024 | valid on 'valid' subset | loss 4.011 | nll_loss 2.203 | ppl 4.61 | num_updates 5424 | best_loss 4.01082
| saved checkpoint ckpts/ende_xlm_translation/checkpoint24.pt (epoch 24 @ 5424 updates) (writing took 348.1226255893707 seconds)
| epoch 025 | loss 3.301 | nll_loss 1.481 | ppl 2.79 | wps 27463 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 5650 | lr 0.000841406 | gnorm 0.321 | clip 0.000 | oom 0.000 | wall 64619 | train_wall 56669
| epoch 025 | valid on 'valid' subset | loss 4.046 | nll_loss 2.255 | ppl 4.77 | num_updates 5650 | best_loss 4.01082
| saved checkpoint ckpts/ende_xlm_translation/checkpoint25.pt (epoch 25 @ 5650 updates) (writing took 266.49271941185 seconds)
| epoch 026 | loss 3.283 | nll_loss 1.462 | ppl 2.75 | wps 27451 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 5876 | lr 0.000825067 | gnorm 0.323 | clip 0.000 | oom 0.000 | wall 67203 | train_wall 58934
| epoch 026 | valid on 'valid' subset | loss 4.031 | nll_loss 2.234 | ppl 4.70 | num_updates 5876 | best_loss 4.01082
| saved checkpoint ckpts/ende_xlm_translation/checkpoint26.pt (epoch 26 @ 5876 updates) (writing took 289.72882413864136 seconds)
| epoch 027 | valid on 'valid' subset | loss 4.055 | nll_loss 2.264 | ppl 4.80 | num_updates 6000 | best_loss 4.01082
| saved checkpoint ckpts/ende_xlm_translation/checkpoint_27_6000.pt (epoch 27 @ 6000 updates) (writing took 305.3397607803345 seconds)
| epoch 027 | loss 3.266 | nll_loss 1.443 | ppl 2.72 | wps 24229 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 6102 | lr 0.000809644 | gnorm 0.320 | clip 0.000 | oom 0.000 | wall 70116 | train_wall 61201
| epoch 027 | valid on 'valid' subset | loss 4.082 | nll_loss 2.290 | ppl 4.89 | num_updates 6102 | best_loss 4.01082
| saved checkpoint ckpts/ende_xlm_translation/checkpoint27.pt (epoch 27 @ 6102 updates) (writing took 351.67931962013245 seconds)
| epoch 028 | loss 3.251 | nll_loss 1.426 | ppl 2.69 | wps 27431 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 6328 | lr 0.000795054 | gnorm 0.324 | clip 0.000 | oom 0.000 | wall 72786 | train_wall 63468
| epoch 028 | valid on 'valid' subset | loss 3.978 | nll_loss 2.177 | ppl 4.52 | num_updates 6328 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint28.pt (epoch 28 @ 6328 updates) (writing took 343.29754996299744 seconds)
| epoch 029 | loss 3.236 | nll_loss 1.409 | ppl 2.66 | wps 27446 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 6554 | lr 0.000781226 | gnorm 0.315 | clip 0.000 | oom 0.000 | wall 75446 | train_wall 65734
| epoch 029 | valid on 'valid' subset | loss 4.075 | nll_loss 2.284 | ppl 4.87 | num_updates 6554 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint29.pt (epoch 29 @ 6554 updates) (writing took 260.97370624542236 seconds)
| epoch 030 | loss 3.222 | nll_loss 1.394 | ppl 2.63 | wps 27482 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 6780 | lr 0.000768095 | gnorm 0.327 | clip 0.000 | oom 0.000 | wall 78022 | train_wall 67997
| epoch 030 | valid on 'valid' subset | loss 4.174 | nll_loss 2.402 | ppl 5.29 | num_updates 6780 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint30.pt (epoch 30 @ 6780 updates) (writing took 242.24779176712036 seconds)
| epoch 031 | loss 3.209 | nll_loss 1.379 | ppl 2.60 | wps 27462 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 7006 | lr 0.000755605 | gnorm 0.341 | clip 0.000 | oom 0.000 | wall 80580 | train_wall 70261
| epoch 031 | valid on 'valid' subset | loss 4.085 | nll_loss 2.289 | ppl 4.89 | num_updates 7006 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint31.pt (epoch 31 @ 7006 updates) (writing took 187.35374188423157 seconds)
| epoch 032 | loss 3.196 | nll_loss 1.365 | ppl 2.58 | wps 27468 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 7232 | lr 0.000743705 | gnorm 0.332 | clip 0.000 | oom 0.000 | wall 83083 | train_wall 72525
| epoch 032 | valid on 'valid' subset | loss 4.103 | nll_loss 2.327 | ppl 5.02 | num_updates 7232 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint32.pt (epoch 32 @ 7232 updates) (writing took 251.0625023841858 seconds)
| epoch 033 | loss 3.184 | nll_loss 1.351 | ppl 2.55 | wps 27466 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 7458 | lr 0.00073235 | gnorm 0.326 | clip 0.000 | oom 0.000 | wall 85650 | train_wall 74790
| epoch 033 | valid on 'valid' subset | loss 4.116 | nll_loss 2.331 | ppl 5.03 | num_updates 7458 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint33.pt (epoch 33 @ 7458 updates) (writing took 319.84830474853516 seconds)
| epoch 034 | loss 3.172 | nll_loss 1.338 | ppl 2.53 | wps 27446 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 7684 | lr 0.0007215 | gnorm 0.330 | clip 0.000 | oom 0.000 | wall 88287 | train_wall 77055
| epoch 034 | valid on 'valid' subset | loss 4.025 | nll_loss 2.236 | ppl 4.71 | num_updates 7684 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint34.pt (epoch 34 @ 7684 updates) (writing took 354.9349000453949 seconds)
| epoch 035 | loss 3.160 | nll_loss 1.324 | ppl 2.50 | wps 27411 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 7910 | lr 0.000711118 | gnorm 0.332 | clip 0.000 | oom 0.000 | wall 90962 | train_wall 79324
| epoch 035 | valid on 'valid' subset | loss 4.093 | nll_loss 2.307 | ppl 4.95 | num_updates 7910 | best_loss 3.97833
| saved checkpoint ckpts/ende_xlm_translation/checkpoint35.pt (epoch 35 @ 7910 updates) (writing took 275.89839005470276 seconds)
| epoch 036 | loss 3.149 | nll_loss 1.312 | ppl 2.48 | wps 27457 | ups 0 | wpb 281028.659 | bsz 8438.088 | num_updates 8136 | lr 0.000701172 | gnorm 0.333 | clip 0.000 | oom 0.000 | wall 93555 | train_wall 81588
| epoch 036 | valid on 'valid' subset | loss 4.098 | nll_loss 2.317 | ppl 4.98 | num_updates 8136 | best_loss 3.97833
/home/user/miniconda2/envs/torch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 16 leaked semaphores to clean up at shutdown
  len(cache))
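
A note on the `lr` column in the log above: the logged values look consistent with a linear warmup followed by inverse square root decay. The sketch below reproduces them under assumed settings (initial lr 1e-7, peak lr 1e-3, 4000 warmup updates). These three numbers are inferred from the logged values, not taken from the actual command line, which is not shown in this thread. The function name `inverse_sqrt_lr` is only illustrative.

```python
import math

# Assumed settings, inferred from the lr column of the log; the real
# command-line flags are not shown in this thread.
WARMUP_INIT_LR = 1e-7
PEAK_LR = 1e-3
WARMUP_UPDATES = 4000


def inverse_sqrt_lr(num_updates: int) -> float:
    """Linear warmup to PEAK_LR, then decay proportional to 1/sqrt(num_updates)."""
    if num_updates < WARMUP_UPDATES:
        step = (PEAK_LR - WARMUP_INIT_LR) / WARMUP_UPDATES
        return WARMUP_INIT_LR + num_updates * step
    return PEAK_LR * math.sqrt(WARMUP_UPDATES / num_updates)


# (num_updates, lr) pairs copied from the training log above.
logged = {
    226: 5.65943e-05,
    452: 0.000113089,
    3842: 0.000960504,
    4068: 0.000991607,
    8136: 0.000701172,
}

for n, lr in logged.items():
    print(f"updates={n:5d}  logged={lr:.6e}  computed={inverse_sqrt_lr(n):.6e}")
```

If these assumptions hold, the learning rate peaks near 1e-3 at about update 4000 (epoch 18 in this log) and decays afterwards. The validation loss stops improving steadily from roughly epoch 15 while the training loss keeps dropping.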
XiaoqingNLP commented 5 years ago

Thank you for your kind help. I have solved this problem.