facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Is my training routine normal? #940

Closed jiezhangGt closed 5 years ago

jiezhangGt commented 5 years ago

Hello, I'm a newcomer to NLP. I installed CUDA, cuDNN, NCCL, and PyTorch myself, but I don't know whether my training process is normal. Here is my training log:

| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 115818496
| training on 4 GPUs
| max tokens per GPU = 2048 and max sentences per GPU = 2000
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| epoch 001:     50 / 9199 loss=15.519, nll_loss=15.428, ppl=44083.35, wps=12245, ups=0.6, wpb=21134, bsz=1042, num_updates=47, lr=5.97383e-06, gnorm=4.899, clip=0%, oom=0, loss_scale=8.000, wall=81, train_wall=15
| epoch 001:    100 / 9199 loss=14.736, nll_loss=14.554, ppl=24057.37, wps=21332, ups=1.0, wpb=21317, bsz=1078, num_updates=97, lr=1.22226e-05, gnorm=3.181, clip=0%, oom=0, loss_scale=8.000, wall=97, train_wall=29
| epoch 001:    150 / 9199 loss=14.285, nll_loss=14.053, ppl=17000.59, wps=27956, ups=1.3, wpb=21369, bsz=1083, num_updates=147, lr=1.84713e-05, gnorm=2.565, clip=0%, oom=0, loss_scale=8.000, wall=112, train_wall=43
| epoch 001:    200 / 9199 loss=13.904, nll_loss=13.630, ppl=12680.30, wps=32875, ups=1.5, wpb=21439, bsz=1085, num_updates=197, lr=2.47201e-05, gnorm=2.289, clip=0%, oom=0, loss_scale=8.000, wall=128, train_wall=57
| epoch 001:    250 / 9199 loss=13.551, nll_loss=13.238, ppl=9658.64, wps=36651, ups=1.7, wpb=21473, bsz=1090, num_updates=247, lr=3.09688e-05, gnorm=2.186, clip=0%, oom=0, loss_scale=8.000, wall=145, train_wall=71
| epoch 001:    300 / 9199 loss=13.227, nll_loss=12.872, ppl=7495.85, wps=39749, ups=1.8, wpb=21536, bsz=1093, num_updates=297, lr=3.72176e-05, gnorm=2.070, clip=0%, oom=0, loss_scale=8.000, wall=161, train_wall=86
| epoch 001:    350 / 9199 loss=12.948, nll_loss=12.554, ppl=6015.14, wps=42313, ups=2.0, wpb=21539, bsz=1097, num_updates=347, lr=4.34663e-05, gnorm=1.959, clip=0%, oom=0, loss_scale=8.000, wall=177, train_wall=100
| epoch 001:    400 / 9199 loss=12.706, nll_loss=12.276, ppl=4960.26, wps=44491, ups=2.1, wpb=21546, bsz=1100, num_updates=397, lr=4.97151e-05, gnorm=1.907, clip=0%, oom=0, loss_scale=8.000, wall=192, train_wall=114
| epoch 001:    450 / 9199 loss=12.507, nll_loss=12.045, ppl=4226.99, wps=46169, ups=2.1, wpb=21487, bsz=1093, num_updates=447, lr=5.59638e-05, gnorm=1.836, clip=0%, oom=0, loss_scale=8.000, wall=208, train_wall=128
| epoch 001:    500 / 9199 loss=12.322, nll_loss=11.830, ppl=3640.19, wps=47629, ups=2.2, wpb=21497, bsz=1095, num_updates=497, lr=6.22126e-05, gnorm=1.774, clip=0%, oom=0, loss_scale=8.000, wall=224, train_wall=143
| epoch 001:    550 / 9199 loss=12.156, nll_loss=11.637, ppl=3185.81, wps=48939, ups=2.3, wpb=21487, bsz=1093, num_updates=547, lr=6.84613e-05, gnorm=1.741, clip=0%, oom=0, loss_scale=8.000, wall=240, train_wall=157
| epoch 001:    600 / 9199 loss=12.012, nll_loss=11.469, ppl=2835.67, wps=49972, ups=2.3, wpb=21460, bsz=1090, num_updates=597, lr=7.47101e-05, gnorm=1.707, clip=0%, oom=0, loss_scale=8.000, wall=256, train_wall=171
| epoch 001:    650 / 9199 loss=11.878, nll_loss=11.313, ppl=2543.67, wps=50990, ups=2.4, wpb=21440, bsz=1087, num_updates=647, lr=8.09588e-05, gnorm=1.670, clip=0%, oom=0, loss_scale=8.000, wall=272, train_wall=185
| epoch 001:    700 / 9199 loss=11.756, nll_loss=11.170, ppl=2304.89, wps=51943, ups=2.4, wpb=21428, bsz=1084, num_updates=697, lr=8.72076e-05, gnorm=1.646, clip=0%, oom=0, loss_scale=8.000, wall=288, train_wall=199
| epoch 001:    750 / 9199 loss=11.642, nll_loss=11.038, ppl=2102.54, wps=52763, ups=2.5, wpb=21417, bsz=1081, num_updates=747, lr=9.34563e-05, gnorm=1.623, clip=0%, oom=0, loss_scale=8.000, wall=303, train_wall=214
| epoch 001:    800 / 9199 loss=11.535, nll_loss=10.913, ppl=1928.65, wps=53370, ups=2.5, wpb=21403, bsz=1081, num_updates=797, lr=9.97051e-05, gnorm=1.607, clip=0%, oom=0, loss_scale=8.000, wall=320, train_wall=229
| epoch 001:    850 / 9199 loss=11.434, nll_loss=10.795, ppl=1776.92, wps=54017, ups=2.5, wpb=21407, bsz=1084, num_updates=847, lr=0.000105954, gnorm=1.597, clip=0%, oom=0, loss_scale=8.000, wall=336, train_wall=243
| epoch 001:    900 / 9199 loss=11.339, nll_loss=10.685, ppl=1646.52, wps=54628, ups=2.6, wpb=21397, bsz=1082, num_updates=897, lr=0.000112203, gnorm=1.575, clip=0%, oom=0, loss_scale=8.000, wall=351, train_wall=257
| epoch 001:    950 / 9199 loss=11.249, nll_loss=10.579, ppl=1530.19, wps=55213, ups=2.6, wpb=21403, bsz=1083, num_updates=947, lr=0.000118451, gnorm=1.560, clip=0%, oom=0, loss_scale=8.000, wall=367, train_wall=271
| epoch 001:   1000 / 9199 loss=11.161, nll_loss=10.477, ppl=1425.51, wps=55661, ups=2.6, wpb=21414, bsz=1083, num_updates=997, lr=0.0001247, gnorm=1.538, clip=0%, oom=0, loss_scale=8.000, wall=384, train_wall=286
| epoch 001 | valid on 'valid' subset | valid_loss 8.36527 | valid_nll_loss 7.22753 | valid_ppl 149.87 | num_updates 1000
| epoch 001:   1050 / 9199 loss=11.080, nll_loss=10.384, ppl=1335.81, wps=55444, ups=2.6, wpb=21405, bsz=1079, num_updates=1047, lr=0.000130949, gnorm=1.521, clip=0%, oom=0, loss_scale=8.000, wall=404, train_wall=300
| epoch 001:   1100 / 9199 loss=11.001, nll_loss=10.291, ppl=1252.64, wps=55914, ups=2.6, wpb=21418, bsz=1079, num_updates=1097, lr=0.000137198, gnorm=1.499, clip=0%, oom=0, loss_scale=8.000, wall=420, train_wall=315
| epoch 001:   1150 / 9199 loss=10.923, nll_loss=10.201, ppl=1176.73, wps=56419, ups=2.6, wpb=21443, bsz=1080, num_updates=1147, lr=0.000143446, gnorm=1.484, clip=0%, oom=0, loss_scale=8.000, wall=436, train_wall=329

I found that wps is about 50k. Is this normal? I am worried that there are problems with my cuDNN or NCCL installation that are slowing down training.

myleott commented 5 years ago

What kind of GPU? I see you are using 4, are they all on the same machine? What version of PyTorch, NCCL, cuDNN, etc?
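
(For reference, these can be checked from the training environment with standard PyTorch calls; a minimal sketch:)

python -c "import torch; print('torch:', torch.__version__)"
python -c "import torch; print('cuda:', torch.version.cuda)"
python -c "import torch; print('cudnn:', torch.backends.cudnn.version())"
python -c "import torch; print('nccl:', torch.cuda.nccl.version())"
nvidia-smi --query-gpu=name,memory.total --format=csv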

myleott commented 5 years ago

Also, please try removing the --fp16 flag and see what the wps is. It should be 2-3x slower. If it is instead faster, then maybe something is wrong.
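
(If it helps to compare runs, the wps figures can be pulled straight out of the training log; a small sketch, assuming the output is redirected to a file named log as in the command later in this thread:)

grep -o 'wps=[0-9]*' log | tail -n 5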

jiezhangGt commented 5 years ago

What kind of GPU? I see you are using 4, are they all on the same machine? What version of PyTorch, NCCL, cuDNN, etc?

The GPUs are Tesla V100s, PyTorch is 1.1.0, NCCL is 2.x, CUDA is 10.0, and cuDNN is 7.5.x. All 4 GPUs are on the same machine.

myleott commented 5 years ago

Hmm, I get much faster speeds with a similar setup. At least 160k wps on 4 x V100 with --fp16 and more than ~73k wps without --fp16.

This is with the transformer_wmt_en_de architecture and a vocabulary size of 30K, so I have only 61M parameters compared to your 116M. Actually, why do you have 116M parameters? Can you share the full command you ran?

Also what is your performance without --fp16?

myleott commented 5 years ago

Oh also you should increase --max-tokens to make sure you're using all available GPU memory. You should be able to easily use 3500 or 4000.
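
(For example, a sketch; the exact ceiling depends on GPU memory and sequence lengths:)

# Raise the per-GPU token budget and keep every other flag identical;
# 3500-4000 should fit on a 16 GB V100 for this model size.
python train.py ./data/train_data --max-tokens 4000   # ...plus the original flags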

jiezhangGt commented 5 years ago

Hmm, I get much faster speeds with a similar setup. At least 160k wps on 4 x V100 with --fp16 and more than ~73k wps without --fp16.

This is with the transformer_wmt_en_de architecture and a vocabulary size of 30K, so I have only 61M parameters compared to your 116M. Actually, why do you have 116M parameters? Can you share the full command you ran?

Also what is your performance without --fp16?

Sorry for the late reply. I just reconfigured the parameters and reran the experiment. Here is my full command:

#!/usr/bin/bash

BIN=0_code/fairseq0.6.0/fairseq-0.6.0

saveDir=./checkpoints
ARCH='transformer_wmt_en_de_big'

export CUDA_VISIBLE_DEVICES=0,1,2,3
python $BIN/train.py ./data/train_data \
    --save-dir $saveDir \
    --source-lang src --target-lang tgt \
    --arch $ARCH \
    --max-tokens 3200 --max-sentences 2000 --max-update 500000 \
    --save-interval-updates 600 --log-interval 50 --update-freq 12 \
    --lr-scheduler 'inverse_sqrt' --learning-rate 0.0008 --min-lr 1e-10 \
    --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --criterion 'label_smoothed_cross_entropy' --label-smoothing 0.1 \
    --optimizer 'adam' --adam-betas '(0.9, 0.997)' --fp16 \
    >log 2>&1 &
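
(For context: with these flags, each update processes at most max_tokens × num_GPUs × update_freq = 3200 × 4 × 12 = 153,600 target tokens; the wpb of roughly 105k in the log below is consistent with that ceiling once partially filled batches are accounted for.)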

The corresponding log is shown below:

| distributed init (rank 3): tcp://localhost:18255
| distributed init (rank 2): tcp://localhost:18255
| distributed init (rank 0): tcp://localhost:18255
| distributed init (rank 1): tcp://localhost:18255
Namespace(adam_betas='(0.9, 0.997)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de_big', attention_dropout=0.1, bucket_cap_mb=150, clip_norm=25, criterion='label_smoothed_cross_entropy', data=['./data/train_data'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:18255', distributed_port=18256, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, fp16_init_scale=128, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=50, lr=[0.0008], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=2000, max_sentences_valid=2000, max_source_positions=1024, max_target_positions=1024, max_tokens=3200, max_update=500000, min_loss_scale=0.0001, min_lr=1e-10, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./checkpoints', save_interval=1, save_interval_updates=600, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', train_subset='train', update_freq=[12], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 319717376
| training on 4 GPUs
| max tokens per GPU = 3200 and max sentences per GPU = 2000
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| WARNING: overflow detected, setting loss scale to: 4.0
| epoch 001:     50 / 1891 loss=14.725, nll_loss=14.537, ppl=23770.85, wps=30542, ups=0.3, wpb=104240, bsz=5162, num_updates=46, lr=9.29885e-06, gnorm=3.779, clip=0%, oom=0, loss_scale=4.000, wall=157, train_wall=85
| epoch 001:    100 / 1891 loss=13.846, nll_loss=13.556, ppl=12040.66, wps=40539, ups=0.4, wpb=104634, bsz=5265, num_updates=96, lr=1.92976e-05, gnorm=2.327, clip=0%, oom=0, loss_scale=4.000, wall=248, train_wall=169
| WARNING: overflow detected, setting loss scale to: 2.0
| epoch 001:    150 / 1891 loss=13.259, nll_loss=12.897, ppl=7625.64, wps=44802, ups=0.4, wpb=104511, bsz=5286, num_updates=145, lr=2.90964e-05, gnorm=2.039, clip=0%, oom=0, loss_scale=2.000, wall=338, train_wall=253
| epoch 001:    200 / 1891 loss=12.794, nll_loss=12.365, ppl=5275.67, wps=47660, ups=0.5, wpb=104699, bsz=5296, num_updates=195, lr=3.90951e-05, gnorm=1.761, clip=0%, oom=0, loss_scale=2.000, wall=428, train_wall=336
| epoch 001:    250 / 1891 loss=12.446, nll_loss=11.960, ppl=3984.58, wps=49475, ups=0.5, wpb=104921, bsz=5333, num_updates=245, lr=4.90939e-05, gnorm=1.688, clip=0%, oom=0, loss_scale=2.000, wall=520, train_wall=420
| epoch 001:    300 / 1891 loss=12.168, nll_loss=11.636, ppl=3182.21, wps=50758, ups=0.5, wpb=105043, bsz=5345, num_updates=295, lr=5.90926e-05, gnorm=1.665, clip=0%, oom=0, loss_scale=2.000, wall=610, train_wall=504
| epoch 001:    350 / 1891 loss=11.925, nll_loss=11.353, ppl=2615.31, wps=51693, ups=0.5, wpb=104987, bsz=5334, num_updates=345, lr=6.90914e-05, gnorm=1.586, clip=0%, oom=0, loss_scale=2.000, wall=701, train_wall=587
| epoch 001:    400 / 1891 loss=11.712, nll_loss=11.103, ppl=2200.13, wps=52458, ups=0.5, wpb=105043, bsz=5332, num_updates=395, lr=7.90901e-05, gnorm=1.552, clip=0%, oom=0, loss_scale=2.000, wall=791, train_wall=671
| epoch 001:    450 / 1891 loss=11.521, nll_loss=10.881, ppl=1886.05, wps=52922, ups=0.5, wpb=104928, bsz=5325, num_updates=445, lr=8.90889e-05, gnorm=1.543, clip=0%, oom=0, loss_scale=2.000, wall=882, train_wall=756
| epoch 001:    500 / 1891 loss=11.346, nll_loss=10.678, ppl=1637.83, wps=53351, ups=0.5, wpb=104872, bsz=5322, num_updates=495, lr=9.90876e-05, gnorm=1.507, clip=0%, oom=0, loss_scale=2.000, wall=973, train_wall=840
| epoch 001:    550 / 1891 loss=11.186, nll_loss=10.492, ppl=1440.01, wps=53627, ups=0.5, wpb=104800, bsz=5311, num_updates=545, lr=0.000109086, gnorm=1.469, clip=0%, oom=0, loss_scale=2.000, wall=1065, train_wall=925
| epoch 001:    600 / 1891 loss=11.039, nll_loss=10.320, ppl=1278.29, wps=53936, ups=0.5, wpb=104812, bsz=5310, num_updates=595, lr=0.000119085, gnorm=1.445, clip=0%, oom=0, loss_scale=2.000, wall=1156, train_wall=1010
| epoch 001 | valid on 'valid' subset | valid_loss 8.11425 | valid_nll_loss 6.8834 | valid_ppl 118.06 | num_updates 600
| epoch 001:    650 / 1891 loss=10.902, nll_loss=10.160, ppl=1144.39, wps=53573, ups=0.5, wpb=104904, bsz=5328, num_updates=645, lr=0.000129084, gnorm=1.419, clip=0%, oom=0, loss_scale=2.000, wall=1263, train_wall=1094
| epoch 001:    700 / 1891 loss=10.776, nll_loss=10.015, ppl=1034.42, wps=53814, ups=0.5, wpb=104859, bsz=5329, num_updates=695, lr=0.000139083, gnorm=1.390, clip=0%, oom=0, loss_scale=2.000, wall=1354, train_wall=1178
| epoch 001:    750 / 1891 loss=10.660, nll_loss=9.878, ppl=941.27, wps=54048, ups=0.5, wpb=104865, bsz=5337, num_updates=745, lr=0.000149081, gnorm=1.363, clip=0%, oom=0, loss_scale=2.000, wall=1445, train_wall=1263
| epoch 001:    800 / 1891 loss=10.553, nll_loss=9.754, ppl=863.57, wps=54254, ups=0.5, wpb=104826, bsz=5328, num_updates=795, lr=0.00015908, gnorm=1.346, clip=0%, oom=0, loss_scale=2.000, wall=1536, train_wall=1347
| epoch 001:    850 / 1891 loss=10.453, nll_loss=9.637, ppl=796.42, wps=54409, ups=0.5, wpb=104781, bsz=5324, num_updates=845, lr=0.000169079, gnorm=1.328, clip=0%, oom=0, loss_scale=2.000, wall=1627, train_wall=1432
| epoch 001:    900 / 1891 loss=10.358, nll_loss=9.526, ppl=737.49, wps=54558, ups=0.5, wpb=104751, bsz=5324, num_updates=895, lr=0.000179078, gnorm=1.311, clip=0%, oom=0, loss_scale=2.000, wall=1718, train_wall=1516
| epoch 001:    950 / 1891 loss=10.266, nll_loss=9.420, ppl=684.83, wps=54705, ups=0.5, wpb=104724, bsz=5318, num_updates=945, lr=0.000189076, gnorm=1.294, clip=0%, oom=0, loss_scale=2.000, wall=1809, train_wall=1600
| epoch 001:   1000 / 1891 loss=10.176, nll_loss=9.315, ppl=636.99, wps=54872, ups=0.5, wpb=104771, bsz=5320, num_updates=995, lr=0.000199075, gnorm=1.282, clip=0%, oom=0, loss_scale=2.000, wall=1900, train_wall=1685
| epoch 001:   1050 / 1891 loss=10.092, nll_loss=9.217, ppl=595.13, wps=54999, ups=0.5, wpb=104755, bsz=5315, num_updates=1045, lr=0.000209074, gnorm=1.266, clip=0%, oom=0, loss_scale=2.000, wall=1990, train_wall=1769
| epoch 001:   1100 / 1891 loss=10.009, nll_loss=9.120, ppl=556.45, wps=55135, ups=0.5, wpb=104752, bsz=5317, num_updates=1095, lr=0.000219073, gnorm=1.248, clip=0%, oom=0, loss_scale=2.000, wall=2080, train_wall=1852
| epoch 001:   1150 / 1891 loss=9.932, nll_loss=9.030, ppl=522.84, wps=55246, ups=0.5, wpb=104720, bsz=5312, num_updates=1145, lr=0.000229071, gnorm=1.239, clip=0%, oom=0, loss_scale=2.000, wall=2170, train_wall=1936
| epoch 001:   1200 / 1891 loss=9.854, nll_loss=8.940, ppl=490.98, wps=55350, ups=0.5, wpb=104694, bsz=5314, num_updates=1195, lr=0.00023907, gnorm=1.223, clip=0%, oom=0, loss_scale=2.000, wall=2260, train_wall=2019
| epoch 001 | valid on 'valid' subset | valid_loss 6.13031 | valid_nll_loss 4.55506 | valid_ppl 23.51 | num_updates 1200 | best 6.13031
| epoch 001:   1250 / 1891 loss=9.777, nll_loss=8.850, ppl=461.39, wps=54739, ups=0.5, wpb=104729, bsz=5311, num_updates=1245, lr=0.000249069, gnorm=1.208, clip=0%, oom=0, loss_scale=2.000, wall=2382, train_wall=2102
| epoch 001:   1300 / 1891 loss=9.703, nll_loss=8.765, ppl=434.95, wps=54855, ups=0.5, wpb=104713, bsz=5309, num_updates=1295, lr=0.000259068, gnorm=1.195, clip=0%, oom=0, loss_scale=2.000, wall=2472, train_wall=2185
| epoch 001:   1350 / 1891 loss=9.632, nll_loss=8.682, ppl=410.76, wps=54968, ups=0.5, wpb=104688, bsz=5306, num_updates=1345, lr=0.000269066, gnorm=1.182, clip=0%, oom=0, loss_scale=2.000, wall=2562, train_wall=2268
| epoch 001:   1400 / 1891 loss=9.561, nll_loss=8.600, ppl=387.90, wps=55078, ups=0.5, wpb=104667, bsz=5303, num_updates=1395, lr=0.000279065, gnorm=1.168, clip=0%, oom=0, loss_scale=2.000, wall=2651, train_wall=2351
| epoch 001:   1450 / 1891 loss=9.492, nll_loss=8.519, ppl=366.74, wps=55184, ups=0.5, wpb=104681, bsz=5300, num_updates=1445, lr=0.000289064, gnorm=1.152, clip=0%, oom=0, loss_scale=2.000, wall=2741, train_wall=2434
| epoch 001:   1500 / 1891 loss=9.424, nll_loss=8.440, ppl=347.24, wps=55289, ups=0.5, wpb=104675, bsz=5294, num_updates=1495, lr=0.000299063, gnorm=1.139, clip=0%, oom=0, loss_scale=2.000, wall=2830, train_wall=2517
| epoch 001:   1550 / 1891 loss=9.357, nll_loss=8.362, ppl=329.02, wps=55363, ups=0.5, wpb=104670, bsz=5295, num_updates=1545, lr=0.000309061, gnorm=1.126, clip=0%, oom=0, loss_scale=2.000, wall=2921, train_wall=2601
| epoch 001:   1600 / 1891 loss=9.291, nll_loss=8.286, ppl=312.02, wps=55436, ups=0.5, wpb=104644, bsz=5296, num_updates=1595, lr=0.00031906, gnorm=1.114, clip=0%, oom=0, loss_scale=2.000, wall=3011, train_wall=2684
| epoch 001:   1650 / 1891 loss=9.225, nll_loss=8.209, ppl=295.85, wps=55519, ups=0.5, wpb=104655, bsz=5295, num_updates=1645, lr=0.000329059, gnorm=1.101, clip=0%, oom=0, loss_scale=2.000, wall=3101, train_wall=2767

I have also tried removing the --fp16 flag; here is the log:

| distributed init (rank 2): tcp://localhost:12653
| distributed init (rank 0): tcp://localhost:12653
| distributed init (rank 1): tcp://localhost:12653
| distributed init (rank 3): tcp://localhost:12653
Namespace(adam_betas='(0.9, 0.997)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de_big', attention_dropout=0.1, bucket_cap_mb=150, clip_norm=25, criterion='label_smoothed_cross_entropy', data=['./data/train_data'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:12653', distributed_port=12654, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=False, fp16_init_scale=128, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=50, lr=[0.0008], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=2000, max_sentences_valid=2000, max_source_positions=1024, max_target_positions=1024, max_tokens=3200, max_update=500000, min_loss_scale=0.0001, min_lr=1e-10, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./checkpoints', save_interval=1, save_interval_updates=600, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', train_subset='train', update_freq=[12], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 319717376
| training on 4 GPUs
| max tokens per GPU = 3200 and max sentences per GPU = 2000
| NOTICE: your device may support faster training with --fp16
| epoch 001:     50 / 1891 loss=14.582, nll_loss=14.378, ppl=21286.60, wps=20621, ups=0.2, wpb=104342, bsz=5187, num_updates=51, lr=1.02987e-05, gnorm=3.509, clip=0%, oom=0, wall=324, train_wall=251
| epoch 001:    100 / 1891 loss=13.768, nll_loss=13.468, ppl=11333.83, wps=20624, ups=0.2, wpb=104666, bsz=5273, num_updates=101, lr=2.02975e-05, gnorm=2.250, clip=0%, oom=0, wall=579, train_wall=499
| epoch 001:    150 / 1891 loss=13.181, nll_loss=12.809, ppl=7176.27, wps=20597, ups=0.2, wpb=104571, bsz=5298, num_updates=151, lr=3.02962e-05, gnorm=1.971, clip=0%, oom=0, wall=833, train_wall=746
| epoch 001:    200 / 1891 loss=12.734, nll_loss=12.295, ppl=5026.19, wps=20624, ups=0.2, wpb=104738, bsz=5305, num_updates=201, lr=4.0295e-05, gnorm=1.803, clip=0%, oom=0, wall=1087, train_wall=993
| epoch 001:    250 / 1891 loss=12.396, nll_loss=11.901, ppl=3825.51, wps=20636, ups=0.2, wpb=104947, bsz=5339, num_updates=251, lr=5.02937e-05, gnorm=1.784, clip=0%, oom=0, wall=1343, train_wall=1241
| epoch 001:    300 / 1891 loss=12.121, nll_loss=11.581, ppl=3063.27, wps=20655, ups=0.2, wpb=105062, bsz=5350, num_updates=301, lr=6.02925e-05, gnorm=1.708, clip=0%, oom=0, wall=1597, train_wall=1489
| epoch 001:    350 / 1891 loss=11.881, nll_loss=11.302, ppl=2524.68, wps=20650, ups=0.2, wpb=105005, bsz=5339, num_updates=351, lr=7.02912e-05, gnorm=1.629, clip=0%, oom=0, wall=1851, train_wall=1735
| epoch 001:    400 / 1891 loss=11.669, nll_loss=11.055, ppl=2127.60, wps=20665, ups=0.2, wpb=105057, bsz=5336, num_updates=401, lr=8.029e-05, gnorm=1.608, clip=0%, oom=0, wall=2105, train_wall=1982
| epoch 001:    450 / 1891 loss=11.480, nll_loss=10.834, ppl=1825.88, wps=20640, ups=0.2, wpb=104942, bsz=5329, num_updates=451, lr=9.02887e-05, gnorm=1.564, clip=0%, oom=0, wall=2359, train_wall=2230
| epoch 001:    500 / 1891 loss=11.308, nll_loss=10.634, ppl=1589.20, wps=20630, ups=0.2, wpb=104885, bsz=5325, num_updates=501, lr=0.000100287, gnorm=1.540, clip=0%, oom=0, wall=2613, train_wall=2477
| epoch 001:    550 / 1891 loss=11.151, nll_loss=10.451, ppl=1400.25, wps=20620, ups=0.2, wpb=104813, bsz=5314, num_updates=551, lr=0.000110286, gnorm=1.501, clip=0%, oom=0, wall=2867, train_wall=2724
| epoch 001 | valid on 'valid' subset | valid_loss 8.07049 | valid_nll_loss 6.84806 | valid_ppl 115.21 | num_updates 600

What's more, the GPUs are Tesla V100-SXM2 with 16 GB of memory each.

Judging from the wps, I'm still worried about my training speed.

myleott commented 5 years ago

Your latest comment uses a different architecture than your first one: you changed from transformer_wmt_en_de to transformer_wmt_en_de_big. Different architectures train at different speeds, so it'll be helpful to keep that consistent.

Another thing I forgot to mention is that you should install apex: https://github.com/NVIDIA/apex/. Make sure to install Apex with the CUDA and C++ extensions. Fairseq will automatically pick them up, and it should improve speed by a decent amount.
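
(For reference, the apex README at the time recommended a source install along these lines; a sketch, so check the README for the current procedure:)

git clone https://github.com/NVIDIA/apex
cd apex
# --cpp_ext and --cuda_ext build the fused C++/CUDA kernels that fairseq can pick up
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./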

Also, can you upgrade to a newer version of fairseq? I see from the directory name that you're on 0.6.0.
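
(A newer fairseq can be installed from source per the repository README; a sketch:)

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .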

jiezhangGt commented 5 years ago

Your latest comment uses a different architecture than your first one: you changed from transformer_wmt_en_de to transformer_wmt_en_de_big. Different architectures train at different speeds, so it'll be helpful to keep that consistent.

Another thing I forgot to mention is that you should install apex: https://github.com/NVIDIA/apex/. Make sure to install Apex with the CUDA and C++ extensions. Fairseq will automatically pick them up, and it should improve speed by a decent amount.

Also, can you upgrade to a newer version of fairseq? I see from the directory name that you're on 0.6.0.

Thank you very much for your advice. Following your suggestions, my problem has been completely solved. Thank you once again!