Closed: jiezhangGt closed this issue 5 years ago
What kind of GPU? I see you are using 4, are they all on the same machine? What version of PyTorch, NCCL, cuDNN, etc?
Also, please try removing the `--fp16` flag and see what the wps is. It should be 2-3x slower. If it is instead faster, then maybe something is wrong.
The GPU is a Tesla V100, PyTorch is 1.1.0, NCCL is 2.x, CUDA is 10.0, and cuDNN is 7.5.x. The 4 GPUs are all on the same machine.
Hmm, I get much faster speeds with a similar setup. At least 160k wps on 4 x V100 with `--fp16` and more than ~73k wps without `--fp16`.
This is with the transformer_wmt_en_de architecture and a vocabulary size of 30K, so I have only 61M parameters compared to your 116M. Actually, why do you have 116M parameters? Can you share the full command you ran?
Also, what is your performance without `--fp16`?
Oh, also you should increase `--max-tokens` to make sure you're using all available GPU memory. You should be able to easily use 3500 or 4000.
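(Editor's note: as a rough illustration of how these flags interact, the number of tokens per optimizer step scales with `--max-tokens`, the GPU count, and `--update-freq`. This is a sketch of the upper bound only; the actual `wpb` in the logs is lower because batches rarely pack to the token limit.)

```python
# Rough upper bound on tokens processed per optimizer update.
# Actual wpb (~104k in the logs below) is lower due to imperfect packing.
def max_tokens_per_update(max_tokens: int, num_gpus: int, update_freq: int) -> int:
    return max_tokens * num_gpus * update_freq

# Settings from this thread: --max-tokens 3200, 4 GPUs, --update-freq 12
print(max_tokens_per_update(3200, 4, 12))  # 153600

# Raising --max-tokens to 4000 lifts the cap by 25%:
print(max_tokens_per_update(4000, 4, 12))  # 192000
```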
Sorry for the late reply. I just reconfigured the parameters and reran the experiment. Here is my full command:
#!/usr/bin/bash
BIN=0_code/fairseq0.6.0/fairseq-0.6.0
saveDir=./checkpoints
ARCH='transformer_wmt_en_de_big'
export CUDA_VISIBLE_DEVICES=0,1,2,3
python $BIN/train.py ./data/train_data --save-dir $saveDir \
--source-lang src --target-lang tgt \
--arch $ARCH \
--max-tokens 3200 --max-sentences 2000 --max-update 500000 --save-interval-updates 600 --log-interval 50 --update-freq 12 \
--lr-scheduler 'inverse_sqrt' --learning-rate 0.0008 --min-lr 1e-10 \
--warmup-updates 4000 --warmup-init-lr 1e-7 \
--criterion 'label_smoothed_cross_entropy' --label-smoothing 0.1 \
--optimizer 'adam' --adam-betas '(0.9, 0.997)' --fp16 >log 2>&1 &
And the corresponding log is shown below:
| distributed init (rank 3): tcp://localhost:18255
| distributed init (rank 2): tcp://localhost:18255
| distributed init (rank 0): tcp://localhost:18255
| distributed init (rank 1): tcp://localhost:18255
Namespace(adam_betas='(0.9, 0.997)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de_big', attention_dropout=0.1, bucket_cap_mb=150, clip_norm=25, criterion='label_smoothed_cross_entropy', data=['./data/train_data'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:18255', distributed_port=18256, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, fp16_init_scale=128, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=50, lr=[0.0008], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=2000, max_sentences_valid=2000, max_source_positions=1024, max_target_positions=1024, max_tokens=3200, max_update=500000, min_loss_scale=0.0001, min_lr=1e-10, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./checkpoints', save_interval=1, save_interval_updates=600, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', train_subset='train', update_freq=[12], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, 
warmup_updates=4000, weight_decay=0.0)
| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 319717376
| training on 4 GPUs
| max tokens per GPU = 3200 and max sentences per GPU = 2000
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| WARNING: overflow detected, setting loss scale to: 4.0
| epoch 001: 50 / 1891 loss=14.725, nll_loss=14.537, ppl=23770.85, wps=30542, ups=0.3, wpb=104240, bsz=5162, num_updates=46, lr=9.29885e-06, gnorm=3.779, clip=0%, oom=0, loss_scale=4.000, wall=157, train_wall=85
| epoch 001: 100 / 1891 loss=13.846, nll_loss=13.556, ppl=12040.66, wps=40539, ups=0.4, wpb=104634, bsz=5265, num_updates=96, lr=1.92976e-05, gnorm=2.327, clip=0%, oom=0, loss_scale=4.000, wall=248, train_wall=169
| WARNING: overflow detected, setting loss scale to: 2.0
| epoch 001: 150 / 1891 loss=13.259, nll_loss=12.897, ppl=7625.64, wps=44802, ups=0.4, wpb=104511, bsz=5286, num_updates=145, lr=2.90964e-05, gnorm=2.039, clip=0%, oom=0, loss_scale=2.000, wall=338, train_wall=253
| epoch 001: 200 / 1891 loss=12.794, nll_loss=12.365, ppl=5275.67, wps=47660, ups=0.5, wpb=104699, bsz=5296, num_updates=195, lr=3.90951e-05, gnorm=1.761, clip=0%, oom=0, loss_scale=2.000, wall=428, train_wall=336
| epoch 001: 250 / 1891 loss=12.446, nll_loss=11.960, ppl=3984.58, wps=49475, ups=0.5, wpb=104921, bsz=5333, num_updates=245, lr=4.90939e-05, gnorm=1.688, clip=0%, oom=0, loss_scale=2.000, wall=520, train_wall=420
| epoch 001: 300 / 1891 loss=12.168, nll_loss=11.636, ppl=3182.21, wps=50758, ups=0.5, wpb=105043, bsz=5345, num_updates=295, lr=5.90926e-05, gnorm=1.665, clip=0%, oom=0, loss_scale=2.000, wall=610, train_wall=504
| epoch 001: 350 / 1891 loss=11.925, nll_loss=11.353, ppl=2615.31, wps=51693, ups=0.5, wpb=104987, bsz=5334, num_updates=345, lr=6.90914e-05, gnorm=1.586, clip=0%, oom=0, loss_scale=2.000, wall=701, train_wall=587
| epoch 001: 400 / 1891 loss=11.712, nll_loss=11.103, ppl=2200.13, wps=52458, ups=0.5, wpb=105043, bsz=5332, num_updates=395, lr=7.90901e-05, gnorm=1.552, clip=0%, oom=0, loss_scale=2.000, wall=791, train_wall=671
| epoch 001: 450 / 1891 loss=11.521, nll_loss=10.881, ppl=1886.05, wps=52922, ups=0.5, wpb=104928, bsz=5325, num_updates=445, lr=8.90889e-05, gnorm=1.543, clip=0%, oom=0, loss_scale=2.000, wall=882, train_wall=756
| epoch 001: 500 / 1891 loss=11.346, nll_loss=10.678, ppl=1637.83, wps=53351, ups=0.5, wpb=104872, bsz=5322, num_updates=495, lr=9.90876e-05, gnorm=1.507, clip=0%, oom=0, loss_scale=2.000, wall=973, train_wall=840
| epoch 001: 550 / 1891 loss=11.186, nll_loss=10.492, ppl=1440.01, wps=53627, ups=0.5, wpb=104800, bsz=5311, num_updates=545, lr=0.000109086, gnorm=1.469, clip=0%, oom=0, loss_scale=2.000, wall=1065, train_wall=925
| epoch 001: 600 / 1891 loss=11.039, nll_loss=10.320, ppl=1278.29, wps=53936, ups=0.5, wpb=104812, bsz=5310, num_updates=595, lr=0.000119085, gnorm=1.445, clip=0%, oom=0, loss_scale=2.000, wall=1156, train_wall=1010
| epoch 001 | valid on 'valid' subset | valid_loss 8.11425 | valid_nll_loss 6.8834 | valid_ppl 118.06 | num_updates 600
| epoch 001: 650 / 1891 loss=10.902, nll_loss=10.160, ppl=1144.39, wps=53573, ups=0.5, wpb=104904, bsz=5328, num_updates=645, lr=0.000129084, gnorm=1.419, clip=0%, oom=0, loss_scale=2.000, wall=1263, train_wall=1094
| epoch 001: 700 / 1891 loss=10.776, nll_loss=10.015, ppl=1034.42, wps=53814, ups=0.5, wpb=104859, bsz=5329, num_updates=695, lr=0.000139083, gnorm=1.390, clip=0%, oom=0, loss_scale=2.000, wall=1354, train_wall=1178
| epoch 001: 750 / 1891 loss=10.660, nll_loss=9.878, ppl=941.27, wps=54048, ups=0.5, wpb=104865, bsz=5337, num_updates=745, lr=0.000149081, gnorm=1.363, clip=0%, oom=0, loss_scale=2.000, wall=1445, train_wall=1263
| epoch 001: 800 / 1891 loss=10.553, nll_loss=9.754, ppl=863.57, wps=54254, ups=0.5, wpb=104826, bsz=5328, num_updates=795, lr=0.00015908, gnorm=1.346, clip=0%, oom=0, loss_scale=2.000, wall=1536, train_wall=1347
| epoch 001: 850 / 1891 loss=10.453, nll_loss=9.637, ppl=796.42, wps=54409, ups=0.5, wpb=104781, bsz=5324, num_updates=845, lr=0.000169079, gnorm=1.328, clip=0%, oom=0, loss_scale=2.000, wall=1627, train_wall=1432
| epoch 001: 900 / 1891 loss=10.358, nll_loss=9.526, ppl=737.49, wps=54558, ups=0.5, wpb=104751, bsz=5324, num_updates=895, lr=0.000179078, gnorm=1.311, clip=0%, oom=0, loss_scale=2.000, wall=1718, train_wall=1516
| epoch 001: 950 / 1891 loss=10.266, nll_loss=9.420, ppl=684.83, wps=54705, ups=0.5, wpb=104724, bsz=5318, num_updates=945, lr=0.000189076, gnorm=1.294, clip=0%, oom=0, loss_scale=2.000, wall=1809, train_wall=1600
| epoch 001: 1000 / 1891 loss=10.176, nll_loss=9.315, ppl=636.99, wps=54872, ups=0.5, wpb=104771, bsz=5320, num_updates=995, lr=0.000199075, gnorm=1.282, clip=0%, oom=0, loss_scale=2.000, wall=1900, train_wall=1685
| epoch 001: 1050 / 1891 loss=10.092, nll_loss=9.217, ppl=595.13, wps=54999, ups=0.5, wpb=104755, bsz=5315, num_updates=1045, lr=0.000209074, gnorm=1.266, clip=0%, oom=0, loss_scale=2.000, wall=1990, train_wall=1769
| epoch 001: 1100 / 1891 loss=10.009, nll_loss=9.120, ppl=556.45, wps=55135, ups=0.5, wpb=104752, bsz=5317, num_updates=1095, lr=0.000219073, gnorm=1.248, clip=0%, oom=0, loss_scale=2.000, wall=2080, train_wall=1852
| epoch 001: 1150 / 1891 loss=9.932, nll_loss=9.030, ppl=522.84, wps=55246, ups=0.5, wpb=104720, bsz=5312, num_updates=1145, lr=0.000229071, gnorm=1.239, clip=0%, oom=0, loss_scale=2.000, wall=2170, train_wall=1936
| epoch 001: 1200 / 1891 loss=9.854, nll_loss=8.940, ppl=490.98, wps=55350, ups=0.5, wpb=104694, bsz=5314, num_updates=1195, lr=0.00023907, gnorm=1.223, clip=0%, oom=0, loss_scale=2.000, wall=2260, train_wall=2019
| epoch 001 | valid on 'valid' subset | valid_loss 6.13031 | valid_nll_loss 4.55506 | valid_ppl 23.51 | num_updates 1200 | best 6.13031
| epoch 001: 1250 / 1891 loss=9.777, nll_loss=8.850, ppl=461.39, wps=54739, ups=0.5, wpb=104729, bsz=5311, num_updates=1245, lr=0.000249069, gnorm=1.208, clip=0%, oom=0, loss_scale=2.000, wall=2382, train_wall=2102
| epoch 001: 1300 / 1891 loss=9.703, nll_loss=8.765, ppl=434.95, wps=54855, ups=0.5, wpb=104713, bsz=5309, num_updates=1295, lr=0.000259068, gnorm=1.195, clip=0%, oom=0, loss_scale=2.000, wall=2472, train_wall=2185
| epoch 001: 1350 / 1891 loss=9.632, nll_loss=8.682, ppl=410.76, wps=54968, ups=0.5, wpb=104688, bsz=5306, num_updates=1345, lr=0.000269066, gnorm=1.182, clip=0%, oom=0, loss_scale=2.000, wall=2562, train_wall=2268
| epoch 001: 1400 / 1891 loss=9.561, nll_loss=8.600, ppl=387.90, wps=55078, ups=0.5, wpb=104667, bsz=5303, num_updates=1395, lr=0.000279065, gnorm=1.168, clip=0%, oom=0, loss_scale=2.000, wall=2651, train_wall=2351
| epoch 001: 1450 / 1891 loss=9.492, nll_loss=8.519, ppl=366.74, wps=55184, ups=0.5, wpb=104681, bsz=5300, num_updates=1445, lr=0.000289064, gnorm=1.152, clip=0%, oom=0, loss_scale=2.000, wall=2741, train_wall=2434
| epoch 001: 1500 / 1891 loss=9.424, nll_loss=8.440, ppl=347.24, wps=55289, ups=0.5, wpb=104675, bsz=5294, num_updates=1495, lr=0.000299063, gnorm=1.139, clip=0%, oom=0, loss_scale=2.000, wall=2830, train_wall=2517
| epoch 001: 1550 / 1891 loss=9.357, nll_loss=8.362, ppl=329.02, wps=55363, ups=0.5, wpb=104670, bsz=5295, num_updates=1545, lr=0.000309061, gnorm=1.126, clip=0%, oom=0, loss_scale=2.000, wall=2921, train_wall=2601
| epoch 001: 1600 / 1891 loss=9.291, nll_loss=8.286, ppl=312.02, wps=55436, ups=0.5, wpb=104644, bsz=5296, num_updates=1595, lr=0.00031906, gnorm=1.114, clip=0%, oom=0, loss_scale=2.000, wall=3011, train_wall=2684
| epoch 001: 1650 / 1891 loss=9.225, nll_loss=8.209, ppl=295.85, wps=55519, ups=0.5, wpb=104655, bsz=5295, num_updates=1645, lr=0.000329059, gnorm=1.101, clip=0%, oom=0, loss_scale=2.000, wall=3101, train_wall=2767
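(Editor's note: the "overflow detected" warnings in the log above are expected behavior with `--fp16`. Fairseq uses dynamic loss scaling and halves the scale whenever gradients overflow in half precision. A minimal sketch of the halving logic, not fairseq's actual implementation, which also grows the scale again after a run of stable updates:)

```python
def update_loss_scale(scale: float, overflow: bool, min_scale: float = 0.0001) -> float:
    """Halve the loss scale on overflow, clamped at min_scale.

    Mirrors the sequence 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2
    visible in the warnings above.
    """
    if overflow:
        return max(scale / 2, min_scale)
    return scale

scale = 128.0  # fp16_init_scale from the Namespace above
for _ in range(6):  # six overflows were reported before the log settled
    scale = update_loss_scale(scale, overflow=True)
print(scale)  # 2.0, matching loss_scale=2.000 in the later log lines
```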
I have tried removing the `--fp16` flag, and here is the log:
| distributed init (rank 2): tcp://localhost:12653
| distributed init (rank 0): tcp://localhost:12653
| distributed init (rank 1): tcp://localhost:12653
| distributed init (rank 3): tcp://localhost:12653
Namespace(adam_betas='(0.9, 0.997)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de_big', attention_dropout=0.1, bucket_cap_mb=150, clip_norm=25, criterion='label_smoothed_cross_entropy', data=['./data/train_data'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:12653', distributed_port=12654, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=False, fp16_init_scale=128, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=50, lr=[0.0008], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=2000, max_sentences_valid=2000, max_source_positions=1024, max_target_positions=1024, max_tokens=3200, max_update=500000, min_loss_scale=0.0001, min_lr=1e-10, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./checkpoints', save_interval=1, save_interval_updates=600, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', train_subset='train', update_freq=[12], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, 
warmup_updates=4000, weight_decay=0.0)
| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 319717376
| training on 4 GPUs
| max tokens per GPU = 3200 and max sentences per GPU = 2000
| NOTICE: your device may support faster training with --fp16
| epoch 001: 50 / 1891 loss=14.582, nll_loss=14.378, ppl=21286.60, wps=20621, ups=0.2, wpb=104342, bsz=5187, num_updates=51, lr=1.02987e-05, gnorm=3.509, clip=0%, oom=0, wall=324, train_wall=251
| epoch 001: 100 / 1891 loss=13.768, nll_loss=13.468, ppl=11333.83, wps=20624, ups=0.2, wpb=104666, bsz=5273, num_updates=101, lr=2.02975e-05, gnorm=2.250, clip=0%, oom=0, wall=579, train_wall=499
| epoch 001: 150 / 1891 loss=13.181, nll_loss=12.809, ppl=7176.27, wps=20597, ups=0.2, wpb=104571, bsz=5298, num_updates=151, lr=3.02962e-05, gnorm=1.971, clip=0%, oom=0, wall=833, train_wall=746
| epoch 001: 200 / 1891 loss=12.734, nll_loss=12.295, ppl=5026.19, wps=20624, ups=0.2, wpb=104738, bsz=5305, num_updates=201, lr=4.0295e-05, gnorm=1.803, clip=0%, oom=0, wall=1087, train_wall=993
| epoch 001: 250 / 1891 loss=12.396, nll_loss=11.901, ppl=3825.51, wps=20636, ups=0.2, wpb=104947, bsz=5339, num_updates=251, lr=5.02937e-05, gnorm=1.784, clip=0%, oom=0, wall=1343, train_wall=1241
| epoch 001: 300 / 1891 loss=12.121, nll_loss=11.581, ppl=3063.27, wps=20655, ups=0.2, wpb=105062, bsz=5350, num_updates=301, lr=6.02925e-05, gnorm=1.708, clip=0%, oom=0, wall=1597, train_wall=1489
| epoch 001: 350 / 1891 loss=11.881, nll_loss=11.302, ppl=2524.68, wps=20650, ups=0.2, wpb=105005, bsz=5339, num_updates=351, lr=7.02912e-05, gnorm=1.629, clip=0%, oom=0, wall=1851, train_wall=1735
| epoch 001: 400 / 1891 loss=11.669, nll_loss=11.055, ppl=2127.60, wps=20665, ups=0.2, wpb=105057, bsz=5336, num_updates=401, lr=8.029e-05, gnorm=1.608, clip=0%, oom=0, wall=2105, train_wall=1982
| epoch 001: 450 / 1891 loss=11.480, nll_loss=10.834, ppl=1825.88, wps=20640, ups=0.2, wpb=104942, bsz=5329, num_updates=451, lr=9.02887e-05, gnorm=1.564, clip=0%, oom=0, wall=2359, train_wall=2230
| epoch 001: 500 / 1891 loss=11.308, nll_loss=10.634, ppl=1589.20, wps=20630, ups=0.2, wpb=104885, bsz=5325, num_updates=501, lr=0.000100287, gnorm=1.540, clip=0%, oom=0, wall=2613, train_wall=2477
| epoch 001: 550 / 1891 loss=11.151, nll_loss=10.451, ppl=1400.25, wps=20620, ups=0.2, wpb=104813, bsz=5314, num_updates=551, lr=0.000110286, gnorm=1.501, clip=0%, oom=0, wall=2867, train_wall=2724
| epoch 001 | valid on 'valid' subset | valid_loss 8.07049 | valid_nll_loss 6.84806 | valid_ppl 115.21 | num_updates 600
What's more, the GPU type is Tesla V100-SXM2, and its memory is 16 GB.
Looking at the wps, I'm worried about my training speed.
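(Editor's note: comparing the two logs above, ~55k wps with `--fp16` versus ~20.6k wps without is roughly the 2-3x speedup the earlier reply said to expect on V100s, so mixed precision itself appears to be working. A quick check using representative wps values from each log:)

```python
fp16_wps = 55519  # last reported wps in the --fp16 log above
fp32_wps = 20620  # typical wps in the log without --fp16
ratio = fp16_wps / fp32_wps
print(round(ratio, 1))  # 2.7, within the expected 2-3x range
```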
Your latest comment has a different architecture than your first comment: you changed from `transformer_wmt_en_de` to `transformer_wmt_en_de_big`. Different architectures will have different training speeds, so it'll be helpful to keep that consistent.
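(Editor's note: the architecture switch also explains the large parameter count. A back-of-the-envelope tally, assuming the standard `transformer_wmt_en_de_big` layout from the Namespace in the log (1024-dim embeddings, 4096-dim FFN, 6+6 layers, parameter-free sinusoidal positions) together with the unshared 40K/50K vocabularies, reproduces the "num. model params: 319717376" line. This is a sketch of the accounting, not fairseq's own code:)

```python
# Back-of-the-envelope parameter count for transformer_wmt_en_de_big
# with untied 40K source / 50K target vocabularies (share_all_embeddings=False).
d, ffn = 1024, 4096

attn = 4 * (d * d + d)                       # q/k/v/out projections + biases
ffn_block = (d * ffn + ffn) + (ffn * d + d)  # two linear layers + biases
ln = 2 * d                                   # layer-norm gain + bias

enc_layer = attn + ffn_block + 2 * ln        # self-attn + FFN
dec_layer = 2 * attn + ffn_block + 3 * ln    # self-attn + cross-attn + FFN

embeddings = (40000 + 50000) * d             # untied source + target embeddings
out_proj = 50000 * d                         # untied decoder output projection

total = 6 * enc_layer + 6 * dec_layer + embeddings + out_proj
print(total)  # 319717376 -- matches "num. model params" in the log
```

Note that the untied embeddings and output projection alone account for about 143M parameters, which is where most of the gap versus a shared-vocabulary setup comes from.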
Another thing I forgot to mention is that you should install Apex: https://github.com/NVIDIA/apex/. Make sure to install Apex with the CUDA and C++ extensions. Fairseq will automatically pick them up, and it should improve speed by a decent amount.
Also can you upgrade to a newer version of fairseq? I see from the directory name it says 0.6.0.
Thank you very much for your advice. According to your suggestion, my problem has been solved perfectly. Thank you once again!
Hello, I'm a newcomer to NLP. I have installed CUDA, cuDNN, NCCL and PyTorch myself, but I don't know whether my training process is normal. Here is my training log:
I found that the wps is about 50K. Is this normal? I am worried that there are problems with the cuDNN or NCCL installation, which could lead to slow training speed.