facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Multiple GPU speedup #324

Open asolano opened 3 years ago

asolano commented 3 years ago

Greetings!

While trying to train the model from scratch following the documentation, we ran into the following situation: the time per epoch is almost the same with 1 GPU as with 2 (same brand and model).

The logs with 1 GPU:

INFO - 12/11/20 15:31:54 - 0:00:54 - ============ Starting epoch 0 ... ============
INFO - 12/11/20 16:10:56 - 0:39:56 - ============ End of epoch 0 ============
INFO - 12/11/20 16:12:03 - 0:41:03 - epoch -> 0.000000
INFO - 12/11/20 16:12:03 - 0:41:03 - __log__:{"epoch": 0, "valid_en_mlm_ppl": 1583.4120675715023, "valid_en_mlm_acc": 5.642639552946398, "valid_mlm_ppl": 1583.4120675715023, "valid_mlm_acc": 5.642639552946398, "test_en_mlm_ppl": 1620.3790404090796, "test_en_mlm_acc": 5.763325402151588, "test_mlm_ppl": 1620.3790404090796, "test_mlm_acc": 5.763325402151588}
INFO - 12/11/20 16:12:06 - 0:41:06 - ============ Starting epoch 1 ... ============
INFO - 12/11/20 16:52:39 - 1:21:39 - ============ End of epoch 1 ============
INFO - 12/11/20 16:53:41 - 1:22:40 - epoch -> 1.000000
INFO - 12/11/20 16:53:42 - 1:22:41 - __log__:{"epoch": 1, "valid_en_mlm_ppl": 1448.9948795385392, "valid_en_mlm_acc": 5.634569993342613, "valid_mlm_ppl": 1448.9948795385392, "valid_mlm_acc": 5.634569993342613, "test_en_mlm_ppl": 1482.7229495495092, "test_en_mlm_acc": 5.803998129054563, "test_mlm_ppl": 1482.7229495495092, "test_mlm_acc": 5.803998129054563}
INFO - 12/11/20 16:53:47 - 1:22:47 - ============ Starting epoch 2 ... ============
INFO - 12/11/20 17:34:04 - 2:03:04 - ============ End of epoch 2 ============
INFO - 12/11/20 17:35:06 - 2:04:06 - epoch -> 2.000000
INFO - 12/11/20 17:35:06 - 2:04:06 - __log__:{"epoch": 2, "valid_en_mlm_ppl": 1346.196470707116, "valid_en_mlm_acc": 5.725352538885191, "valid_mlm_ppl": 1346.196470707116, "valid_mlm_acc": 5.725352538885191, "test_en_mlm_ppl": 1380.3711707742302, "test_en_mlm_acc": 5.8446708559575375, "test_mlm_ppl": 1380.3711707742302, "test_mlm_acc": 5.8446708559575375}
INFO - 12/11/20 17:35:11 - 2:04:11 - ============ Starting epoch 3 ... ============

So about 40 minutes per epoch (for this system).

With 2 GPUs:

INFO - 12/14/20 09:40:41 - 0:00:57 - ============ Starting epoch 0 ... ============
INFO - 12/14/20 10:25:57 - 0:46:13 - ============ End of epoch 0 ============
INFO - 12/14/20 10:25:57 - 0:46:13 - ============ End of epoch 0 ============
INFO - 12/14/20 10:26:56 - 0:47:12 - epoch -> 0.000000
INFO - 12/14/20 10:26:56 - 0:47:12 - __log__:{"epoch": 0, "valid_en_mlm_ppl": 1547.8094786700792, "valid_en_mlm_acc": 5.6446569428473445, "valid_mlm_ppl": 1547.8094786700792, "valid_mlm_acc": 5.6446569428473445, "test_en_mlm_ppl": 1585.7715112157587, "test_en_mlm_acc": 5.79179631098367, "test_mlm_ppl": 1585.7715112157587, "test_mlm_acc": 5.79179631098367}
INFO - 12/14/20 10:26:57 - 0:47:13 - epoch -> 0.000000
INFO - 12/14/20 10:26:57 - 0:47:13 - ============ Starting epoch 1 ... ============
INFO - 12/14/20 10:26:59 - 0:47:15 - ============ Starting epoch 1 ... ============
INFO - 12/14/20 11:12:05 - 1:32:21 - ============ End of epoch 1 ============
INFO - 12/14/20 11:12:05 - 1:32:21 - ============ End of epoch 1 ============
INFO - 12/14/20 11:13:03 - 1:33:19 - epoch -> 1.000000
INFO - 12/14/20 11:13:03 - 1:33:19 - __log__:{"epoch": 1, "valid_en_mlm_ppl": 1367.4183438302562, "valid_en_mlm_acc": 5.6446569428473445, "valid_mlm_ppl": 1367.4183438302562, "valid_mlm_acc": 5.6446569428473445, "test_en_mlm_ppl": 1404.7183259102362, "test_en_mlm_acc": 5.820267219815753, "test_mlm_ppl": 1404.7183259102362, "test_mlm_acc": 5.820267219815753}
INFO - 12/14/20 11:13:04 - 1:33:20 - epoch -> 1.000000
INFO - 12/14/20 11:13:04 - 1:33:20 - ============ Starting epoch 2 ... ============
INFO - 12/14/20 11:13:07 - 1:33:23 - ============ Starting epoch 2 ... ============
INFO - 12/14/20 11:58:07 - 2:18:23 - ============ End of epoch 2 ============
INFO - 12/14/20 11:58:07 - 2:18:23 - ============ End of epoch 2 ============
INFO - 12/14/20 11:59:06 - 2:19:22 - epoch -> 2.000000
INFO - 12/14/20 11:59:06 - 2:19:22 - __log__:{"epoch": 2, "valid_en_mlm_ppl": 1272.2453845315492, "valid_en_mlm_acc": 5.689039520668159, "valid_mlm_ppl": 1272.2453845315492, "valid_mlm_acc": 5.689039520668159, "test_en_mlm_ppl": 1295.0331200124099, "test_en_mlm_acc": 5.9036463099668515, "test_mlm_ppl": 1295.0331200124099, "test_mlm_acc": 5.9036463099668515}
INFO - 12/14/20 11:59:06 - 2:19:22 - epoch -> 2.000000
INFO - 12/14/20 11:59:06 - 2:19:22 - ============ Starting epoch 3 ... ============

So about 47-48 minutes per epoch, i.e. slightly slower than with 1 GPU.
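
One possible explanation we considered: if each of the processes launched by torch.distributed.launch consumes --epoch_size sentences per epoch on its own (an assumption about XLM's bookkeeping that we have not verified in the code), then the 2-GPU run is actually processing twice as much data per "epoch", and throughput would be the fairer comparison. A rough sketch of that arithmetic, using the epoch times from the logs above:

# Back-of-envelope throughput comparison for the two runs above.
# ASSUMPTION: with torch.distributed.launch, each of the N processes
# consumes --epoch_size sentences per epoch, so the data per "epoch"
# scales with the number of GPUs. Not verified against the XLM code.

EPOCH_SIZE = 300_000  # --epoch_size from the commands below

runs = {
    "1 GPU": {"gpus": 1, "minutes_per_epoch": 40.0},
    "2 GPU": {"gpus": 2, "minutes_per_epoch": 47.5},
}

for name, run in runs.items():
    sentences = EPOCH_SIZE * run["gpus"]  # under the assumption above
    sents_per_sec = sentences / (run["minutes_per_epoch"] * 60)
    print(f"{name}: ~{sents_per_sec:.0f} sentences/s")

# If the assumption holds, this prints ~125 sentences/s for 1 GPU and
# ~211 sentences/s for 2 GPUs, i.e. roughly a 1.7x throughput speedup
# despite the longer wall-clock epochs.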

For reference, the commands are:

1 GPU

$ python train.py \
--exp_name xlm_en \
--dump_path ./dumped \
--data_path $OUTPATH \
--lgs 'en' \
--clm_steps '' \
--mlm_steps 'en' \
--emb_dim 512 \
--n_layers 12 \
--n_heads 16 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--batch_size 32 \
--bptt 256 \
--optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001 \
--epoch_size 300000 \
--max_epoch 100000 \
--validation_metrics _valid_en_mlm_ppl \
--stopping_criterion _valid_en_mlm_ppl,25 \
--fp16 true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--amp 1 &> out_1gpu.txt

2 GPUs (same parameters; only multi-GPU support was added)

$ export NGPU=2
$ python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
--exp_name xlm_en \
--dump_path ./dumped \
--data_path $OUTPATH \
--lgs 'en' \
--clm_steps '' \
--mlm_steps 'en' \
--emb_dim 512 \
--n_layers 12 \
--n_heads 16 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--batch_size 32 \
--bptt 256 \
--optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001 \
--epoch_size 300000 \
--max_epoch 100000 \
--validation_metrics _valid_en_mlm_ppl \
--stopping_criterion _valid_en_mlm_ppl,25 \
--fp16 true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--amp 1 &> out_2gpu.txt 
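
Separately, to rule out a slow interconnect between the two cards, we could run a minimal NCCL all-reduce benchmark independent of XLM (plain PyTorch distributed; the script name, tensor size, and iteration count below are arbitrary choices, not anything from the repo):

# ddp_allreduce_bench.py -- minimal NCCL all-reduce benchmark
# (hypothetical file name; plain PyTorch, independent of XLM).
import argparse
import time

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT come from the launcher

# 256 MB of float32, roughly the order of magnitude of the model's gradients
x = torch.randn(64 * 1024 * 1024, device="cuda")

for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

n_iters = 20
start = time.time()
for _ in range(n_iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.time() - start

if args.local_rank == 0:
    gigabytes = x.numel() * 4 * n_iters / 1e9
    print(f"all_reduce effective bandwidth: {gigabytes / elapsed:.1f} GB/s")

Launched the same way as the training run:

$ python -m torch.distributed.launch --nproc_per_node=2 ddp_allreduce_bench.py

If this reports very low bandwidth, the extra minutes per epoch would likely be communication overhead rather than anything XLM-specific.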

Is this the expected behavior? Is time per epoch the wrong metric to measure performance in this case?

Any insight on this would be much appreciated.