facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Can you check whether my BERT pretraining is normal or not? #154

Open gaopengcuhk opened 5 years ago

gaopengcuhk commented 5 years ago

I followed your BERT pretraining setup. However, after one week of training, the loss is still around 7.3. I use 8 GPUs with a batch size of 14; everything else is the default.

INFO - 08/01/19 14:20:20 - 2:02:04 - 3550 - 6.84 sent/s - 267.24 words/s - MLM-en: 7.5730 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:20 - 2:02:04 - 3550 - 6.84 sent/s - 258.91 words/s - MLM-en: 7.5390 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:20 - 2:02:04 - 3550 - 6.84 sent/s - 256.56 words/s - MLM-en: 7.5296 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:30 - 2:02:14 - 3555 - 6.83 sent/s - 260.45 words/s - MLM-en: 7.4089 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:30 - 2:02:14 - 3555 - 6.83 sent/s - 260.38 words/s - MLM-en: 7.4161 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:30 - 2:02:14 - 3555 - 6.83 sent/s - 253.03 words/s - MLM-en: 7.5330 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:30 - 2:02:14 - 3555 - 6.83 sent/s - 263.87 words/s - MLM-en: 7.4091 - - model LR: 1.0000e-04
INFO - 08/01/19 14:20:30 - 2:02:14 - 3555 - 6.83 sent/s - 264.74 words/s - MLM-en: 7.4053 - - model LR: 1.0000e-04

gaopengcuhk commented 5 years ago

The MLM-en loss drops to around 7.5 within a few iterations and then never changes afterwards.

glample commented 5 years ago

Can you provide your train.log file?

gaopengcuhk commented 5 years ago

I will send my log to your email in a few days. It seems to be a problem with SLURM: when I train BERT on my own server, the loss goes down to 3.4 in a few hours, but when I use my company's cluster with SLURM, the loss never goes down.

gaopengcuhk commented 5 years ago

Here is the command I used to run single-node, 8-GPU distributed training:

srun --gres=gpu:8 -p clusterRTX -c 24 python -m torch.distributed.launch --nproc_per_node=8 train.py

I ran into the following assertion failure: assert params.local_rank == -1

Then I deleted the line assert params.local_rank == -1.

The code then runs successfully. However, all processes seem to use the same GPU, leaving the remaining GPUs unused. Digging deeper into your code, I found that int(os.environ['SLURM_LOCALID']) gives the same local ID to every process.
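For context: torch.distributed.launch spawns all eight workers inside the single task that srun created, so SLURM only sees one task and exports the same SLURM_LOCALID to every worker. In that launch mode the per-process index normally comes from the --local_rank argument (or, in newer PyTorch versions, the LOCAL_RANK environment variable) that the launcher itself provides. A minimal sketch of that convention, not XLM's actual code:

import argparse
import os

import torch

# torch.distributed.launch passes a distinct --local_rank to every worker it
# spawns; SLURM_LOCALID is only distinct when SLURM itself starts one task
# per GPU, which is not the case with a single srun task.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", -1)))
args = parser.parse_args()

# each worker then binds to its own GPU instead of piling onto cuda:0
if args.local_rank >= 0:
    torch.cuda.set_device(args.local_rank)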

Then I modified your code as follows:

# multi-GPU job (local or multi-node) - jobs started with torch.distributed.launch

elif params.local_rank != -1:

    assert params.master_port == -1

    # read environment variables
    params.global_rank = int(os.environ['RANK'])
    params.world_size = int(os.environ['WORLD_SIZE'])
    params.n_gpu_per_node = 8  # number of GPUs per node, hardcoded to 8 here

    # number of nodes / node ID
    params.n_nodes = params.world_size // params.n_gpu_per_node
    params.node_id = params.global_rank // params.n_gpu_per_node

The code now runs successfully on the cluster, but the loss does not converge.
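One thing worth double-checking here: if each worker does not pin its own device and join the process group, the eight processes train independent copies whose gradients are never averaged, and the loss plateaus in exactly this way. A hedged sketch of the two calls that matter, assuming the params fields from the snippet above and the env:// rendezvous that torch.distributed.launch sets up; it is meant as a checklist of what every rank needs to end up doing, not as XLM's exact code:

import torch
import torch.distributed as dist

# params is the same namespace as in the snippet above;
# params.local_rank is the value torch.distributed.launch passed on the command line
params.is_master = params.global_rank == 0

# pin this worker to its own GPU; otherwise every process defaults to cuda:0
torch.cuda.set_device(params.local_rank)

# gradients are only averaged across the 8 workers once every rank has joined
# the process group; MASTER_ADDR / MASTER_PORT are exported by
# torch.distributed.launch, hence init_method="env://"
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=params.world_size,
    rank=params.global_rank,
)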

aconneau commented 5 years ago

Hmm, not sure; we'll need your logs. Also, you might have forgotten some SLURM parameters: --ncpu 8 --ngpu 8 --ntasks 8 --nodes 1 (and possibly --constraint="volta32gb" for fp16)?
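For reference, the SLURM code path in XLM expects srun itself to start one task per GPU; only then are the SLURM environment variables distinct per process. A rough sketch of what that convention provides (an illustration, not the exact code in src/slurm.py):

import os

# with srun --ntasks=8 --ntasks-per-node=8, SLURM exports per-task values
global_rank = int(os.environ["SLURM_PROCID"])   # 0..7 across the whole job
local_rank = int(os.environ["SLURM_LOCALID"])   # 0..7 within the node
world_size = int(os.environ["SLURM_NTASKS"])    # 8
n_nodes = int(os.environ["SLURM_NNODES"])       # 1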

gaopengcuhk commented 5 years ago

srun -p clusterRTX --gres=gpu:8 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=3 \
    python train.py \
    --exp_name xlm_en \
    --dump_path ./dumped \
    --data_path ./data/processed/XLM_en/30k/ \
    --lgs 'en' \
    --clm_steps '' \
    --mlm_steps 'en' \
    --emb_dim 2048 \
    --n_layers 8 \
    --n_heads 16 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 8 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_en_mlm_ppl \
    --stopping_criterion _valid_en_mlm_ppl,25 \
    --fp16 false \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'

This is my script.

I ran into the following issue:

https://github.com/facebookresearch/XLM/issues/159