Problem with BERT pre-training: how to train on multiple GPUs

yangshuo0323 commented 3 years ago

Description

I want to train a BERT model on GPUs, but I am running into problems. My configuration:
- Software environment: Python 3.7.7, CUDA 10.2
- MXNet installation: pip install mxnet-cu102 (version 1.7.0)
- Model scripts: downloaded from https://github.com/dmlc/gluon-nlp
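
Before launching a multi-GPU job with a configuration like the one above, it can help to confirm that the pieces are visible from the same Python environment. The following is only a minimal sketch, not part of the gluon-nlp scripts: the helper name `check_environment` is hypothetical, and it assumes that `mx.context.num_gpus()` is available in the installed MXNet 1.x build. It reports whether `mxnet`, `horovod`, and `mpirun` can be found, and how many GPUs MXNet sees.

```python
import importlib.util
import shutil


def check_environment():
    """Report which parts of a multi-GPU MXNet/Horovod setup are visible.

    Hypothetical helper for illustration; not part of gluon-nlp or MXNet.
    """
    report = {
        # True if the package can be located, without importing it yet.
        "mxnet": importlib.util.find_spec("mxnet") is not None,
        "horovod": importlib.util.find_spec("horovod") is not None,
        # True if an mpirun executable is on PATH.
        "mpirun": shutil.which("mpirun") is not None,
        # Filled in below only when MXNet is importable.
        "num_gpus": None,
    }
    if report["mxnet"]:
        try:
            import mxnet as mx

            # Assumption: num_gpus() exists in this MXNet 1.x build.
            report["num_gpus"] = mx.context.num_gpus()
        except Exception as exc:  # e.g. missing CUDA libraries at import time
            report["num_gpus"] = f"query failed: {exc}"
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `num_gpus` is lower than the process count passed to `mpirun -np`, or `horovod` is reported as missing, the launch command is unlikely to work as written, so it is worth resolving that first.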

I ran the script gluon-nlp/scripts/bert/run_pretraining.py, following the instructions at https://nlp.gluon.ai/model_zoo/bert/index.html#bert-model-zoo:

$  mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \
 -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
 --mca plm_rsh_agent 'ssh -q -o StrictHostKeyChecking=no' \
 -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=INFO -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
 -x MXNET_SAFE_ACCUMULATION=1 --tag-output \
python run_pretraining.py --verbose --model="bert_12_768_12" --warmup_ratio=1 --comm_backend="horovod" \
--accumulate=1 --max_seq_length=128 --raw --max_predictions_per_seq=20 --log_interval=1 --ckpt_interval=1000 \
--no_compute_acc --data=/home/yangshuo/mxnet/Dataset/pre-train-datasets/enwiki-feb-doc-split/*.train \
--num_steps=1000 --total_batch_size=128 --dtype="float16"

Result error:

Seek help:

Could someone give me the correct instructions, or any suggestions? Thanks.

github-actions[bot] commented 3 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

szha commented 3 years ago

This issue is being handled in https://github.com/dmlc/gluon-nlp/issues/1508

apache / mxnet

Problem with BERT pre-training: how to train on multiple GPUs #19800
