BERT multi-node pretraining hangs when using Horovod with HOROVOD_GPU_OPERATION=MPI

Description

BERT Pretraining with Horovod hangs intermittently in the middle of the training, this happens around 1 out of 4 runs and hangs at mostly around 4400 steps

Currently I can only reproduce it with Horovod CPU mode built with HOROVOD_GPU_OPERATION=MPI, but cannot reproduce with HOROVOD_GPU_OPERATION=NCCL

More inspection shows that the hangs happens in one of the rank process the datasetloader creating a lot of threads, causing that process to hang, and other process will wait for that process causing the whole training to hang

Error Message

From mpirun output error message regarding the hanging process

[1,56]<stderr>:Exception in thread Thread-1:
[1,56]<stderr>:Traceback (most recent call last):
[1,56]<stderr>:  File "/shared/mx_env/lib/python3.8/threading.py", line 932, in _bootstrap_inner
[1,56]<stderr>:    self.run()
[1,56]<stderr>:  File "/shared/mx_env/lib/python3.8/threading.py", line 870, in run
[1,56]<stderr>:    self._target(*self._args, **self._kwargs)
[1,56]<stderr>:  File "/shared/mx_env/lib/python3.8/multiprocessing/managers.py", line 192, in accepter
[1,56]<stderr>:    t.start()
[1,56]<stderr>:  File "/shared/mx_env/lib/python3.8/threading.py", line 852, in start
[1,56]<stderr>:    _start_new_thread(self._bootstrap, ())
[1,56]<stderr>:RuntimeError: can't start new thread

If putting error logs, at the time of hanging the batchify worker pool function won't be run https://github.com/dmlc/gluon-nlp/blob/v0.10.x/src/gluonnlp/data/datasetloader.py#L55

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

(Paste the commands you ran that produced the error.)

Have 8 p3dn.24xlarge machines and form a cluster that can run mpi, write all hostname into a host file
Mount a shared fsx drive and have a conda env with installation of latest Horovod and gluonnlp

Download BERT data to shared folder and run pretraining using 0.10.0 scripts

/opt/amazon/openmpi/bin/mpirun --hostfile /shared/hostfile -N 8 --allow-run-as-root --mca plm_rsh_no_tree_spawn 1 \
--tag-output -mca btl_tcp_if_include ens5 -x RDMAV_FORK_SAFE=1 -x LD_LIBRARY_PATH -x PATH -x \
NCCL_SOCKET_IFNAME=ens5 --oversubscribe -x MXNET_SAFE_ACCUMULATION=1 -x FI_PROVIDER="efa" \
/shared/mx_env/bin/python /shared/gluon-nlp/scripts/bert/run_pretraining.py --data='/shared/bert_data/*.train' \
--data_eval='/shared/bert_eval/*.dev' --num_steps 125000 --lr 1e-4 --total_batch_size 2048 --total_batch_size_eval 2048 \
--accumulate 1 --raw --comm_backend horovod --max_seq_length 128 --max_predictions_per_seq 20 \
--num_dataset_workers 4 --num_batch_workers 2 --circle_length 2 --verbose --log_interval 1

Can comment out backward and train.step to make reproducing faster

What have you tried to solve it?

Try reproducing with NCCL to narrow down the issue

Environment

Horovod=0.20.0 with HOROVOD_GPU_OPERATION=MPI gluonnlp==0.10.0 mxnet_cu101==1.6.0

dmlc / gluon-nlp