NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Distributed training hangs on dist.init_process_group() #552

Open · saurin078 opened this issue 2 years ago

saurin078 commented 2 years ago

Hi everyone!

Whenever I try to run distributed training using:

```
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
```

The code reaches `dist.init_process_group()` but hangs there and never moves forward. I'm running on an AWS EC2 instance of the g4dn.12xlarge instance type (4 GPUs on a single machine).
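In case it helps narrow things down, here is a minimal standalone sketch (my own, not from the repo) that could be used to check whether `dist.init_process_group()` also hangs outside of `train.py`. It assumes the `nccl` backend and a `tcp://` rendezvous; the port number and the `RANK`/`WORLD_SIZE` environment variables are choices for this test only, not values taken from the repo:

```python
# check_dist.py -- minimal sketch to test dist.init_process_group() in isolation.
# Assumptions: nccl backend with a tcp:// rendezvous; the port (54321) and the
# RANK / WORLD_SIZE environment variables are picked for this test, not taken
# from train.py.
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # If this call hangs too, the problem is NCCL / networking setup on the
    # instance rather than anything specific to Tacotron 2.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:54321",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)

    # A tiny all_reduce confirms the process group works end to end.
    t = torch.ones(1).cuda(rank)
    dist.all_reduce(t)
    print("rank {}/{}: all_reduce ok, value = {}".format(rank, world_size, t.item()))


if __name__ == "__main__":
    main()
```

Launching one process per GPU (e.g. `RANK=0 WORLD_SIZE=4 python check_dist.py` through `RANK=3 WORLD_SIZE=4 python check_dist.py`) with `NCCL_DEBUG=INFO` set should show whether the rendezvous itself completes and, if not, where NCCL gets stuck.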

I am using the following package versions:

```
Package                  Version
absl-py                  1.0.0
astor                    0.8.1
audioread                2.1.9
cached-property          1.5.2
cachetools               4.2.4
certifi                  2021.10.8
cffi                     1.14.4
charset-normalizer       2.0.12
cycler                   0.11.0
dataclasses              0.8
decorator                5.1.1
future                   0.18.2
gast                     0.2.2
google-auth              2.6.0
google-auth-oauthlib     0.4.6
google-pasta             0.2.0
grpcio                   1.44.0
h5py                     3.1.0
idna                     3.3
importlib-metadata       4.8.3
inflect                  0.2.5
joblib                   1.1.0
Keras-Applications       1.0.8
Keras-Preprocessing      1.1.2
librosa                  0.6.0
llvmlite                 0.31.0
Markdown                 3.3.6
matplotlib               2.1.0
numba                    0.48.0
numpy                    1.16.0
oauthlib                 3.2.0
olefile                  0.46
opt-einsum               3.3.0
Pillow                   8.3.2
pip                      21.3.1
protobuf                 3.19.4
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.21
pyparsing                3.0.7
python-dateutil          2.8.2
pytz                     2021.3
requests                 2.27.1
requests-oauthlib        1.3.1
resampy                  0.2.2
rsa                      4.8
scikit-learn             0.24.2
scipy                    1.0.0
setuptools               58.0.4
six                      1.16.0
tensorboard              1.15.0
tensorboard-data-server  0.6.1
tensorboard-plugin-wit   1.8.1
tensorflow               1.15.2
tensorflow-estimator     1.15.1
termcolor                1.1.0
threadpoolctl            3.1.0
torch                    1.5.0
torchvision              0.6.0a0+82fd1c8
typing_extensions        4.1.1
Unidecode                1.0.22
urllib3                  1.26.8
Werkzeug                 2.0.3
wheel                    0.37.1
wrapt                    1.14.0
zipp                     3.6.0
```

Any help would be greatly appreciated!!