Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180549130/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3 #13400

Closed himanshucodz55 closed 2 years ago

himanshucodz55 commented 2 years ago

🐛 Bug

I am trying to train a model in distributed mode using the espnet2 librispeech recipe, but it fails after a few minutes. I am using two Oracle instances, each with a single GPU.

Error after running the code:

python3 -m espnet2.bin.asr_train --multiprocessing_distributed true --ngpu 1 --dist_rank 1 --dist_world_size 2 --dist_master_addr 192.9.138.186 --dist_master_port 8894 --use_preprocessor true --bpemodel none --token_type phn --token_list /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/data/en_token_list/phn/tokens.txt --non_linguistic_symbols none --cleaner none --g2p g2p_en --valid_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/dev_set/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/dev_set/text,text,text --valid_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//valid/speech_shape --valid_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//valid/text_shape.phn --resume true --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic --config /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/conf/train_asr.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/train_5hr/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/train_5hr/text,text,text --train_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//train/speech_shape --train_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//train/text_shape.phn --ngpu 1 --multiprocessing_distributed True 
 Started at Fri Jun 24 05:15:11 UTC 2022

/home/ubuntu/users/srikanth/envs/espnet_env/envs/espnet_env/bin/python3 /home/ubuntu/users/srikanth/espnet/espnet2/bin/asr_train.py --multiprocessing_distributed true --ngpu 1 --dist_rank 1 --dist_world_size 2 --dist_master_addr 192.9.138.186 --dist_master_port 8894 --use_preprocessor true --bpemodel none --token_type phn --token_list /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/data/en_token_list/phn/tokens.txt --non_linguistic_symbols none --cleaner none --g2p g2p_en --valid_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/dev_set/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/dev_set/text,text,text --valid_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//valid/speech_shape --valid_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//valid/text_shape.phn --resume true --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic --config /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/conf/train_asr.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/train_5hr/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/dump/raw/train_5hr/text,text,text --train_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//train/speech_shape --train_shape_file /home/ubuntu/users/bramhendra/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_phn//train/text_shape.phn --ngpu 1 --multiprocessing_distributed True
hp-distributed-02:4619:4619 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.202<0>
hp-distributed-02:4619:4619 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-distributed-02:4619:4619 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
hp-distributed-02:4619:4619 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.202<0>
hp-distributed-02:4619:4619 [0] NCCL INFO Using network Socket
hp-distributed-02:4619:4767 [0] NCCL INFO Call to connect returned Connection timed out, retrying
hp-distributed-02:4619:4767 [0] NCCL INFO Call to connect returned Connection timed out, retrying

hp-distributed-02:4619:4767 [0] include/socket.h:409 NCCL WARN Net : Connect to 10.0.0.196<52027> failed : Connection timed out
hp-distributed-02:4619:4767 [0] NCCL INFO bootstrap.cc:360 -> 2
hp-distributed-02:4619:4767 [0] NCCL INFO init.cc:501 -> 2
hp-distributed-02:4619:4767 [0] NCCL INFO init.cc:904 -> 2
hp-distributed-02:4619:4767 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
Traceback (most recent call last):
  File "/home/ubuntu/users/srikanth/envs/espnet_env/envs/espnet_env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/users/srikanth/envs/espnet_env/envs/espnet_env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/users/srikanth/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/ubuntu/users/srikanth/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/ubuntu/users/srikanth/espnet/espnet2/tasks/abs_task.py", line 1011, in main
    cls.main_worker(args)
  File "/home/ubuntu/users/srikanth/espnet/espnet2/tasks/abs_task.py", line 1307, in main_worker
    cls.trainer.run(
  File "/home/ubuntu/users/srikanth/espnet/espnet2/train/trainer.py", line 219, in run
    dp_model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/users/srikanth/envs/espnet_env/envs/espnet_env/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180549130/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
 Accounting: time=410 threads=1
 Ended (code 1) at Fri Jun 24 05:22:01 UTC 2022, elapsed time 410 seconds
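
Note on the log above: the first failure is `Connect to 10.0.0.196<52027> failed : Connection timed out`, so rank 1 (10.0.0.202) could not open a TCP connection to the other node on a port other than the configured master port 8894. A firewall or security rule between the two instances is one plausible cause, although nothing in this thread confirms it. The following is a minimal sketch, assuming plain Python on both nodes, for checking whether a TCP port is reachable. The helper names `can_connect` and `check_endpoints` are hypothetical and are not part of espnet2, PyTorch, or NCCL.

```python
import socket
from typing import Dict, Iterable, Tuple


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(
    endpoints: Iterable[Tuple[str, int]], timeout: float = 5.0
) -> Dict[Tuple[str, int], bool]:
    """Probe each (host, port) pair and map it to True (reachable) or False."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}


# Example call (hypothetical usage, run from the rank 1 node while rank 0 is up):
#   check_endpoints([("192.9.138.186", 8894)])
# The address and port are the --dist_master_addr / --dist_master_port values
# from the command above. This only tests the master port; NCCL also connects
# on other ports chosen at run time (52027 in the log), which must be allowed too.
```

A probe only succeeds if something is listening on the target port, so a `False` result for an idle port does not by itself prove that a firewall is blocking it. Running with the environment variable `NCCL_DEBUG=INFO`, as the log above already appears to do, shows which interface and address NCCL selects on each node.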

Expected behavior

The code runs in a conda environment with PyTorch.

Script that I used:

Running stage 11 of the espnet2 librispeech recipe:

${python} -m espnet2.bin.asr_train \
            --multiprocessing_distributed true \
            --ngpu 1 \
            --dist_rank 0 \
            --dist_world_size 2 \
            --dist_master_addr "192.9.138.186" \
            --dist_master_port "8894"

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

akihironitta commented 2 years ago

@himanshucodz55 I don't see how your error is related to PyTorch Lightning. Could you elaborate if you think it is relevant?

awaelchli commented 2 years ago

@himanshucodz55 Could you please report this issue on https://github.com/espnet/espnet? Closing this, as espnet2 does not use PyTorch Lightning. There must have been some confusion.