ai4d-iasc / trixie

Scripts and documentation about the trixie HPC cluster

Wider allowed port range #77

Open SamuelLarkin opened 2 years ago

SamuelLarkin commented 2 years ago

Hi, I'm trying to run the new Sockeye-3 across multiple nodes with multiple GPUs and it fails. I opened a ticket with the Sockeye project, and their hypothesis is that the allowed port range is too small. Sockeye-3 uses PyTorch 1.10 and NCCL and tries to create a C10D rendezvous service to synchronize the workers, but there is no way to specify the port: it is chosen at random.

My request is to widen the allowed port range on trixie's worker nodes and the head node.
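For reference, here is a minimal way to see which ephemeral range the kernel is handing out and which ports a running job actually binds (the process-name filter in the grep is only a guess at how the job shows up in the process list):

# Ephemeral port range currently used for unbound sockets
cat /proc/sys/net/ipv4/ip_local_port_range

# Listening ports bound by the rendezvous / training processes on this node
ss -tlnp 2>/dev/null | grep -E 'torchrun|sockeye|python'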

Note for myself: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes

source /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/tools/activate
sbatch train.slurm
nrcfieldsa commented 2 years ago

In order to allow a wider port range, both the head nodes and the compute nodes will need the following kernel tunable set:

net.ipv4.ip_local_port_range = 1192 65535

As of April 6, 2022, 5:32 PM, all the trixie head and compute nodes are set to 32768-60999.

nrcfieldsa commented 2 years ago

The kernel sysctl setting has been applied to the trixie head nodes and compute nodes in /etc/sysctl.conf, and is now set correctly to support jobs that initiate many connections starting at lower port numbers. A backup of the original configuration is in /root/etc_sysctl.d_99-sysctl.conf.bak.
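For anyone repeating this on another node, the change boils down to standard sysctl usage along these lines (a sketch only; the value shown is the range requested above, and the append assumes the key is not already present in /etc/sysctl.conf):

# Apply immediately on the running kernel
sysctl -w net.ipv4.ip_local_port_range="1192 65535"

# Persist across reboots, then reload and verify
echo 'net.ipv4.ip_local_port_range = 1192 65535' >> /etc/sysctl.conf
sysctl -p
sysctl net.ipv4.ip_local_port_range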

Further troubleshooting shows that this is not what is currently stopping the training job. There is a separate HPC SLURM / application-specific issue, tracked in the upstream project's issue tracker, that is still preventing communication or is related to task/GPU resource allocation across nodes.

nrcfieldsa commented 2 years ago

Settings have been re-applied:

net.ipv4.ip_local_port_range = 2048 65000

Please confirm this is working post-upgrade.
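For a quick check from a user account, the value can be read on the head node and on a compute node through SLURM (the node name below is just an example):

# On the head node
cat /proc/sys/net/ipv4/ip_local_port_range

# On a compute node via SLURM
srun --nodes=1 --nodelist=cn135 cat /proc/sys/net/ipv4/ip_local_port_range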

nrcfieldsa commented 2 years ago

@SamuelLarkin : Can this issue be resolved now?

SamuelLarkin commented 2 years ago

I'm still unable to run sockeye on trixie. I get the following error message.

Sockeye-3-multi-nodes-159008.1.out

Started sockeye.train at Wed Aug 24 14:18:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=32805 --rdzv_id=159008 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:32916 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30193:30193 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30193:30193 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2
cn135:30193:30246 [0] NCCL INFO Channel 00/02 :    0   1
cn135:30193:30246 [0] NCCL INFO Channel 01/02 :    0   1
cn135:30193:30246 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30193:30246 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/IB/0

cn135:30193:30246 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
cn135:30193:30246 [0] NCCL INFO transport/net_ib.cc:640 -> 2
cn135:30193:30246 [0] NCCL INFO include/net.h:23 -> 2
cn135:30193:30246 [0] NCCL INFO transport/net.cc:223 -> 2
cn135:30193:30246 [0] NCCL INFO transport.cc:111 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:778 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:904 -> 2
cn135:30193:30246 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
    sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
  File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
    train(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
    resume_training = check_resume(args, output_folder)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
    torch.distributed.barrier()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30193) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:18:45
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30193)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
      resume_training = check_resume(args, output_folder)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
      torch.distributed.barrier()
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
      work = default_pg.barrier(opts=opts)
  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
  ncclSystemError: System call (socket, malloc, munmap, etc) failed.

============================================================

real    0m18.540s
user    0m3.384s
sys     0m6.924s
SamuelLarkin commented 2 years ago

It looks like InfiniBand is not properly configured, or at least not set up to be compatible with torch.distributed. If I set NCCL_IB_DISABLE=1, I get a bit further: the master node seems to start and train, but the worker node fails.
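For reference, the workaround and some extra NCCL diagnostics can be exported in the sbatch script before the torchrun call. These are standard NCCL environment variables; the interface name is only a guess based on the ib0 addresses in the logs:

# Fall back from InfiniBand verbs to plain TCP sockets
export NCCL_IB_DISABLE=1
# Verbose NCCL logging to see where the transport fails
export NCCL_DEBUG=INFO
# Pin NCCL's socket transport to one interface (ib0, as seen in the logs)
export NCCL_SOCKET_IFNAME=ib0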

Sockeye-3-multi-nodes-159009.out

Master Log

The key error message is cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>, which seems to indicate that the whole run failed because the worker node's connection dropped.

Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30932:30932 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30932:30932 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn135:30932:30932 [0] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
cn135:30932:30989 [0] NCCL INFO Channel 00/02 :    0   1
cn135:30932:30989 [0] NCCL INFO Channel 01/02 :    0   1
cn135:30932:30989 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30932:30989 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Connected all rings
cn135:30932:30989 [0] NCCL INFO Connected all trees
cn135:30932:30989 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn135:30932:30989 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn135:30932:30989 [0] NCCL INFO comm 0x7ffef0000fa0 rank 0 nranks 2 cudaDev 0 busId 89000 - Init COMPLETE
cn135:30932:30932 [0] NCCL INFO Launch mode Parallel
[INFO:sockeye.utils] Sockeye: 3.1.9, commit unknown, path /home/larkins/git/sockeye/sockeye/__init__.py
[INFO:sockeye.utils] PyTorch: 1.11.0 (/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/__init__.py)

...

[INFO:sockeye.data_io] Shuffling the shards.
[INFO:sockeye.data_io] Loading shard corpora/prepared.shared_vocab/shard.00008.
[INFO:sockeye.data_io] Replicating bucket of 1 sentence(s) 2 times to cover 2 splits.

cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>
cn135:30932:30993 [0] NCCL INFO transport/net_socket.cc:414 -> 2
cn135:30932:30993 [0] NCCL INFO include/net.h:28 -> 2
cn135:30932:30993 [0] NCCL INFO transport/net.cc:459 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:351 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:452 -> 2 [Proxy Thread]
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
    sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
  File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
    train(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
    train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
  File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
    train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
    train_iter = ShardedParallelSampleIter(shard_fnames,
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
    self.reset()
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
    self._load_shard()
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
    dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
    target_num_samples = max(utils.all_gather_object(target_num_samples))
  File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
    torch.distributed.all_gather_object(obj_list, obj)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
    all_gather(output_tensors, input_tensor, group=group)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30932) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:53:53
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30932)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
      train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
    File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
      train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
      train_iter = ShardedParallelSampleIter(shard_fnames,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
      self.reset()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
      self._load_shard()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
      dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
      target_num_samples = max(utils.all_gather_object(target_num_samples))
    File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
      torch.distributed.all_gather_object(obj_list, obj)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
      all_gather(output_tensors, input_tensor, group=group)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
      work = default_pg.allgather([tensor_list], [tensor])
  RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
  ncclSystemError: System call (socket, malloc, munmap, etc) failed.

============================================================

real    30m30.189s
user    31m38.458s
sys     28m35.649s

Worker Log

Key error messages:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.

Full Log

Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn136:970:970 [1] NCCL INFO Bootstrap : Using ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn136:970:970 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn136:970:970 [1] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO Using network Socket
cn136:970:1023 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
cn136:970:1023 [1] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Connected all rings
cn136:970:1023 [1] NCCL INFO Connected all trees
cn136:970:1023 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn136:970:1023 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn136:970:1023 [1] NCCL INFO comm 0x7ffefc000fa0 rank 1 nranks 2 cudaDev 1 busId 8a000 - Init COMPLETE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:53:47
  host      : cn136
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 970)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
      train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
    File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
      train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
      train_iter = ShardedParallelSampleIter(shard_fnames,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
      self.reset()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1742, in reset
      self.shards_fnames = utils.broadcast_object(self.shards_fnames)
    File "/home/larkins/git/sockeye/sockeye/utils.py", line 609, in broadcast_object
      torch.distributed.broadcast_object_list(obj_list, src=src)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1869, in broadcast_object_list
      broadcast(object_sizes_tensor, src=src, group=group)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
      work = default_pg.broadcast([tensor], opts)
  RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
  Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
  frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fff8e9961bd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
  frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x6c (0x7fff8e99290c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
  frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7fffcd49bfef in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7fffcd49cf71 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7fffcd49cffb in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7fff8fcf7834 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7fff8fcfb8c9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #9: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x341 (0x7fff8fd06c21 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #10: <unknown function> + 0x801f49 (0x7fffd54bef49 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
  frame #11: <unknown function> + 0x1e5d37 (0x7fffd4ea2d37 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
  frame #12: PyCFunction_Call + 0x6e (0x55555568fe7e in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #13: _PyObject_MakeTpCall + 0x501 (0x555555678631 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #14: <unknown function> + 0x13bbfd (0x55555568fbfd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #15: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #16: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #17: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #18: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #19: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #20: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #21: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #22: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #23: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #24: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #25: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #26: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #27: _PyEval_EvalFrameDefault + 0x67d (0x55555566fd9d in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #28: _PyEval_EvalCodeWithName + 0x7d7 (0x55555566e957 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #29: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #30: <unknown function> + 0x1373b8 (0x55555568b3b8 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #31: _PyObject_MakeTpCall + 0x51c (0x55555567864c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #32: _PyEval_EvalFrameDefault + 0x4ebf (0x5555556745df in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #33: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #34: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #35: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #36: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #37: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #38: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #39: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #40: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #41: PyObject_Call + 0x2d2 (0x555555692172 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #42: _PyEval_EvalFrameDefault + 0x2150 (0x555555671870 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #43: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #44: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #45: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #46: _PyFunction_Vectorcall + 0xf6 (0x5555556802a6 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #47: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #48: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #49: PyEval_EvalCodeEx + 0x39 (0x55555572dde9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #50: PyEval_EvalCode + 0x1b (0x55555572ddab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #51: <unknown function> + 0x1fa903 (0x55555574e903 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #52: <unknown function> + 0x1f98e3 (0x55555574d8e3 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #53: <unknown function> + 0x99f2f (0x5555555edf2f in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #54: PyRun_SimpleFileExFlags + 0x364 (0x5555555eda23 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #55: <unknown function> + 0x8d0ac (0x5555555e10ac in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #56: Py_BytesMain + 0x39 (0x555555722219 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #57: __libc_start_main + 0xf5 (0x7ffff6f02555 in /lib64/libc.so.6)
  frame #58: <unknown function> + 0x1ce125 (0x555555722125 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)

============================================================

real    30m30.101s
user    0m6.278s
sys     0m11.023s
NRCGavin commented 1 year ago

I discovered a misconfiguration of the head node's firewall today that probably caused the timeout error you saw in the logs.

Please try again when you have time and report back.
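In case it helps when re-testing, the head node's firewall state can be inspected along these lines (assuming firewalld is in use; with plain iptables the equivalent would be iptables -L -n):

# Active zone, allowed services and open port ranges on the head node
firewall-cmd --list-all

# Check whether a given port range is open, e.g. the re-applied ephemeral range
firewall-cmd --query-port=2048-65000/tcp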