Open braceal opened 3 years ago
The line failing: https://github.com/DeepDriveMD/DeepDriveMD-pipeline/blob/dbg/integration/deepdrivemd/models/aae/train.py#L181
The Traceback:
Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 180, in main
    dist.init_process_group(backend="nccl", init_method="env://")
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
Useful references:
This was likely a Lassen-specific issue.
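For anyone who hits the same `connect() timed out` error: with `init_method="env://"`, PyTorch reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment, and the non-zero ranks connect to the store at `MASTER_ADDR:MASTER_PORT`. Below is a minimal diagnostic sketch, not part of DeepDriveMD, that could be run on a compute node before `dist.init_process_group` is called. The helper names `missing_rendezvous_vars` and `can_reach_master` are hypothetical, and the check assumes rank 0 is already up and listening; a failure before that point says nothing.

```python
import os
import socket

# Environment variables read by the env:// rendezvous in torch.distributed.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def can_reach_master(host, port, timeout=5.0):
    """Try a plain TCP connection to host:port (hypothetical helper).

    Only meaningful once rank 0 has started listening on that port.
    """
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and DNS failures;
        # ValueError covers a non-numeric port string.
        return False


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("unset rendezvous variables:", ", ".join(missing))
    elif os.environ["RANK"] != "0" and not can_reach_master(
        os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]
    ):
        print("cannot reach master at",
              os.environ["MASTER_ADDR"] + ":" + os.environ["MASTER_PORT"])
    else:
        print("rendezvous environment looks usable")
```

If the variables are set but the connection still fails, the cause is more likely on the cluster side (hostname resolution between nodes, a blocked port, or the wrong network interface) than in `train.py`.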