initializing torch distributed ...
Traceback (most recent call last):
  File "pretrain_bart.py", line 133, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/CPT/pretrain/megatron/training.py", line 94, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/CPT/pretrain/megatron/initialize.py", line 78, in initialize_megatron
    finish_mpu_init()
  File "/CPT/pretrain/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/CPT/pretrain/megatron/initialize.py", line 181, in _initialize_distributed
    torch.distributed.init_process_group(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 142, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Hello, when I run the code it reports the error shown above. I have confirmed that the port is not in use; could you please help me figure this out? My settings are as follows:

GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=8514
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
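As a side note for anyone debugging this: per the traceback, the tcp:// rendezvous has rank 0 start a TCPStore server on MASTER_ADDR:MASTER_PORT, so the bind must succeed at that moment. A quick way to double-check that the address is really bindable is to try binding it directly. This is a minimal sketch, not part of the CPT code; the host and port are taken from the settings above:

```python
import socket

# Rank 0's TCPStore binds MASTER_ADDR:MASTER_PORT; if this bind() fails
# with "Address already in use", init_process_group will fail the same way.
# Note: a socket left in TIME_WAIT by a previous (crashed or killed) run
# can also trigger this even when no process is currently listening.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    try:
        s.bind(("localhost", 8514))
        print("localhost:8514 is bindable")
    except OSError as exc:
        print(f"localhost:8514 is not bindable: {exc}")
```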