fastnlp / CPT

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

Error when running sh run_pretrain_bart.sh #69

Open jackychancjcjcj opened 1 year ago

jackychancjcjcj commented 1 year ago

Hello, when I run the code I get the error below. I have confirmed that the port is not occupied. Could you please help me figure this out? My settings are as follows:

GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=8514
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

Error:

initializing torch distributed ...
Traceback (most recent call last):
  File "pretrain_bart.py", line 133, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/CPT/pretrain/megatron/training.py", line 94, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/CPT/pretrain/megatron/initialize.py", line 78, in initialize_megatron
    finish_mpu_init()
  File "/CPT/pretrain/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/CPT/pretrain/megatron/initialize.py", line 181, in _initialize_distributed
    torch.distributed.init_process_group(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 142, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
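As a sanity check, whether the port is really free can be tested on the same machine with a plain socket bind, which fails with the same "Address already in use" error if another process (for example a leftover pretraining run) still holds it. This is a minimal sketch using only the Python standard library; the port number 8514 is taken from the settings above:

```python
import socket

port = 8514  # MASTER_PORT from the launch settings above

# Try to bind the rendezvous port; if it is held by another process,
# bind() raises OSError("Address already in use"), mirroring the
# failure seen inside torch.distributed's TCPStore rendezvous.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    try:
        s.bind(("", port))
        print(f"port {port} looks free")
    except OSError as err:
        print(f"port {port} is already in use: {err}")
```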

shivanraptor commented 2 weeks ago

Do you really need multi-node training? If not, you can simply disable the multi-node features.
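Note that, per the traceback above, even a single-node run still goes through torch.distributed.init_process_group, so the rendezvous port has to be free either way. If the goal is only to get one node running, a common work-around is to let the OS pick an unused port and feed that value to MASTER_PORT. A minimal sketch (the helper file name get_free_port.py is hypothetical, and this assumes the launch script simply exports MASTER_PORT as shown above):

```python
import socket

# Bind to port 0 so the OS assigns any currently unused TCP port,
# then print it for the launch script to capture, e.g.
#   MASTER_PORT=$(python get_free_port.py)   # hypothetical helper name
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))
    print(s.getsockname()[1])
```

There is a small race window between printing the port and the launcher binding it, but this avoids collisions with long-running services on fixed ports.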