When I use the `accelerate config` command, I set the parameters as follows:
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/no]: yes
Do you wish to optimize your script with torch dynamo?[yes/no]:no
Do you want to use DeepSpeed? [yes/no]: no
Do you want to use FullyShardedDataParallel? [yes/no]: no
Do you want to use Megatron-LM ? [yes/no]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
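For reference, these answers end up in ~/.cache/huggingface/accelerate/default_config.yaml. A rough sketch of what that file looks like for my answers (illustrative only; the exact keys depend on the accelerate version):

```yaml
# Illustrative sketch of the generated default_config.yaml, not a verbatim copy;
# key names and defaults vary slightly between accelerate versions.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
debug: true            # "Should distributed operations be checked ..." -> yes
machine_rank: 0
num_machines: 1
num_processes: 2
gpu_ids: 0,1
main_training_function: main
mixed_precision: 'no'  # the model is loaded in fp32
use_cpu: false
```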
When I run `accelerate launch main.py --temperature 0.2 --n_samples 1`, the program gets stuck:
Selected Tasks: ['humaneval']
Loading model in fp32
Loading model via these GPUs & max memories: {0: '40GB', 1: '40GB'}
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.36s/it]
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.39s/it]
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
number of problems for this task is 164
0%| | 0/82 [00:00<?, ?it/s]
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801145 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63148) of binary: /root/anaconda/envs/bigcode/bin/python
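From what I understand, 1800000 ms is just the default 30-minute NCCL collective timeout, so the ALLGATHER on rank 0 is waiting for the other rank, which never arrives. The timeout itself can apparently be raised from the script with accelerate's `InitProcessGroupKwargs`, roughly as in the sketch below (this assumes the script creates its own `Accelerator`; it only delays the failure and does not fix whatever makes the other rank hang):

```python
# Sketch only: raise the NCCL collective timeout on the process group that the
# Accelerator creates. This does not address the underlying hang; it just gives
# slow ranks more time before the watchdog aborts the communicator.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

timeout_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[timeout_kwargs])
```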
I checked the network and the hardware, and reinstalled accelerate several times. I found that after processing one piece of data an error is reported, and I have already set `export NCCL_P2P_DISABLE=1`.
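For reference, this is roughly how I launch the run with P2P disabled; `NCCL_DEBUG=INFO` is only an optional extra to get more NCCL detail into the logs, not something shown in the output below:

```bash
# Launch with peer-to-peer transfers disabled; NCCL_DEBUG=INFO is optional and
# only adds NCCL transport/initialization details to the logs.
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
accelerate launch 01.py --temperature 0.2 --n_samples 1
```

With that set, the run still dies right after the first step: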
number of problems for this task is 164
1%|▋ | 1/82 [00:54<1:13:20, 54.33s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 61144 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 61145) of binary: /root/anaconda/envs/bigcode1/bin/python
Traceback (most recent call last):
File "/root/anaconda/envs/bigcode1/bin/accelerate", line 8, in
sys.exit(main())
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
01.py FAILED
Failures:
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-22_12:57:04
host : rt-res-public9-6f8f8bd4fc-92zc9
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 61145)
error_file:
traceback : Signal 6 (SIGABRT) received by PID 61145
======================================================
(bigcode1) root@rt-res-public9-6f8f8bd4fc-92zc9:/public9_data/zs/zs/2080ti-bigcode# accelerate launch 01.py --temperature 0.2 --n_samples 1
Loading model in fp32
Loading model via these GPUs & max memories: {0: '30GB', 1: '30GB'}
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Selected Tasks: ['humaneval']
Loading model in fp32
Loading model via these GPUs & max memories: {0: '30GB', 1: '30GB'}
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:655: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:655: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
number of problems for this task is 164
1%|▊ | 1/82 [00:55<1:14:43, 55.35s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 61469 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 61470) of binary: /root/anaconda/envs/bigcode1/bin/python
Traceback (most recent call last):