When I use the `accelerate config` command, I set the parameters as follows:
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/no]: yes
Do you wish to optimize your script with torch dynamo?[yes/no]:no
Do you want to use DeepSpeed? [yes/no]: no
Do you want to use FullyShardedDataParallel? [yes/no]: no
Do you want to use Megatron-LM ? [yes/no]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
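For reference, these answers end up in ~/.cache/huggingface/accelerate/default_config.yaml. A rough sketch of what that file looks like for my answers (illustrative only; the exact keys depend on the accelerate version):

```yaml
# Illustrative sketch of the generated default_config.yaml, not a verbatim copy;
# key names and defaults vary slightly between accelerate versions.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
debug: true            # "Should distributed operations be checked ..." -> yes
machine_rank: 0
num_machines: 1
num_processes: 2
gpu_ids: 0,1
main_training_function: main
mixed_precision: 'no'  # the model is loaded in fp32
use_cpu: false
```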
When I run `accelerate launch main.py --temperature 0.2 --n_samples 1`, the program gets stuck:
Selected Tasks: ['humaneval']
Loading model in fp32
Loading model via these GPUs & max memories: {0: '40GB', 1: '40GB'}
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.36s/it]
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.39s/it]
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
number of problems for this task is 164
0%| | 0/82 [00:00<?, ?it/s]
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801145 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63148) of binary: /root/anaconda/envs/bigcode/bin/python
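From what I understand, 1800000 ms is just the default 30-minute NCCL collective timeout, so the ALLGATHER on rank 0 is waiting for the other rank, which never arrives. The timeout itself can apparently be raised from the script with accelerate's `InitProcessGroupKwargs`, roughly as in the sketch below (this assumes the script creates its own `Accelerator`; it only delays the failure and does not fix whatever makes the other rank hang):

```python
# Sketch only: raise the NCCL collective timeout on the process group that the
# Accelerator creates. This does not address the underlying hang; it just gives
# slow ranks more time before the watchdog aborts the communicator.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

timeout_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[timeout_kwargs])
```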
I checked the network and the hardware, and reinstalled accelerate several times. I found that after processing one piece of data an error is reported, and I have already set `export NCCL_P2P_DISABLE=1`.
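For reference, this is roughly how I launch the run with P2P disabled; `NCCL_DEBUG=INFO` is only an optional extra to get more NCCL detail into the logs, not something shown in the output below:

```bash
# Launch with peer-to-peer transfers disabled; NCCL_DEBUG=INFO is optional and
# only adds NCCL transport/initialization details to the logs.
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
accelerate launch 01.py --temperature 0.2 --n_samples 1
```

With that set, the run still dies right after the first step: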
number of problems for this task is 164
1%|▋ | 1/82 [00:54<1:13:20, 54.33s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 61144 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 61145) of binary: /root/anaconda/envs/bigcode1/bin/python
Traceback (most recent call last):
File "/root/anaconda/envs/bigcode1/bin/accelerate", line 8, in
sys.exit(main())
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
01.py FAILED
Failures:
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-22_12:57:04
host : rt-res-public9-6f8f8bd4fc-92zc9
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 61145)
error_file:
traceback : Signal 6 (SIGABRT) received by PID 61145
======================================================
(bigcode1) root@rt-res-public9-6f8f8bd4fc-92zc9:/public9_data/zs/zs/2080ti-bigcode# accelerate launch 01.py --temperature 0.2 --n_samples 1
Loading model in fp32
Loading model via these GPUs & max memories: {0: '30GB', 1: '30GB'}
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Selected Tasks: ['humaneval']
Loading model in fp32
Loading model via these GPUs & max memories: {0: '30GB', 1: '30GB'}
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:655: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:655: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
number of problems for this task is 164
1%|▊ | 1/82 [00:55<1:14:43, 55.35s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 61469 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 61470) of binary: /root/anaconda/envs/bigcode1/bin/python
Traceback (most recent call last):