FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing full-stack capability for inference, training, and deployment.
https://funaudiollm.github.io/
Apache License 2.0

Training the LLM to the end of an epoch, GPU communication seems to time out #517

Open CriDora opened 2 weeks ago

CriDora commented 2 weeks ago

Hello, and thank you for open-sourcing this project. When I train on my own dataset, an error is reported at the end of the first epoch of training. The error message is as follows:

```
2024-10-18 20:47:35,180 DEBUG TRAIN Batch 15/37600 loss 1.247700 acc 0.343740 lr 0.00055931 grad_norm 0.152029 rank 1
2024-10-18 20:48:27,683 DEBUG TRAIN Batch 15/37700 loss 1.389090 acc 0.288246 lr 0.00055927 grad_norm 0.190002 rank 1
2024-10-18 20:48:27,683 DEBUG TRAIN Batch 15/37700 loss 1.543046 acc 0.267136 lr 0.00055927 grad_norm 0.190002 rank 0
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
Traceback (most recent call last):
  File "cosyvoice/bin/train.py", line 144, in <module>
    main()
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "cosyvoice/bin/train.py", line 139, in main
    executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
  File "/s1home/lhw523/CosyVoice/cosyvoice/utils/executor.py", line 78, in train_one_epoc
    self.step += 1
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/algorithms/join.py", line 276, in __exit__
    join_hook.main_hook()
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 225, in main_hook
    work = ddp._check_global_requires_backward_grad_sync(
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1248, in _check_global_requires_backward_grad_sync
    work = dist.all_reduce(
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1702, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
Fatal Python error: Aborted

Thread 0x00007f8730fc9700 (most recent call first):

Thread 0x00007f87317ca700 (most recent call first):

Thread 0x00007f87fdc3a700 (most recent call first):
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/threading.py", line 306 in wait
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/queue.py", line 179 in get
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f88e5c544c0 (most recent call first):
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/subprocess.py", line 1015 in communicate
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/subprocess.py", line 495 in run
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/subprocess.py", line 415 in check_output
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 27 in is_nfs_path
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 45 in default_cache_dir
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 82 in __init__
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 173 in _update_autotune_table
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 450 in _update_autotune_table
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 473 in matmul_ext_update_autotune_table
[E ProcessGroupGloo.cpp:138] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
2024-10-18 21:18:49,413 INFO Detected uneven workload distribution: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:5860
Break current worker to manually join all workers, world_size 2, current rank 1, current local_rank 1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 146249 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 146248) of binary: /home/lhw523/anaconda3/envs/cosyvoice/bin/python
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_sekn5ont/1986_io_0ag_u/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/home/lhw523/anaconda3/envs/cosyvoice/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cosyvoice/bin/train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-10-18_21:18:48
  host : xju-aslp4
  rank : 0 (local_rank: 0)
  exitcode : -6 (pid: 146248)
  error_file: /tmp/torchelastic_sekn5ont/1986_io_0ag_u/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "cosyvoice/bin/train.py", line 139, in main
      executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
    File "/s1home/lhw523/CosyVoice/cosyvoice/utils/executor.py", line 78, in train_one_epoc
      self.step += 1
    File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/algorithms/join.py", line 276, in __exit__
      join_hook.main_hook()
    File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 225, in main_hook
      work = ddp._check_global_requires_backward_grad_sync(
    File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1248, in _check_global_requires_backward_grad_sync
      work = dist.all_reduce(
    File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
      return func(*args, **kwargs)
    File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1702, in all_reduce
      work = group.allreduce([tensor], opts)
  RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
============================================================
```
aluminumbox commented 2 weeks ago

Try setting partition=False for train_dataset in train_utils.py. If that works, it means the problem is caused by data imbalance.
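
For anyone else hitting this, the change being suggested is the `partition` argument passed when the training `Dataset` is built in `cosyvoice/utils/train_utils.py`. A minimal sketch of the edit, assuming the WeNet-style `Dataset(...)` call used there; the exact argument list may differ in your checkout:

```python
# cosyvoice/utils/train_utils.py, where the training dataset is constructed
# (sketch only -- check the actual call in your version of the file)
train_dataset = Dataset(args.train_data,
                        data_pipeline=configs['data_pipeline'],
                        mode='train',
                        shuffle=True,
                        partition=False)   # was partition=True
```

With `partition=False`, every rank iterates the full file list instead of only its own slice, so no rank can run out of batches before the others.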

CriDora commented 2 weeks ago

> Try setting partition=False for train_dataset in train_utils.py. If that works, it means the problem is caused by data imbalance.

Thank you very much! After changing it to False, the problem no longer occurs. But could you explain what kind of data imbalance you mean?

aluminumbox commented 2 weeks ago

> > Try setting partition=False for train_dataset in train_utils.py. If that works, it means the problem is caused by data imbalance.
>
> Thank you very much! After changing it to False, the problem no longer occurs. But could you explain what kind of data imbalance you mean?

For example, if you have 10 parquet files and 2 GPUs, partition assigns 5 files to each worker. But if all of the data in one worker's files happens to be filtered out, that worker ends up with no data at all, and you get a timeout.
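
To make that failure mode concrete, here is a small self-contained illustration (not CosyVoice code; `partition_files`, `usable_utterances`, the shard names, and the length filter are hypothetical stand-ins for the dataset's file split and filtering stages):

```python
# Sketch: how per-rank file partitioning plus filtering can leave one rank with
# zero batches, so the other rank blocks in allreduce until the 30-minute
# NCCL watchdog (Timeout(ms)=1800000) aborts the job.

def partition_files(files, rank, world_size, partition=True):
    # partition=True: each rank only sees its own slice of the file list.
    # partition=False: every rank iterates the full list.
    if not partition:
        return list(files)
    return [f for i, f in enumerate(files) if i % world_size == rank]

def usable_utterances(files, lengths, min_len=1, max_len=2000):
    # Stand-in for the length/token filters in the data pipeline.
    return [f for f in files if min_len <= lengths[f] <= max_len]

files = [f"shard_{i:02d}.parquet" for i in range(10)]
# Hypothetical case: every shard that lands on rank 1 gets filtered out.
lengths = {f: (50 if i % 2 == 0 else 5000) for i, f in enumerate(files)}

for rank in range(2):  # world_size = 2
    assigned = partition_files(files, rank, world_size=2, partition=True)
    kept = usable_utterances(assigned, lengths)
    print(f"rank {rank}: {len(assigned)} shards assigned, {len(kept)} left after filtering")
# rank 0: 5 shards assigned, 5 left after filtering
# rank 1: 5 shards assigned, 0 left after filtering
# -> rank 1 produces no batches; rank 0 waits on the collective and times out.
```

With `partition=False`, both ranks would see all 10 shards, at the cost of each rank iterating (and therefore duplicating) the full dataset.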