Open CriDora opened 2 weeks ago
Try setting partition=False for train_dataset in train_utils.py. If that fixes it, the problem is caused by data imbalance.
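For reference, a minimal sketch of the suggested change, assuming the training dataset in train_utils.py is built by a call roughly like the one below (only the partition flag is confirmed in this thread; the surrounding names are assumptions, not verbatim CosyVoice code):

    # Hypothetical excerpt of cosyvoice/utils/train_utils.py; only `partition`
    # is confirmed here, the other argument names are assumptions.
    train_dataset = Dataset(args.train_data,
                            data_pipeline=configs['data_pipeline'],
                            mode='train',
                            shuffle=True,
                            partition=False)  # was True; False keeps the full file list on every rank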
Thank you very much! After changing it to False the problem no longer appears. Could you explain what kind of data imbalance you mean?
For example, if you have 10 parquet files and 2 GPUs, partition assigns 5 files to each worker. If all the data in one worker's files happens to be filtered out, that worker is left with no data at all, and the collective operation times out.
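To make that concrete, here is a small self-contained Python illustration (hypothetical shard names and filter, not CosyVoice code) of how partition=True plus filtering can leave one rank with no batches while the other rank blocks in all_reduce until the 30-minute NCCL watchdog fires:

    # Illustration only: per-rank partitioning plus aggressive filtering
    # can leave one rank with nothing to train on.
    files = [f"shard_{i}.parquet" for i in range(10)]   # 10 parquet files
    world_size = 2                                      # 2 GPUs / ranks

    def keep(name):
        # Stand-in for the real data filters (length limits, bad labels, ...).
        # Suppose everything in the even-numbered shards gets filtered out.
        return int(name.split("_")[1].split(".")[0]) % 2 == 1

    for rank in range(world_size):
        partitioned = files[rank::world_size]           # partition=True: 5 files per rank
        surviving = [f for f in partitioned if keep(f)]
        print(f"rank {rank}: {len(partitioned)} files -> {len(surviving)} after filtering")

    # rank 0: 5 files -> 0 after filtering   <- this rank yields no batches and leaves the
    # rank 1: 5 files -> 5 after filtering      training loop early, while the other rank
    #                                           blocks in all_reduce until the 30-minute
    #                                           NCCL watchdog (Timeout(ms)=1800000) fires.
    # With partition=False every rank reads all 10 files, so no rank ends up empty.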
Hello, thank you for open-sourcing this project. When I train on my own dataset, an error is reported at the end of one epoch of training. The error message is as follows:
2024-10-18 20:47:35,180 DEBUG TRAIN Batch 15/37600 loss 1.247700 acc 0.343740 lr 0.00055931 grad_norm 0.152029 rank 1
2024-10-18 20:48:27,683 DEBUG TRAIN Batch 15/37700 loss 1.389090 acc 0.288246 lr 0.00055927 grad_norm 0.190002 rank 1
2024-10-18 20:48:27,683 DEBUG TRAIN Batch 15/37700 loss 1.543046 acc 0.267136 lr 0.00055927 grad_norm 0.190002 rank 0
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
Traceback (most recent call last):
  File "cosyvoice/bin/train.py", line 144, in <module>
    main()
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "cosyvoice/bin/train.py", line 139, in main
    executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
  File "/s1home/lhw523/CosyVoice/cosyvoice/utils/executor.py", line 78, in train_one_epoc
    self.step += 1
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/algorithms/join.py", line 276, in __exit__
    join_hook.main_hook()
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 225, in main_hook
    work = ddp._check_global_requires_backward_grad_sync(
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1248, in _check_global_requires_backward_grad_sync
    work = dist.all_reduce(
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/lhw523/anaconda3/envs/cosyvoice/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1702, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=377286, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805402 milliseconds before timing out.
Fatal Python error: Aborted
Thread 0x00007f8730fc9700 (most recent call first):