Closed jupinter closed 3 months ago
Are you training on LibriTTS? This is strange; I have never encountered such an error. I believe training will continue when a DDP timeout happens.
same error
Please update the code. This may be caused by a barrier exception in the earlier code, which led to the embedding not using spk_embedding/utt_embedding.
I ran into the same problem; updating the code did not fix it, and I am confused.
if info_dict["batch_idx"] != 0:
    # we try to join all ranks in both ddp and deepspeed mode, in case different ranks have different lr
    try:
        dist.monitored_barrier(group=group_join,
                               timeout=group_join.options._timeout)
        return False
    except RuntimeError as e:
        logging.info("Detected uneven workload distribution: {}\n".format(e) +
                     "Break current worker to manually join all workers, " +
                     "world_size {}, current rank {}, current local_rank {}\n".format(
                         world_size, rank, local_rank))
        return False
else:
    return False
Return False when the exception occurs. Also take care to catch exceptions at the other barrier call sites:
try:
    dist.barrier()
except RuntimeError as e:
    logging.info('except RuntimeError as e: {}'.format(e))
You can give it a try.
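The catch-and-log pattern above can be factored into a small helper so every barrier call site behaves the same way. This is only a sketch; `safe_barrier` and its signature are hypothetical and not part of the repo:

```python
def safe_barrier(barrier_fn, log=print):
    """Invoke a (monitored) barrier, swallowing RuntimeError so a
    timed-out rank logs the failure and keeps running instead of
    crashing the whole job.

    barrier_fn: zero-argument callable, e.g. lambda: dist.barrier()
    Returns True if the barrier succeeded, False on RuntimeError.
    """
    try:
        barrier_fn()
        return True
    except RuntimeError as e:
        log('barrier failed, continuing: {}'.format(e))
        return False
```

Usage would then be `safe_barrier(lambda: dist.barrier())` at each call site, so the exception-handling policy lives in one place.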
OK. May I ask, if dist.monitored_barrier fails, will that cause any problems?
This issue comes from the multi-GPU distributed data implementation; single-GPU training currently does not have this problem.
Found a solution: set partition to False in the dataset.
OK. May I ask, if dist.monitored_barrier fails, will that cause any problems?
It is OK when dist.monitored_barrier returns False; training will start the next epoch.
In my experiment, which fine-tunes the model on my own dataset, it is actually caused by a data-distribution problem (DistributedSampler has a bug) when using multiple GPUs. Although the program doesn't break, it restarts the next epoch and hits the problem again, with the result that the model only trains on very little data.
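A rough illustration of how the uneven workload arises: if samples are dealt round-robin to ranks and the dataset size is not a multiple of world_size, some ranks get one batch fewer and reach the end-of-epoch barrier early while the others are still in send/recv, which is exactly the timeout in the logs. This is a pure-Python sketch with hypothetical numbers, not the actual sampler code:

```python
def shard(num_samples, world_size):
    """Per-rank sample counts under round-robin partitioning,
    as a sampler without padding/drop_last would produce."""
    return [len(range(rank, num_samples, world_size))
            for rank in range(world_size)]

# 10 samples over 4 ranks: ranks 0-1 get 3 batches, ranks 2-3 get 2,
# so ranks 2-3 finish the epoch first and wait at the barrier.
print(shard(10, 4))  # [3, 3, 2, 2]
```

Padding the dataset to a multiple of world_size (or dropping the tail) is the usual way to make the per-rank counts equal.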
Found a solution: set partition to False in the dataset.
But then every device will train on the same data.
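Roughly, the trade-off behind the partition flag looks like this. A sketch only; `iter_shard` is a hypothetical stand-in, not the actual dataset code:

```python
def iter_shard(data, rank, world_size, partition=True):
    """partition=True: each rank sees a disjoint slice, which can be
    uneven and trigger the barrier timeout.
    partition=False: every rank iterates the full dataset, so workloads
    match but all devices train on identical data."""
    if partition:
        return data[rank::world_size]
    return list(data)

data = list(range(10))
print([len(iter_shard(data, r, 4)) for r in range(4)])
# [3, 3, 2, 2] -> uneven, ranks can stall at the barrier
print([len(iter_shard(data, r, 4, partition=False)) for r in range(4)])
# [10, 10, 10, 10] -> even, but every rank repeats the same samples
```

So partition=False trades away data parallelism for stability, which matches the objection above: each device ends up training on the same data.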
I also met the same problem. I tried to train the model in a distributed manner (two nodes with 8 GPUs each), but got the following errors:
INFO Detected uneven workload distribution: Rank 7 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception:
[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 30000ms for recv operation to complete
Break current worker to manually join all workers, world_size 16, current rank 7, current local_rank 7
Running SFT on 4 A800s, at around step 1700 the following problem appears with both torch_ddp and deepspeed:
[E ProcessGroupGloo.cpp:138] Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] [Rank 0]: Rank 2 failed to pass monitoredBarrier in 30000 ms
[E ProcessGroupGloo.cpp:138] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
2024-07-21 00:11:29,529 INFO Detected uneven workload distribution: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception:
[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 30000ms for recv operation to complete
Break current worker to manually join all workers, world_size 4, current rank 3, current local_rank 3
Any ideas or suggestions on this? Thanks!