Closed jupinter closed 3 months ago
Are you training on LibriTTS? This is strange; I have never encountered such an error. I believe training will continue when a DDP timeout happens.
same error
Please update the code. This may be caused by a barrier exception in the earlier code, which led to the embedding not using spk_embedding/utt_embedding.
I ran into the same problem; updating the code did not fix it, and I am confused.
if info_dict["batch_idx"] != 0:
    # we try to join all ranks in both ddp and deepspeed mode, in case different ranks have different lr
    try:
        dist.monitored_barrier(group=group_join,
                               timeout=group_join.options._timeout)
        return False
    except RuntimeError as e:
        logging.info("Detected uneven workload distribution: {}\n".format(e) +
                     "Break current worker to manually join all workers, " +
                     "world_size {}, current rank {}, current local_rank {}\n".format(
                         world_size, rank, local_rank))
        return False
else:
    return False
Return False when the exception occurs. Also take care to catch exceptions at the other barrier call sites:
try:
    dist.barrier()
except RuntimeError as e:
    logging.info('except RuntimeError as e: {}'.format(e))
You can give it a try.
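The catch-and-log pattern above can be factored into a small helper so every barrier call site behaves the same way. This is only a sketch; `safe_barrier` and its signature are hypothetical and not part of the repo:

```python
def safe_barrier(barrier_fn, log=print):
    """Invoke a (monitored) barrier, swallowing RuntimeError so a
    timed-out rank logs the failure and keeps running instead of
    crashing the whole job.

    barrier_fn: zero-argument callable, e.g. lambda: dist.barrier()
    Returns True if the barrier succeeded, False on RuntimeError.
    """
    try:
        barrier_fn()
        return True
    except RuntimeError as e:
        log('barrier failed, continuing: {}'.format(e))
        return False
```

Usage would then be `safe_barrier(lambda: dist.barrier())` at each call site, so the exception-handling policy lives in one place.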
OK. May I ask, if dist.monitored_barrier fails, will that cause any problems?
This issue comes from the multi-GPU distributed data implementation; single-GPU training currently does not have this problem.
Found a solution: set partition to False in the dataset.
OK. May I ask, if dist.monitored_barrier fails, will that cause any problems?
It is OK when dist.monitored_barrier returns False; training will start the next epoch.
In my experiment, which fine-tunes the model on my own dataset, it is actually caused by a data-distribution problem (DistributedSampler has a bug) when using multiple GPUs. Although the program doesn't break, it restarts the next epoch and hits the problem again, with the result that the model only trains on very little data.
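A rough illustration of how the uneven workload arises: if samples are dealt round-robin to ranks and the dataset size is not a multiple of world_size, some ranks get one batch fewer and reach the end-of-epoch barrier early while the others are still in send/recv, which is exactly the timeout in the logs. This is a pure-Python sketch with hypothetical numbers, not the actual sampler code:

```python
def shard(num_samples, world_size):
    """Per-rank sample counts under round-robin partitioning,
    as a sampler without padding/drop_last would produce."""
    return [len(range(rank, num_samples, world_size))
            for rank in range(world_size)]

# 10 samples over 4 ranks: ranks 0-1 get 3 batches, ranks 2-3 get 2,
# so ranks 2-3 finish the epoch first and wait at the barrier.
print(shard(10, 4))  # [3, 3, 2, 2]
```

Padding the dataset to a multiple of world_size (or dropping the tail) is the usual way to make the per-rank counts equal.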
Found a solution: set partition to False in the dataset.
But then every device will train on the same data.
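Roughly, the trade-off behind the partition flag looks like this. A sketch only; `iter_shard` is a hypothetical stand-in, not the actual dataset code:

```python
def iter_shard(data, rank, world_size, partition=True):
    """partition=True: each rank sees a disjoint slice, which can be
    uneven and trigger the barrier timeout.
    partition=False: every rank iterates the full dataset, so workloads
    match but all devices train on identical data."""
    if partition:
        return data[rank::world_size]
    return list(data)

data = list(range(10))
print([len(iter_shard(data, r, 4)) for r in range(4)])
# [3, 3, 2, 2] -> uneven, ranks can stall at the barrier
print([len(iter_shard(data, r, 4, partition=False)) for r in range(4)])
# [10, 10, 10, 10] -> even, but every rank repeats the same samples
```

So partition=False trades away data parallelism for stability, which matches the objection above: each device ends up training on the same data.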
I also met the same problem. I tried to train the model in a distributed manner (two nodes with 8 GPUs each), but got the following errors:
INFO Detected uneven workload distribution: Rank 7 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception:
[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 30000ms for recv operation to complete
Break current worker to manually join all workers, world_size 16, current rank 7, current local_rank 7
Running SFT on 4 A800s, at around step 1700 the following problem appears with both torch_ddp and deepspeed:
[E ProcessGroupGloo.cpp:138] Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] [Rank 0]: Rank 2 failed to pass monitoredBarrier in 30000 ms
[E ProcessGroupGloo.cpp:138] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
2024-07-21 00:11:29,529 INFO Detected uneven workload distribution: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception:
[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 30000ms for recv operation to complete
Break current worker to manually join all workers, world_size 4, current rank 3, current local_rank 3
Any ideas or suggestions on this? Thanks!