Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Need A Barrier before GPU allgather-related Ops #2

Closed BichengYing closed 4 years ago

BichengYing commented 4 years ago

As with one-sided ops, we need to use an allreduce op as a barrier before GPU allgather-related ops in the tests. Otherwise, a segmentation fault occurs.

However, CPU ops do not need this. And if all ops are placed on "cuda:0", the barrier is not needed either. So the question is: why is it necessary in multi-GPU scenarios?

BichengYing commented 4 years ago

The problem is clear now: the restored device is wrong. Writing a CL to fix it now.
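
For readers landing here later, below is a minimal sketch of how a wrongly restored device could match the symptoms above. It is not the actual bluefog code: the thread does not show the real bug, and `get_device`, `set_device`, `with_device`, and `with_device_buggy` are hypothetical names that simulate a per-process "current device" in pure Python. The assumption is a guard that switches device for one op and then restores a fixed device instead of the one the caller had selected.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the process-wide "current CUDA device".
_current_device = 0


def get_device():
    return _current_device


def set_device(device):
    global _current_device
    _current_device = device


@contextmanager
def with_device_buggy(device):
    """Assumed failure mode: always restores device 0 on exit."""
    set_device(device)
    try:
        yield
    finally:
        set_device(0)  # wrong unless the caller was already on device 0


@contextmanager
def with_device(device):
    """Save the caller's device on entry and restore exactly that on exit."""
    previous = get_device()
    set_device(device)
    try:
        yield
    finally:
        set_device(previous)


# A rank that works on device 1 runs one op on device 2.
set_device(1)
with with_device_buggy(2):
    pass
device_after_buggy = get_device()  # the caller's device has silently changed

set_device(1)
with with_device(2):
    pass
device_after_fixed = get_device()  # the caller's device is preserved
```

Under this assumption, a process that only ever uses "cuda:0" would never notice the problem, which is consistent with the observation above that the barrier is unnecessary in that case.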

BichengYing commented 4 years ago

Thanks to @hanbinhu, this is resolved!