Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0
291 stars, 71 forks

Bluefog doesn't throw an error when CUDA memory is insufficient. #32

Closed: hanbinhu closed this issue 4 years ago

hanbinhu commented 4 years ago

When CUDA memory is insufficient, Horovod throws the following exception:

RuntimeError: CUDA error: out of memory

Our library, however, freezes and reports nothing.

Bluefog-Lib commented 4 years ago

This may be related to issue #44.

Bluefog-Lib commented 4 years ago

One finding: some processes shut down silently, without any warning, while the other processes keep running until they reach a collective communication op, where they hang forever.

BichengYing commented 4 years ago

After adding the negotiation stage, this should be solved.
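
For readers who want to see why a negotiation step helps, here is a minimal sketch of the general idea. It is not Bluefog's actual implementation. It assumes threads as stand-ins for ranks, a simulated out-of-memory error instead of a real CUDA allocation, and hypothetical names (`RankFailure`, `run_rank`, `simulate`). Each rank records its status, all ranks meet at a barrier, and only then does each one decide whether to enter the collective op. A failed rank reports its error instead of exiting silently, so its peers can abort rather than wait forever.

```python
import threading


class RankFailure(RuntimeError):
    """Hypothetical error stored for every rank when any rank reports a failure."""


def run_rank(rank, barrier, statuses, results, fail_rank):
    # Step 1: local work that may fail (stand-in for a CUDA allocation).
    try:
        if rank == fail_rank:
            # Simulated failure; no real GPU memory is involved.
            raise RuntimeError("CUDA error: out of memory")
        statuses[rank] = "ok"
    except RuntimeError as exc:
        # Record the failure instead of letting the rank die silently.
        statuses[rank] = f"error: {exc}"

    # Step 2: negotiation. Every rank, including a failed one, reaches
    # this barrier. The timeout guards against a peer that never arrives.
    barrier.wait(timeout=5.0)

    # Step 3: all ranks now see the same statuses and take the same decision.
    failed = {r: s for r, s in enumerate(statuses) if s != "ok"}
    if failed:
        results[rank] = RankFailure(
            f"rank {rank}: aborting collective op, peer failures: {failed}"
        )
    else:
        results[rank] = "collective done"


def simulate(world_size, fail_rank=None):
    """Run world_size simulated ranks and return the per-rank outcome."""
    barrier = threading.Barrier(world_size)
    statuses = [None] * world_size
    results = [None] * world_size
    threads = [
        threading.Thread(
            target=run_rank, args=(r, barrier, statuses, results, fail_rank)
        )
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `simulate(4, fail_rank=2)` leaves a `RankFailure` for all four ranks, while `simulate(3)` lets every rank complete. In a real multi-process setup the status exchange would itself be a small collective, and a rank that crashes outright would still need a timeout to be detected.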