hanbinhu closed this issue 4 years ago
maybe related to issue #44
One finding: some processes shut down silently, without any warning, while the other processes keep running until they reach a collective communication op and then freeze forever.
After adding a negotiation stage, this should be solved.
When there is not enough CUDA memory, Horovod throws the following exception:
RuntimeError: CUDA error: out of memory
However, our library will freeze and report nothing.
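To make the negotiation idea concrete, here is a minimal sketch of a handshake that runs before a collective op. It is not this library's implementation and not Horovod's API: ranks are simulated with threads, and `Negotiator`, `check_in`, and `run_ranks` are hypothetical names made up for illustration. Each rank reports its status and waits on a barrier with a timeout, so a failed or missing peer surfaces as an exception on every rank instead of a hang.

```python
import threading


class Negotiator:
    """Hypothetical one-round handshake shared by all simulated ranks."""

    def __init__(self, world_size, timeout=1.0):
        self._barrier = threading.Barrier(world_size)
        self._timeout = timeout
        self._errors = {}
        self._lock = threading.Lock()

    def check_in(self, rank, error=None):
        # Record a local failure (e.g. an OOM) before meeting the peers.
        if error is not None:
            with self._lock:
                self._errors[rank] = error
        try:
            # Waiting with a timeout turns a missing peer into an error.
            self._barrier.wait(timeout=self._timeout)
        except threading.BrokenBarrierError:
            raise RuntimeError(
                f"rank {rank}: a peer did not reach negotiation"
            ) from None
        # Every rank that checked in sees the same set of reported errors.
        with self._lock:
            errors = dict(self._errors)
        if errors:
            raise RuntimeError(f"rank {rank}: peers reported errors: {errors}")


def run_ranks(world_size, failing_rank=None, silent_rank=None):
    """Simulate one negotiation round; returns each rank's outcome."""
    negotiator = Negotiator(world_size, timeout=0.5)
    results = {}

    def worker(rank):
        if rank == silent_rank:
            return  # simulates a process that shut down silently
        error = "CUDA error: out of memory" if rank == failing_rank else None
        try:
            negotiator.check_in(rank, error)
            results[rank] = "ok"  # safe to enter the collective op
        except RuntimeError as exc:
            results[rank] = f"error: {exc}"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ranks(3))                  # every rank proceeds
    print(run_ranks(3, failing_rank=1))  # every rank reports the OOM
    print(run_ranks(3, silent_rank=2))   # surviving ranks time out and raise
```

In a real multi-process setup the barrier would have to be replaced by a message exchange with a timeout over the actual transport; the point of the sketch is only that no rank enters the collective op until all peers have confirmed they are healthy.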