Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

NaN Numerical Error in Neighbor_Allreduce #26

Closed Bluefog-Lib closed 4 years ago

Bluefog-Lib commented 4 years ago

For example, running `bfrun -np 8 python pytorch_logistic_regression.py --method=exact_diffusion` produces NaN values.

kunyuan827 commented 4 years ago

The current conclusion is that the NaN error comes from a limitation of some algorithms, e.g., exact diffusion and EXTRA. These algorithms do not handle non-symmetric communication matrices well, and the power-two-ring topology yields a matrix that is doubly stochastic but non-symmetric.
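To make the "doubly stochastic but non-symmetric" point concrete, here is a minimal NumPy sketch. It does not call Bluefog; `power_two_ring_matrix` is a hypothetical helper that assumes each node keeps a self-loop and sends to the nodes at hop distances 1, 2, 4, ... with uniform weights, which may differ from the weights Bluefog actually uses.

```python
import numpy as np

def power_two_ring_matrix(n):
    """Hypothetical construction (not Bluefog's API): node i links to itself
    and to (i + 2**k) % n for every power of two below n, uniform weights."""
    W = np.zeros((n, n))
    for i in range(n):
        targets = {i}
        hop = 1
        while hop < n:
            targets.add((i + hop) % n)
            hop *= 2
        for j in targets:
            W[i, j] = 1.0 / len(targets)
    return W

W = power_two_ring_matrix(8)
rows_ok = np.allclose(W.sum(axis=1), 1.0)   # every row sums to 1
cols_ok = np.allclose(W.sum(axis=0), 1.0)   # every column sums to 1
symmetric = np.allclose(W, W.T)             # the links are one-directional
print(rows_ok, cols_ok, symmetric)          # True True False
```

Under these assumptions the matrix passes both stochasticity checks but fails the symmetry check, because node 0 sends to node 1 while node 1 does not send back to node 0.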

kunyuan827 commented 4 years ago

Exact diffusion and EXTRA rely on a communication matrix W that is symmetric and positive definite (to be more accurate, W >= -(1/3)I). Otherwise, they may diverge.

For example, we found scenarios where exact diffusion diverges when W is generated by power_two_ring (which is not symmetric) or by a mesh topology (which is not positive definite unless we use W_bar = (W + I)/2).
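The effect of W_bar = (W + I)/2 can be checked numerically. The sketch below uses a hypothetical 4-node symmetric ring with no self-loops as a stand-in (it is not the mesh matrix from the failing run) and `satisfies_bound` is an illustrative helper, not a Bluefog function. Averaging with the identity maps each eigenvalue lambda of W to (lambda + 1)/2, so eigenvalues in [-1, 1] move into [0, 1].

```python
import numpy as np

# Hypothetical symmetric, doubly stochastic matrix: 4-node ring,
# weight 1/2 to each neighbor and no self-loop.
W = np.array([
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
])

def min_eig(M):
    # eigvalsh is valid here because M is symmetric
    return np.linalg.eigvalsh(M).min()

def satisfies_bound(M, tol=1e-9):
    """Check symmetry and the condition W >= -(1/3) I from the comment above."""
    return np.allclose(M, M.T) and min_eig(M) >= -1.0 / 3.0 - tol

W_bar = (W + np.eye(4)) / 2   # the lazy version of W

print(satisfies_bound(W))      # False: smallest eigenvalue is -1
print(satisfies_bound(W_bar))  # True: smallest eigenvalue is 0
```

For this toy matrix the raw W violates the bound while W_bar satisfies it, which matches the observation that the mesh case only works with the (W + I)/2 correction.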

For a communication matrix W that is doubly stochastic but non-symmetric or not positive definite, we suggest using DGD, diffusion, or gradient tracking instead.
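As a rough illustration of that suggestion, the following NumPy sketch runs plain DGD on a toy scalar problem with a non-symmetric, doubly stochastic W. Everything here is an assumption made for illustration: the local objectives f_i(x) = 0.5 * (x - b_i)^2, the step size, and the uniform power-two-ring weights. It is not the logistic regression example from the issue and does not use Bluefog.

```python
import numpy as np

n = 8
offsets = [0, 1, 2, 4]              # self-loop plus power-of-two hops
W = np.zeros((n, n))
for i in range(n):
    for d in offsets:
        W[i, (i + d) % n] = 1.0 / len(offsets)   # non-symmetric, doubly stochastic

b = np.arange(n, dtype=float)       # local targets, f_i(x) = 0.5 * (x - b_i)**2
x = np.zeros(n)                     # one scalar iterate per node
alpha = 0.1                         # constant step size (illustrative choice)

for _ in range(500):
    grad = x - b                    # local gradients
    x = W @ x - alpha * grad        # DGD: combine with neighbors, then descend

print(np.isfinite(x).all())         # iterates stay finite
print(x.mean(), b.mean())           # network average matches the minimizer 3.5
```

Because W is column stochastic, the network average follows an ordinary gradient step and settles at the global minimizer, while the individual iterates keep the usual constant-step DGD bias around it.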