NaN Numerical Error in Neighbor_Allreduce

Bluefog-Lib commented 4 years ago

For example, running bfrun -np 8 python pytorch_logistic_regression.py --method=exact_diffusion produces NaN.

kunyuan827 commented 4 years ago

The current conclusion is that the NaN error comes from a limitation of certain algorithms, e.g., Exact diffusion and EXTRA. Neither supports non-symmetric communication matrices well, and the power-two-ring matrix is doubly stochastic but non-symmetric.
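To make the "doubly stochastic but non-symmetric" point concrete, here is a minimal sketch in plain Python. It assumes a power-two-ring-style matrix in which node i is linked to nodes i + 1, i + 2, i + 4, ... (mod n) with uniform weights; the helper names are hypothetical and the weights Bluefog actually generates may differ.

```python
def power_two_ring(n):
    """Hypothetical stand-in for a power-two-ring combination matrix.

    Assumption: node i is linked to i + 2**k (mod n) and to itself,
    all with the same weight. Not taken from the Bluefog source.
    """
    offsets = [0]
    step = 1
    while step < n:
        offsets.append(step)
        step *= 2
    weight = 1.0 / len(offsets)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for off in offsets:
            W[i][(i + off) % n] += weight
    return W


def is_doubly_stochastic(W, tol=1e-12):
    """True if every row and every column sums to 1."""
    n = len(W)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in W)
    cols_ok = all(abs(sum(W[i][j] for i in range(n)) - 1.0) < tol
                  for j in range(n))
    return rows_ok and cols_ok


def is_symmetric(W, tol=1e-12):
    """True if W equals its transpose."""
    n = len(W)
    return all(abs(W[i][j] - W[j][i]) < tol
               for i in range(n) for j in range(n))


W = power_two_ring(8)
print("doubly stochastic:", is_doubly_stochastic(W))
print("symmetric:", is_symmetric(W))
```

With 8 nodes, node 0 is linked to node 1 but node 1 is not linked back to node 0, which is what breaks symmetry while every row and column still sums to 1.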

kunyuan827 commented 4 years ago

Exact diffusion and EXTRA rely on a symmetric and positive definite communication matrix (more precisely, W >= -(1/3)I). Otherwise, they may diverge.

For example, we found scenarios where Exact diffusion diverges when W is generated by power_two_ring (which is not symmetric) or by the mesh topology (which is not positive definite unless we use W_bar = (W + I)/2).
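The effect of W_bar = (W + I)/2 can be illustrated with a small sketch. A 2D mesh is not circulant, so this example uses a symmetric 8-node ring with weight 1/3 on each node and its two neighbors as a stand-in (an assumption chosen for illustration, not the mesh matrix from this issue). For a circulant matrix the eigenvalues follow from the discrete Fourier transform of its first row, so no linear algebra library is needed.

```python
import cmath


def circulant_eigenvalues(first_row):
    """Eigenvalues of the circulant matrix with W[i][(i + m) % n] = first_row[m]."""
    n = len(first_row)
    return [
        sum(c * cmath.exp(2j * cmath.pi * m * k / n)
            for m, c in enumerate(first_row))
        for k in range(n)
    ]


n = 8

# Symmetric ring (illustrative stand-in): weight 1/3 on self and both neighbors.
ring = [0.0] * n
ring[0] = ring[1] = ring[-1] = 1.0 / 3.0

# W_bar = (W + I) / 2 is also circulant: halve the row, add 1/2 on the diagonal.
ring_bar = [c / 2.0 for c in ring]
ring_bar[0] += 0.5

# The matrices are symmetric, so the eigenvalues are real up to rounding.
eig_W = [ev.real for ev in circulant_eigenvalues(ring)]
eig_W_bar = [ev.real for ev in circulant_eigenvalues(ring_bar)]

min_eig_W = min(eig_W)
min_eig_W_bar = min(eig_W_bar)
print("min eigenvalue of W:    ", min_eig_W)
print("min eigenvalue of W_bar:", min_eig_W_bar)
```

Each eigenvalue lambda of W maps to (lambda + 1)/2 in W_bar. Since a symmetric doubly stochastic matrix has eigenvalues in [-1, 1], those of W_bar lie in [0, 1], so the negative eigenvalue of the stand-in ring becomes positive after the averaging step.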

For a doubly stochastic communication matrix W that is non-symmetric or not positive definite, we suggest using DGD, diffusion, or gradient tracking instead.
