Closed: Bluefog-Lib closed this 4 years ago
The current conclusion is that the NaN error comes from a limitation of some algorithms, namely Exact diffusion and EXTRA. Neither handles non-symmetric communication matrices well, and the power-two ring is a doubly stochastic but non-symmetric matrix.
Exact diffusion and EXTRA rely on the communication matrix W being symmetric and positive definite (more precisely, W >= -(1/3)I). If W violates either condition, Exact diffusion and EXTRA may diverge.
For example, we found scenarios where Exact diffusion diverges when W is generated by power_two_ring (which is not symmetric) or by the mesh topology (which is not positive definite unless we use W_bar = (W + I)/2).
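The two failure cases above can be checked numerically. The following is a minimal NumPy sketch, not Bluefog code: `power_two_ring_matrix`, `ring_matrix_no_self_loops`, and `safe_for_exact_diffusion` are hypothetical helpers written for illustration. It assumes the power-two ring links node i to itself and to nodes (i + 2^k) mod n with uniform weights, and it uses a 4-node ring without self-loops as a simple stand-in for a symmetric matrix that is not positive definite (it is not the actual mesh topology).

```python
import numpy as np


def power_two_ring_matrix(n):
    """Assumed construction: node i links to itself and to (i + 2**k) % n, uniform weights."""
    hops = [0]
    h = 1
    while h < n:
        hops.append(h)
        h *= 2
    W = np.zeros((n, n))
    for i in range(n):
        for hop in hops:
            W[i, (i + hop) % n] = 1.0 / len(hops)
    return W


def ring_matrix_no_self_loops(n):
    """Symmetric ring where each node averages its two neighbors only (illustrative stand-in)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i + 1) % n] = 0.5
        W[i, (i - 1) % n] = 0.5
    return W


def is_doubly_stochastic(W):
    """Nonnegative entries, every row and every column sums to 1."""
    return bool((W >= 0).all()
                and np.allclose(W.sum(axis=0), 1.0)
                and np.allclose(W.sum(axis=1), 1.0))


def is_symmetric(W):
    return bool(np.allclose(W, W.T))


def safe_for_exact_diffusion(W, tol=1e-9):
    """Check the condition stated in this thread: W symmetric and W >= -(1/3) I."""
    if not is_symmetric(W):
        return False
    # eigvalsh is only valid for symmetric matrices, which we have just verified.
    return bool(np.linalg.eigvalsh(W).min() >= -1.0 / 3.0 - tol)


W_p2r = power_two_ring_matrix(8)
W_ring = ring_matrix_no_self_loops(4)
W_bar = (W_ring + np.eye(4)) / 2  # the W_bar = (W + I)/2 fix mentioned above

for name, W in [("power-two ring", W_p2r), ("ring, no self-loops", W_ring), ("W_bar", W_bar)]:
    print(name,
          "| doubly stochastic:", is_doubly_stochastic(W),
          "| symmetric:", is_symmetric(W),
          "| safe:", safe_for_exact_diffusion(W))
```

Under these assumptions the power-two ring fails the symmetry check, the plain ring fails the eigenvalue bound, and averaging with the identity restores the bound for the symmetric case. Averaging with the identity does not make a non-symmetric matrix symmetric, so it does not help the power-two ring.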
For a doubly stochastic communication matrix W that is non-symmetric or not positive definite, we suggest using DGD, diffusion, or gradient tracking instead.
For example, running bfrun -np 8 python pytorch_logistic_regression.py --method=exact_diffusion produces NaN.