[x] New idea: check how much the prediction mu(Xb + b0) changed between iterations
I really like this idea. First, mu has to be tracked anyway because it is needed to compute the gradient, so we get mu^k and mu^{k+1} for free. Second, a first-order Taylor expansion of mu in eta = Xb + b0 gives ||mu^{k+1} - mu^k||_{W0}^2 ~ sum_i w0_i v^k_i (eta^{k+1}_i - eta^k_i)^2. This is like measuring the average prediction MSE under the IRLS weights.
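
A minimal sketch of how such a check could look, assuming a logistic mean function purely for illustration (the note does not say which family or link is used). The names `prediction_change`, `prediction_change_first_order`, `has_converged`, and the tolerance `tol` are hypothetical. The first-order version uses the squared derivative dmu/deta evaluated at eta^k; how that factor maps onto v^k_i depends on how v is defined, which the note leaves unstated.

```python
import numpy as np


def sigmoid(eta):
    """Logistic mean function (an assumption; the note does not fix the link)."""
    return 1.0 / (1.0 + np.exp(-eta))


def prediction_change(mu_old, mu_new, w0):
    """Exact weighted squared change ||mu_new - mu_old||_{W0}^2."""
    diff = mu_new - mu_old
    return float(np.sum(w0 * diff * diff))


def prediction_change_first_order(eta_old, eta_new, dmu_deta_old, w0):
    """First-order Taylor estimate: mu_new - mu_old ~ dmu/deta * (eta_new - eta_old)."""
    d_eta = eta_new - eta_old
    return float(np.sum(w0 * (dmu_deta_old * d_eta) ** 2))


def has_converged(mu_old, mu_new, w0, tol=1e-7):
    """Hypothetical stopping rule: weighted mean squared change in mu below tol."""
    return bool(prediction_change(mu_old, mu_new, w0) / np.sum(w0) <= tol)


# Synthetic data and two nearby iterates (all values are made up for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w0 = np.full(50, 1.0 / 50)

b_old, b0_old = np.array([0.5, -0.2, 0.1]), 0.3
b_new, b0_new = b_old + 1e-3 * np.array([1.0, -1.0, 2.0]), b0_old + 1e-3

eta_old = X @ b_old + b0_old
eta_new = X @ b_new + b0_new
mu_old, mu_new = sigmoid(eta_old), sigmoid(eta_new)

exact = prediction_change(mu_old, mu_new, w0)
# For the logistic mean, dmu/deta = mu * (1 - mu).
approx = prediction_change_first_order(eta_old, eta_new, mu_old * (1.0 - mu_old), w0)

print(f"exact change:       {exact:.3e}")
print(f"first-order change: {approx:.3e}")
```

Since mu^k and mu^{k+1} are already available from the gradient computation, the exact quantity costs one extra pass over the observations; the first-order form is only there to show what the check is measuring in terms of eta.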