Experiment with convergence criterion

[x] glmnet defines convergence using 2 metrics: https://glmnet.stanford.edu/articles/glmnet.html#appendix-0-convergence-criteria
[x] New idea: check how much the prediction mu(Xb + b0) changes between iterations
- I really like this idea. First, mu has to be tracked anyway because it is needed to compute the gradient, so we get mu^k and mu^{k+1} for free. Second, ||mu^{k+1} - mu^k||_{W0}^2 ~ sum_i w0_i v^k_i (eta^{k+1}_i - eta^k_i)^2 by a Taylor expansion of mu. This is like measuring the average prediction MSE under the IRLS weights.
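
A rough sketch of the mu-based check, to make the idea concrete. This is not adelie code: `mu_change` and `fit_logistic` are hypothetical names, and the toy solver is an unpenalized one-feature logistic regression fit by plain Newton/IRLS steps, chosen only because it is small. The point is that the `mu` used for the gradient is reused for the stopping test, so the check costs one extra pass over the predictions.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def mu_change(w0, mu_old, mu_new):
    """Weighted squared change in predictions: sum_i w0_i (mu_new_i - mu_old_i)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(w0, mu_new, mu_old))

def fit_logistic(x, y, w0, tol=1e-12, max_iters=100):
    """Toy 1-feature logistic fit (hypothetical helper, illustration only)."""
    b0, b = 0.0, 0.0
    mu = [sigmoid(b0 + b * xi) for xi in x]
    delta = float("inf")
    for it in range(1, max_iters + 1):
        # Gradient of the weighted log-likelihood; this is where mu^k is needed.
        r = [w * (yi - m) for w, yi, m in zip(w0, y, mu)]
        g0 = sum(r)
        g1 = sum(ri * xi for ri, xi in zip(r, x))
        # IRLS weights w0_i * v_i, with v_i = mu_i (1 - mu_i) for the logistic model.
        h = [w * m * (1.0 - m) for w, m in zip(w0, mu)]
        h00 = sum(h)
        h01 = sum(hi * xi for hi, xi in zip(h, x))
        h11 = sum(hi * xi * xi for hi, xi in zip(h, x))
        # Newton step: solve the 2x2 system H d = g by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
        # mu^{k+1} is needed for the next gradient anyway, so reuse it here.
        mu_new = [sigmoid(b0 + b * xi) for xi in x]
        delta = mu_change(w0, mu, mu_new)
        mu = mu_new
        if delta < tol:
            return b0, b, it, delta
    return b0, b, max_iters, delta

# Small, non-separable toy data with uniform observation weights.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
w0 = [1.0 / len(x)] * len(x)
b0, b, iters, delta = fit_logistic(x, y, w0)
print(f"b0={b0:.4f} b={b:.4f} iters={iters} mu_change={delta:.3e}")
```

The tolerance `tol=1e-12` is an arbitrary choice for the toy problem; a real threshold would need the same kind of scaling that glmnet applies to its own criteria.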

JamesYang007 / adelie