[x] Since `X` is completely generalized, we have to store the dot-product into `dresid` and then subtract it from `resid`. I don't think there's any further optimization we can do here. It's the cost of the flexibility.
[x] glmnet absorbs the weights into `X` and creates a copy. Again, not something we can do in general for all matrices.
[x] Currently, adelie doesn't optimize the gradient computation. Specifically, the gradient only needs to be valid on the non-screen group coefficients; similarly, `abs_grad` only needs to be valid outside the screen set. This may help when the screen set is large. The two screening rules must be amended accordingly. This doesn't seem to be the bottleneck in any of the cases I care about.
[x] The ordering of the active set is possibly different at each call. Shouldn't matter too much? No, it matters a lot when correlation is high. THIS WAS THE CULPRIT!!!
[x] Revisit the tolerance for the `gaussian_pin` call in the glm and gaussian solvers.
[x] Benchmarking makes it clear that the residual update is the bottleneck! We should just update it directly; change all the `btmul` and `ctmul` interfaces.
[x] `lasso_active_called` in the pin methods is always `false` if `lmda_path` has size 1, which is all the time! We should just set it to `true`. This doesn't seem to be the root cause of the differences.
[x] Benchmarking revealed that always adding at least one variable under the pivot rule is not the right thing to do! If no new active variables entered, then no screen variables should be added.
[x] Check benchmark case Gaussian n=100 vs n=1000 at rho=0 and p=10^4.
[x] Check benchmark case Binomial rho=0 vs 0.95 at n=1000 and p=500.
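The residual-update pattern from the first item above can be sketched as follows. This is a minimal numpy sketch under my own assumptions, not adelie's actual internals; `update_residual` is a hypothetical helper, and the real `X` is an abstract matrix object rather than an array:

```python
import numpy as np

def update_residual(X, resid, beta_diff):
    """Hypothetical sketch of the two-step residual update: because X is
    fully generalized, the dot-product is first materialized into a
    temporary `dresid`, then subtracted from `resid` in a second pass.
    A fused in-place update would avoid the temporary, but the generic
    matrix interface forces this split."""
    dresid = X @ beta_diff   # extra buffer forced by the generic interface
    resid -= dresid          # second pass over the residual
    return resid
```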
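The glmnet weight-absorption trick from the second item can be sketched like this, assuming a dense `X` (the helper name is mine); it shows why a copy is required and why the trick doesn't transfer to arbitrary matrix objects:

```python
import numpy as np

def absorb_weights(X, w):
    """Hypothetical sketch of glmnet-style weight absorption: scale each
    row of a *dense* X by sqrt(w) so that weighted inner products become
    plain inner products, i.e. Xw.T @ Xw == X.T @ diag(w) @ X.  This
    materializes a copy of X, which is only possible when X is an
    explicit array rather than a general matrix object."""
    return np.sqrt(w)[:, None] * X
```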
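The gradient-restriction idea from the third item could look roughly like this. A toy sketch with a dense `X` and a flat coefficient index set; the function name and index layout are assumptions, and adelie's real screen sets are over groups:

```python
import numpy as np

def gradient_outside_screen(X, resid, screen_set, p):
    """Hypothetical sketch: compute the gradient only for coefficients
    outside the screen set, since screened coefficients are handled by
    the inner solver anyway.  When the screen set is large, this skips
    most of the X^T r work."""
    mask = np.ones(p, dtype=bool)
    mask[screen_set] = False
    non_screen = np.flatnonzero(mask)
    grad = np.zeros(p)
    # Only the non-screen columns participate in the matrix-vector product.
    grad[non_screen] = X[:, non_screen].T @ resid
    return grad
```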
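One way to address the active-set ordering issue from the fourth item is to canonicalize the order before each sweep; a trivial sketch under my own assumptions (the helper is hypothetical, and sorting is only one possible canonical order):

```python
def canonical_active_order(active_set):
    """Hypothetical fix sketch: sort the active set so the coordinate
    update order is deterministic across calls.  Under high correlation,
    the sweep order determines which coordinate absorbs shared signal
    first, so a varying order can change the iterate path substantially."""
    return sorted(active_set)
```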