bwlewis / GLM

Notes on generalized linear models
http://bwlewis.github.io/GLM
111 stars 16 forks source link

All glm algo's much slower than glm.fit - problem with starting values? #4

Open tomwenseleers opened 5 years ago

tomwenseleers commented 5 years ago

I noticed that all algo's listed are significantly slower than the inbuilt glm.fit. Why is this? Does glm.fit use better starting values perhaps? Or is this due to the use of .Call(C_Cdqrls...) in glm.fit as opposed to the use of solve() in the irls code?

bwlewis commented 5 years ago

Yes, it's funny you brought this up just now. I have a big revision of these notes almost ready that has been lingering around for ages.

As you point out the starting values are a big issue, also related to convergence problems in some cases.

I expect the updated notes to get posted soon, perhaps today...

On 8/8/18, Tom Wenseleers notifications@github.com wrote:

I noticed that all algo's listed are significantly slower than the inbuilt glm.fit. Why is this? Does glm.fit use better starting values perhaps? Or is this due to the use of .Call(C_Cdqrls as opposed to the use of solve() in the glm.fit code?

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/bwlewis/GLM/issues/4

tomwenseleers commented 5 years ago

Many thanks for this - I found your notes to be really helpful!

Interestingly though, for a very small problem I was testing here now (3 variables, 1000 cases, Poisson error) I found the irlsfunction still to be about 4x slower than the inbuilt glm.fit one, even if I specify identical sensible starting values for both. I thought this speed difference could come about from a difference in speed of x = solve(crossprod(A,W*A), crossprod(A,W*z), tol=2*.Machine$double.eps)in the irlsfunction vs. x = :.Call(C_Cdqrls, A * W, z * W, min(1e-7, control$epsilon/1000), check=FALSE) in glm.fit, but not sure (I'm running Microsoft Open R 3.5.0 on Windows with Intel MKL optimized BLAS loaded)... So although differences in the choice of starting values may explain some of the discrepancy in timings, it's probably still not the whole story...

Looking forward to your update!

bwlewis commented 5 years ago

glm.svd.r.gz

I don't have time to finish the new write up today, but try this new code attached (as a gzipped file, that's all that GitHub would accept somehow). It's a very nearly complete new reference implementation using the SVD approach. On modern systems linked to a decent BLAS library it should be faster than glm.fit for large problems. For example, on my quad-core AMD home PC I ran this quick and dirty test:

source("glm.svd.r")
x = matrix(rnorm(10000 * 1000), 10000)
y = sample(c(0,1), 10000, replace=TRUE)

system.time({m=glm.fit(x,y,family=binomial())})
#   user  system elapsed
# 40.588   0.120  40.687

system.time({m1=glm.svd(x,y,family=binomial())})
#   user  system elapsed
# 17.724   0.968   5.335

This code will be posted to the repository as soon as I can finish the new write up, which is quite involved...

tomwenseleers commented 5 years ago

Ha that's great - many thanks for that! I see you use the same initialization strategy now as in R's implementation, using eval(family$initialize)! On my 4 core laptop with Intel MKL loaded this is indeed 6 times faster than glm.fit (finishing in 4s vs 25s for glm.fit)! That's really great - thanks a lot!

tomwenseleers commented 1 year ago

Might be worth changing the irls function that you lay out to something like this, i.e. with correct initialization :

irls = function(X, y, weights=rep(1,nrow(X)), family=poisson(log), maxit=25, tol=1e-16) {
    if (!is(family, "family")) family = family()
    variance = family$variance
    linkinv = family$linkinv
    mu.eta = family$mu.eta
    etastart = NULL

    nobs = nrow(X)    # needed by the initialize expression below
    nvars = ncol(X)   # needed by the initialize expression below
    eval(family$initialize) # initializes n and fitted values mustart
    eta = family$linkfun(mustart) # we then initialize eta with this
    dev.resids = family$dev.resids
    dev = sum(dev.resids(y, linkinv(eta), weights))
    devold = 0
    beta_old = rep(1, nvars)

    for(j in 1:maxit)
    {
      mu = linkinv(eta) 
      varg = variance(mu)
      gprime = mu.eta(eta)
      z = eta + (y - mu) / gprime # potentially -offset if you would have an offset argument as well
      W = weights * as.vector(gprime^2 / varg)
      beta = solve(crossprod(X,W*X), crossprod(X,W*z), tol=2*.Machine$double.eps)
      eta = X %*% beta # potentially +offset if you would have an offset argument as well
      dev = sum(dev.resids(y, mu, weights))
      if (abs(dev - devold) / (0.1 + abs(dev)) < tol) break
      # or if (norm(beta-beta_old, "2") < tol) break
      devold = dev
      beta_old = beta
    }
    list(coefficients=beta, iterations=j)
}