Serialising GLM models.

ablaom commented 2 years ago

It seems that GLM models store the data that they are trained on, which I did not expect:

using GLM
using Distributions

# Synthetic data: an N×2 design matrix and N Boolean responses
data(N) = (rand(N, 2), rand(Bool, N))

Θ = GLM.glm(data(100)..., Distributions.Bernoulli(), GLM.LogitLink());
Base.summarysize(Θ) # 8968 bytes

Θ = GLM.glm(data(10000)..., Distributions.Bernoulli(), GLM.LogitLink());
Base.summarysize(Θ) # 800968 bytes: the footprint grows with N

What then is the protocol for serialising GLM models trained on large datasets?

palday commented 2 years ago

It is pretty standard in statistical computing for model objects/structs to contain the associated model matrices, at least for classical "statistical" models (as opposed to "machine learning" models; see also Breiman's "Two Cultures" for a similar distinction). There are various practical and historical reasons for this, but the most obvious is that many of the quantities you would typically inspect in a statistical context can't be computed without the model matrices. If you just want the coefficients of a GLM, which are all the parameters being estimated, then there's coef. I suppose you could write a wrapper that computes and stores all the quantities that you care about, maybe something like GLMFit or GLMSummary that stores the link function, distribution/family, log likelihood, deviance, coefficients, etc. Depending on whether we (= all the JuliaStats maintainers) could agree on a shared set of "summary" values, we could even see about adding that to GLM.jl.
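A minimal sketch of what such a wrapper might look like, reusing the data helper from the opening post. The names GLMSummary and summarise are hypothetical and not part of GLM.jl; the distribution and link are passed in by hand rather than read back from the fitted model, on the assumption that the caller already knows them from the fit call.

```julia
using GLM
using Distributions

# Hypothetical container, NOT part of GLM.jl: keeps only scalar summaries and
# the coefficient vector, so its size does not depend on the number of rows.
struct GLMSummary{D,L}
    dist::D                  # distribution/family used for the fit
    link::L                  # link function used for the fit
    coefs::Vector{Float64}   # estimated coefficients
    loglik::Float64          # log likelihood at the estimate
    dev::Float64             # deviance at the estimate
end

# Hypothetical helper: the caller supplies the same dist and link used in glm.
function summarise(model, dist, link)
    GLMSummary(dist, link,
               copy(coef(model)),
               Float64(loglikelihood(model)),
               Float64(deviance(model)))
end

# Same synthetic data as in the opening post.
data(N) = (rand(N, 2), rand(Bool, N))

X, y = data(100)
model = glm(X, y, Bernoulli(), LogitLink())
s = summarise(model, Bernoulli(), LogitLink())

# Compare the memory footprint of the summary with that of the full model.
println(Base.summarysize(s), " vs ", Base.summarysize(model))
```

Anything that needs the model matrices (standard errors, residuals and so on) would have to be computed before the model is discarded and stored as an extra field.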

Looking at the linked issue: you could also call empty! on the fields holding the model matrices. For predictions with predict, that shouldn't make a difference since matrices are allocated for that anyway and you only need the coefficient vector and information about the link.
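To illustrate why the coefficient vector and the link are enough for prediction, here is a rough sketch that uses only the Serialization standard library. LogitCoefs, logistic and predict_mean are hypothetical names introduced for this example, the coefficients are made up, and this is not how GLM.jl's own predict is implemented.

```julia
using Serialization

# Hypothetical minimal record: just the coefficients of a logit-link GLM.
struct LogitCoefs
    beta::Vector{Float64}
end

# Inverse of the logit link.
logistic(eta) = 1 / (1 + exp(-eta))

# Predicted mean response for each row of Xnew: inverse link of X * beta.
predict_mean(m::LogitCoefs, Xnew::AbstractMatrix) = logistic.(Xnew * m.beta)

# Made-up coefficients, for illustration only.
m = LogitCoefs([0.5, -1.0])

# Round-trip through an in-memory buffer; a file path would work the same way.
io = IOBuffer()
serialize(io, m)
seekstart(io)
m2 = deserialize(io)

Xnew = [0.0 0.0; 1.0 2.0]
p = predict_mean(m2, Xnew)
```

The serialised object holds only the coefficient vector, so its size is independent of the number of training rows.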

ablaom commented 2 years ago

@palday Thanks for that lucid clarification, and for looking into our use-case.

> Looking at the linked issue: you could also call empty! on the fields holding the model matrices. For predictions with predict, that shouldn't make a difference since matrices are allocated for that anyway and you only need the coefficient vector and information about the link.

That sounds like an excellent way to proceed, thanks.

JuliaStats / GLM.jl

Serialising GLM models. #465