Model objects/structs containing the associated model matrices are pretty standard in statistical computing, at least for classical "statistical" (as opposed to "machine learning") models; see Breiman's Two Cultures for a similar distinction. There are both practical and historical reasons for this, but the most obvious is that many of the quantities you would typically inspect in a statistical context can't be computed without the model matrices. If you just want the coefficients -- which are all the parameters being estimated -- for a GLM, then there's `coef`. I suppose you could write a wrapper that computes and stores all the quantities you care about, maybe something like `GLMFit` or `GLMSummary`, that stores the link function, distribution/family, log likelihood, deviance, coefficients, etc. Depending on whether we (= all the JuliaStats maintainers) could agree on a shared set of "summary" values, we could even see about adding that to GLM.jl.
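A minimal sketch of what such a wrapper could look like; the `GLMSummary` name, its fields, and the three-argument constructor below are purely illustrative and not part of GLM.jl:

```julia
# Hypothetical `GLMSummary`: keeps only fit summaries, not the model matrices.
using GLM, DataFrames

struct GLMSummary{D,L}
    coefs::Vector{Float64}   # estimated coefficients
    family::D                # distribution/family used in the fit
    link::L                  # link function used in the fit
    loglik::Float64          # log likelihood of the fit
    dev::Float64             # deviance of the fit
end

# Build the summary from a fitted model; family and link are passed in explicitly,
# since recovering them from the fitted object would rely on GLM.jl internals.
GLMSummary(m, family, link) =
    GLMSummary(coef(m), family, link, loglikelihood(m), deviance(m))

df   = DataFrame(x = randn(100), y = rand([0, 1], 100))
m    = glm(@formula(y ~ x), df, Binomial(), LogitLink())
smry = GLMSummary(m, Binomial(), LogitLink())
```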
Looking at the linked issue: you could also call `empty!` on the fields holding the model matrices. For predictions with `predict`, that shouldn't make a difference, since matrices are allocated for that anyway and you only need the coefficient vector and information about the link.
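For example, predicting on new data only needs the new covariates plus the fitted coefficients and link; a small sketch with made-up data:

```julia
# `predict` builds what it needs from the new covariates, so it does not depend on
# the training matrices still being stored in the fitted object.
using GLM, DataFrames

train = DataFrame(x = randn(1_000), y = rand([0, 1], 1_000))
m     = glm(@formula(y ~ x), train, Binomial(), LogitLink())

newdata = DataFrame(x = randn(5))
predict(m, newdata)   # fitted probabilities for the five new rows
```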
@palday Thanks for that lucid clarification, and for looking into our use-case.
> Looking at the linked issue: you could also call `empty!` on the fields holding the model matrices. For predictions with `predict`, that shouldn't make a difference, since matrices are allocated for that anyway and you only need the coefficient vector and information about the link.
That sounds like an excellent way to proceed, thanks.
It seems that GLM models store the data that they are trained on, which I did not expect. What, then, is the protocol for serialising GLM models trained on large datasets?
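A minimal sketch of the observation above, using the Serialization standard library; the file name and data sizes are illustrative only:

```julia
# Serialising the fitted model writes out the stored training data as well,
# so the file grows with the number of training rows, not just the coefficients.
using GLM, DataFrames, Serialization

df = DataFrame(x = randn(1_000_000), y = randn(1_000_000))
m  = lm(@formula(y ~ x), df)

open("model.jls", "w") do io
    serialize(io, m)
end
filesize("model.jls")   # on the order of the training data, not the coefficient vector
```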