JuliaStats / GLM.jl

Generalized linear models in Julia

DataFrames dependency error #472

Closed compleathorseplayer closed 2 years ago

compleathorseplayer commented 2 years ago

The following has come up in my own work and in the classes that I teach.

Whenever the GLM library is used, DataFrames is required, yet GLM can be loaded with 'import' or 'using' without any warning or error message, even if DataFrames is not present.

The routines will not run until DataFrames is loaded, but the error messages do not say that the problem is this unmet dependency. (It took me a long time to figure out why the same code worked for me once and not later.)

ararslan commented 2 years ago

Hi @compleathorseplayer, sorry to hear you're running into trouble!

Note though that the DataFrames library is not actually required for using GLM; in the "fitting GLM models" section of the documentation, it states that any structure that is compatible with the interface specified by the Tables library can be used. This includes things like vectors of named tuples, which can be constructed without any other package dependencies. It also includes DataFrames, which are available when the DataFrames library is installed and loaded.
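
As a minimal sketch of the row-table case (the data values and the names `rows` and `model` are made up for illustration), a vector of named tuples should be accepted with only GLM loaded:

```julia
using GLM

# A vector of named tuples is a Tables.jl-compatible row table;
# no DataFrames import is involved. The values are illustrative only.
rows = [
    (y = 1.1, x = 1.0),
    (y = 1.9, x = 2.0),
    (y = 3.2, x = 3.0),
    (y = 3.9, x = 4.0),
    (y = 5.1, x = 5.0),
]

# Fit a simple linear model using the formula interface.
model = lm(@formula(y ~ x), rows)

# Coefficients are ordered as (intercept, slope).
coefs = coef(model)
```

If the same call is made with a `DataFrame` as the second argument, DataFrames has to be installed and loaded first, since that type comes from the DataFrames package and not from GLM.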

Do you have an example of some code that isn't working as expected? I think that would help pinpoint the issue.

compleathorseplayer commented 2 years ago

Thanks. It seems to have to do with @formula() - the lm(x,y) syntax seems to work regardless.
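
For reference, a minimal sketch of what I mean by the lm(x,y) syntax, assuming x is a design matrix and y a response vector (the numbers and the names `X`, `y`, and `model` are invented for illustration):

```julia
using GLM

# Matrix interface: no formula and no table, so no Tables.jl-style input is needed.
# The intercept column has to be supplied explicitly in the design matrix.
x = collect(1.0:10.0)
X = hcat(ones(10), x)

# Noise-free response, so the fit should recover intercept 2 and slope 3.
y = 2 .+ 3 .* x

model = lm(X, y)
coefs = coef(model)
```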

ararslan commented 2 years ago

> It seems to have to do with @formula()

Hm, interesting. It'd be good to see an example of some code you have that doesn't work if you can provide one. In the meantime, @kleinschmidt, have you seen anything like this before?

kleinschmidt commented 2 years ago

I can't reproduce (see below for a working example with JUST GLM). My hunch is that StatsModels/@formula needs some kind of Tables.jl table (e.g., a named tuple of vectors like below), but it doesn't have to be a dataframe. If you want to provide input as a DataFrame, you need DataFrames.jl as a dependency.

julia> using Pkg; Pkg.add("GLM")

julia> using GLM

julia> my_table = (; y=rand(10), x1=rand(10), x2=rand('a':'b', 10))
(y = [0.5682488636362382, 0.17197789596062807, 0.3506216334793084, 0.8072853497852225, 0.5012640861462717, 0.8900214619075134, 0.5315620660933361, 0.3094426146385296, 0.20359501557647441, 0.3161968669038068], x1 = [0.12550017823736537, 0.14895836178426625, 0.7314434538141096, 0.5441900453308146, 0.4189847366481383, 0.28566682522788844, 0.3849599719979039, 0.8120194120664842, 0.9961705212901149, 0.9210058138271773], x2 = ['a', 'b', 'a', 'a', 'a', 'a', 'b', 'a', 'a', 'a'])

julia> lm(@formula(y ~ x1 * x2), my_table)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2 + x1 & x2

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.855467     0.14322   5.97    0.0010   0.505021   1.20591
x1           -0.599188     0.21338  -2.81    0.0308  -1.12131   -0.077066
x2: b        -0.91045      0.33984  -2.68    0.0366  -1.74201   -0.0788915
x1 & x2: b    2.12284      1.07723   1.97    0.0963  -0.513047   4.75872
──────────────────────────────────────────────────────────────────────────

compleathorseplayer commented 2 years ago

OK, thanks all. I am responding to student queries that were resolved by loading DataFrames. I am sorry I don't have the specific example, which came up for me a couple of weeks ago. Perhaps the only issue was that the error messages did not mention the unmet dependency on Tables or DataFrames. Thanks.

nalimilan commented 2 years ago

Closing then, feel free to reopen if you have a specific case.