JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia

StatsModels.jl >v0.7 changes order of parameters #286

Closed sindresops closed 1 year ago

sindresops commented 1 year ago

My code broke because StatsModels.jl changed the default ordering of the terms in a FormulaTerm: (Intercept) used to come first, and now it's last. I depended on GLM.coef() returning the regression coefficients in that specific order. The new change is good, since now the higher-order terms come first. But just an FYI in case anyone else experiences the same.

Edit 1: (Added MWE)

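```julia
using Pkg
using DataFrames
Pkg.add(name="StatsModels", version="0.6.33")  # last release before 0.7
using GLM

x = collect(1:10)
y = 2x .+ randn(length(x))
# No trailing semicolon, so the REPL prints the coefficient table.
lm(@formula(y ~ x + 1), DataFrame(x=x, y=y))
```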

Returns image

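```julia
using Pkg
using DataFrames
Pkg.add(name="StatsModels", version="0.7")  # release with the new ordering
using GLM

x = collect(1:10)
y = 2x .+ randn(length(x))
# No trailing semicolon, so the REPL prints the coefficient table.
lm(@formula(y ~ x + 1), DataFrame(x=x, y=y))
```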

Returns image

Other info: Julia version 1.8.0, GLM v1.8.2
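
For anyone whose code breaks the same way, here is a minimal sketch (not part of the original report) of looking coefficients up by name instead of by position, so the result should not depend on the order of the terms. It assumes DataFrames and GLM are installed and relies on `coefnames` and `coef`, which GLM exports. The variable names `model`, `coefs`, `intercept`, and `slope` are illustrative only.

```julia
using DataFrames, GLM

x = collect(1:10)
y = 2x .+ randn(length(x))
model = lm(@formula(y ~ x + 1), DataFrame(x=x, y=y))

# Pair each coefficient with its name, so lookups do not rely on
# the position of (Intercept) in the coefficient vector.
coefs = Dict(zip(coefnames(model), coef(model)))

intercept = coefs["(Intercept)"]
slope = coefs["x"]
```

With this pattern, `coefs["x"]` refers to the slope whether the intercept is listed first or last.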

kleinschmidt commented 1 year ago

Do you have an example you can share?

sindresops commented 1 year ago

Apologies. I was being lazy. I have updated the original post with a minimum working example.

kleinschmidt commented 1 year ago

Ah wow that's a surprise to me! IMO the earlier behavior was a bug, since we (generally) sort terms by degree even in pre-0.7. In fact, I can't reproduce your example on my machine:

Project setup

```
activate --temp
  Activating new project at `/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_ipptqs`

(jl_ipptqs) pkg> add DataFrames, GLM, StatsModels@0.6.33
    Updating registry at `~/.julia/registries/Beacon`
    Updating git-repo `https://github.com/beacon-biosignals/BeaconRegistry.git`
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed FillArrays ──── v1.0.0
   Installed Distributions ─ v0.25.87
    Updating `/private/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_ipptqs/Project.toml`
  [a93c6f00] + DataFrames v1.5.0
  [38e38edf] + GLM v1.8.2
⌃ [3eaba693] + StatsModels v0.6.33
    Updating `/private/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_ipptqs/Manifest.toml`
  [49dc2e85] + Calculus v0.5.1
  [d360d2e6] + ChainRulesCore v1.15.7
  [9e997f8a] + ChangesOfVariables v0.1.6
  [34da2185] + Compat v4.6.1
  [a8cc5b0e] + Crayons v4.1.1
  [9a962f9c] + DataAPI v1.14.0
  [a93c6f00] + DataFrames v1.5.0
  [864edb3b] + DataStructures v0.18.13
  [e2d170a0] + DataValueInterfaces v1.0.0
  [b429d917] + DensityInterface v0.4.0
  [31c24e10] + Distributions v0.25.87
  [ffbed154] + DocStringExtensions v0.9.3
  [fa6b7ba4] + DualNumbers v0.6.8
  [1a297f60] + FillArrays v1.0.0
  [59287772] + Formatting v0.4.2
  [38e38edf] + GLM v1.8.2
  [34004b35] + HypergeometricFunctions v0.3.14
  [842dd82b] + InlineStrings v1.4.0
  [3587e190] + InverseFunctions v0.1.8
  [41ab1584] + InvertedIndices v1.3.0
  [92d709cd] + IrrationalConstants v0.2.2
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [692b3bcd] + JLLWrappers v1.4.1
  [b964fa9f] + LaTeXStrings v1.3.0
  [2ab3a3ac] + LogExpFunctions v0.3.23
  [e1d29d7a] + Missings v1.1.0
  [77ba4419] + NaNMath v1.0.2
  [bac558e1] + OrderedCollections v1.6.0
  [90014a1f] + PDMats v0.11.17
  [69de0a69] + Parsers v2.5.8
  [2dfb63ee] + PooledArrays v1.4.2
  [21216c6a] + Preferences v1.3.0
  [08abe8d2] + PrettyTables v2.2.3
  [1fd47b50] + QuadGK v2.8.2
  [189a3867] + Reexport v1.2.2
  [79098fc4] + Rmath v0.7.1
  [91c51154] + SentinelArrays v1.3.18
  [1277b4bf] + ShiftedArrays v2.0.0
  [66db9d55] + SnoopPrecompile v1.0.3
  [a2af1166] + SortingAlgorithms v1.1.0
  [276daf66] + SpecialFunctions v2.2.0
  [82ae8749] + StatsAPI v1.6.0
  [2913bbd2] + StatsBase v0.33.21
  [4c63d2b9] + StatsFuns v1.3.0
⌃ [3eaba693] + StatsModels v0.6.33
  [892a3eda] + StringManipulation v0.3.0
  [3783bdb8] + TableTraits v1.0.1
  [bd369af6] + Tables v1.10.1
  [efe28fd5] + OpenSpecFun_jll v0.5.5+0
  [f50d1b31] + Rmath_jll v0.4.0+0
  [0dad84c5] + ArgTools v1.1.1
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads v1.6.0
  [7b1f6079] + FileWatching
  [9fa8497b] + Future
  [b77e0a4c] + InteractiveUtils
  [b27032c2] + LibCURL v0.6.3
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [ca575930] + NetworkOptions v1.2.0
  [44cfe95a] + Pkg v1.8.0
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA v0.7.0
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [4607b0f0] + SuiteSparse
  [fa267f1f] + TOML v1.0.0
  [a4e569a6] + Tar v1.10.1
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll v1.0.1+0
  [deac9b47] + LibCURL_jll v7.84.0+0
  [29816b5a] + LibSSH2_jll v1.10.2+0
  [c8ffd9c3] + MbedTLS_jll v2.28.0+0
  [14a3606d] + MozillaCACerts_jll v2022.2.1
  [4536629a] + OpenBLAS_jll v0.3.20+0
  [05823500] + OpenLibm_jll v0.8.1+0
  [83775a58] + Zlib_jll v1.2.12+3
  [8e850b90] + libblastrampoline_jll v5.1.1+0
  [8e850ede] + nghttp2_jll v1.48.0+0
  [3f19e933] + p7zip_jll v17.4.0+0
        Info Packages marked with ⌃ have new versions available and may be upgradable.
Precompiling project...
  3 dependencies successfully precompiled in 7 seconds. 54 already precompiled.
```

```
julia> using Pkg; using DataFrames; Pkg.add(name="StatsModels",version="0.6.33"); using GLM; x = collect(1:10); y = 2x .+ randn(length(x)); lm(@formula(y~x+1),DataFrame(x=x,y=y))
   Resolving package versions...
  No Changes to `/private/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_ipptqs/Project.toml`
  No Changes to `/private/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_ipptqs/Manifest.toml`
[ Info: Precompiling GLM [38e38edf-8417-5370-95a0-9cbb8c7f171a]
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.259833    0.85121   -0.31    0.7680   -2.22273    1.70306
x             2.02703     0.137185  14.78    <1e-06    1.71068    2.34338
─────────────────────────────────────────────────────────────────────────

(jl_ipptqs) pkg> st
Status `/private/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_ipptqs/Project.toml`
  [a93c6f00] DataFrames v1.5.0
  [38e38edf] GLM v1.8.2
⌃ [3eaba693] StatsModels v0.6.33
Info Packages marked with ⌃ have new versions available and may be upgradable.
```

kleinschmidt commented 1 year ago

Ah wow wait, never mind, I misread the report. This is definitely a bug!

kleinschmidt commented 1 year ago

I think the issue is that the constant term is incorrectly assigned the same degree as x. Sorting works correctly with an interaction term like this:

```
julia> f2 = @formula(y ~ x & z + x + 1)
FormulaTerm
Response:
  y(unknown)
Predictors:
  x(unknown)
  1
  x(unknown) & z(unknown)

(jl_QKrdLJ) pkg> st
Status `/private/var/folders/kg/y0c0ksr56_g800hjvcpx_d4w0000gp/T/jl_QKrdLJ/Project.toml`
  [a93c6f00] DataFrames v1.5.0
  [38e38edf] GLM v1.8.2
  [3eaba693] StatsModels v0.7.0
```

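
The hypothesis above can be illustrated with a toy sketch. `ToyTerm` and its `degree` field are hypothetical stand-ins, not StatsModels types, and the use of a stable sort on degree is an assumption about how terms get ordered, not a description of the package's implementation. If the constant wrongly shares degree 1 with `x`, a stable sort leaves the two in the order they were written, which would match the `x`, `1`, `x & z` output shown above.

```julia
# Hypothetical stand-in for a formula term: a label plus a degree.
struct ToyTerm
    name::String
    degree::Int
end

# Stable sort by degree; ties keep the order in which they were written.
sort_terms(terms) = sort(terms; by = t -> t.degree, alg = MergeSort)

# Intended degrees: constant = 0, main effect = 1, interaction = 2.
correct = [ToyTerm("x & z", 2), ToyTerm("x", 1), ToyTerm("1", 0)]
names_ok = [t.name for t in sort_terms(correct)]
# names_ok == ["1", "x", "x & z"]

# Suspected bug: the constant gets the same degree as x.
buggy = [ToyTerm("x & z", 2), ToyTerm("x", 1), ToyTerm("1", 1)]
names_buggy = [t.name for t in sort_terms(buggy)]
# names_buggy == ["x", "1", "x & z"]
```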
kleinschmidt commented 1 year ago

@sindresops this is fixed on master now and will be released as StatsModels 0.7.1: https://github.com/JuliaRegistries/General/pull/81005

Thanks for the report!

sindresops commented 1 year ago

Thanks for patching!