JuliaStats / GLM.jl

Generalized linear models in Julia
Other
584 stars 114 forks source link

Negative r2 for linear model with no intercept #481

Closed mousum-github closed 2 years ago

mousum-github commented 2 years ago

The Julia lm produces negative r2 (-0.15702479338842856) for the following: data = DataFrame(x = 60:70, y = 130:140) mdl = lm(@formula(y ~ 0 + x), data) r2(mdl)

The certified value of r2 = 0.999365492298663 https://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/NoInt1.dat

While investigating the reason, we found that the nulldeviance calculation is different. Ideally, if the model has the intercept term, then the nulldeviance = y’y – n * mean(y)^2 and if the model does not have the intercept, then the nulldeviance = y’y.

In this PR, we have attempted to correct the nulldeviance calculation.

The following is the summary of the changes:

codecov-commenter commented 2 years ago

Codecov Report

Merging #481 (7c52e8d) into master (42a0d04) will increase coverage by 2.00%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #481      +/-   ##
==========================================
+ Coverage   85.12%   87.13%   +2.00%     
==========================================
  Files           7        7              
  Lines         827      847      +20     
==========================================
+ Hits          704      738      +34     
+ Misses        123      109      -14     
Impacted Files Coverage Δ
src/lm.jl 96.18% <100.00%> (-0.06%) :arrow_down:
src/GLM.jl 50.00% <0.00%> (ø)
src/glmfit.jl 79.29% <0.00%> (+0.55%) :arrow_up:
src/glmtools.jl 92.99% <0.00%> (+10.28%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 42a0d04...7c52e8d. Read the comment docs.

ayushpatnaikgit commented 2 years ago

@nalimilan @ViralBShah the PR is only failing on Julia nightly—windows. I don't think it's to do with anything that we've changed. What shall we do in this situation? It's happening with PR 482 as well.

ViralBShah commented 2 years ago

Best would be to isolate a minimal test case and file a bug against Julia.

ViralBShah commented 2 years ago

Actually do check the logs. This has to do with timezones, so ignore. Need to see what is going on though

nalimilan commented 2 years ago

I think the TimeZones issue is a network problem, I've seen it before.

Regarding the PR itself, I think we'll have to adjust the definition of nulldeviance and nullloglikelihood to cover models with no intercept, as currently the null model is defined as the one with only the intercept, but this doesn't make much sense when there's no intercept. This is an issue with have recently discussed too at https://github.com/JuliaStats/GLM.jl/pull/479#issuecomment-1104452820. Cc: @palday

nalimilan commented 2 years ago

(For some reason I can't comment in the thread above.) I think eachindex is fine here. People shouldn't play fitting models where the response, predictors and/or weights have different axes. That would be asking for trouble. (Note that the current code is silently incorrect in that case anyway, so throwing an error would be an improvement.)

nalimilan commented 2 years ago

Thanks, looks good! I just wonder what release strategy we should adopt. Technically this could be considered as a breaking change (because it changes the definition of nulldeviance and nullloglikelihood, cf. https://github.com/JuliaStats/StatsAPI.jl/pull/14) or as a bugfix (because the previous result didn't make a lot of sense, and was even misleading for the R²). We probably don't want to tag GLM 2.0 just for htis. We could adopt an intermediate approach and print a warning (at least temporarily) for models without an intercept. Opinions?

palday commented 2 years ago

@nalimilan I like the idea of issuing a warning for models without intercepts and a minor version bump -- it's a good thing to call out anyway that a lot of classical summary definitions are "weird" for models without intercepts.

nalimilan commented 2 years ago

@kleinschmidt Technically, it's possible to fit a model which has an intercept but for which hasintercept returns false by passing a model matrix which applies full dummy coding without a column of ones, right? Should we try to detect this?

nalimilan commented 2 years ago

I don't have rights to edit your branch so I pushed the commit directly to master with the warning. Thanks!