JuliaStats / GLM.jl

Generalized linear models in Julia
Other
587 stars 114 forks source link

Taking weighting seriously #487

Open gragusa opened 2 years ago

gragusa commented 2 years ago

This PR addresses several problems with the current GLM implementation.

Current status In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts)
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts)
## ProbabilityWeights
lm(@formula(y~x), df; wts=pweights(df.wts)

The old behavior -- passing a vector wts=df.wts is deprecated and for the moment, the array os coerced df.wts to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a parameter T<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsApi.

There are many changes that I had to make to make everything work. Tests are passing, but some new feature needs new tests. Before implementing them, I wanted to ensure that the approach taken was liked.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I arrived to the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend https://github.com/JuliaStats/StatsAPI.jl/pull/16 to remove such a keyword from the signature.

Update

I think I covered all the suggestions/comments with this exception as I have to think about it. Maybe this can be addressed later. The new standard errors (the one for ProbabilityWeights) also work in the rank deficient case (and so does cooksdistance).

Tests are passing and I think they cover everything that I have implemented. Also, added a section in the documentation about using Weights and updated jldoc with the new signature of CholeskyPivoted.

To do:

Closes https://github.com/JuliaStats/GLM.jl/issues/186.

alecloudenback commented 1 year ago

Sorry for the noise, but thank you @gragusa and reviewers for this big PR. As a user I've been watching for weighting for a while and appreciate the technical expertise and dedication to quality here.

gragusa commented 1 year ago

@nalimilan let’s give this a final push. Should I rebase this PR against #339 ? (rhetorical question!) What’s the most efficient way?

nalimilan commented 1 year ago

Yes the PR needs to be rebased against master -- or, simpler, merge master into the branch. Most conflicts seem relatively simple to resolve. You can try doing this online on GitHub, though there's always a chance that it won't be 100% correct the first time. Otherwise you can do that locally with git fetch; git merge origin/master. Or I can do it in a few days if you want.

gragusa commented 1 year ago

Don’t worry .. I am already on it.

Sent from Outlook for iOShttps://aka.ms/o0ukef


From: Milan Bouchet-Valat @.> Sent: Wednesday, November 23, 2022 7:41:56 PM To: JuliaStats/GLM.jl @.> Cc: Giuseppe Ragusa @.>; Mention @.> Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)

Yes the PR needs to be rebased against master -- or, simpler, merge master into the branch. Most conflicts seem relatively simple to resolve. You can try doing this online on GitHub, though there's always a chance that it won't be 100% correct the first time. Otherwise you can do that locally with git fetch; git merge origin/master. Or I can do it in a few days if you want.

— Reply to this email directly, view it on GitHubhttps://github.com/JuliaStats/GLM.jl/pull/487#issuecomment-1325507314, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAD5DAUXTDGDDJXXCRUZLHTWJZQPJANCNFSM53WBWMMQ. You are receiving this because you were mentioned.Message ID: @.***>

jeremiedb commented 1 year ago

@nalimilan were there remaining fixes to have this PR completed? I was worried to have the important work brought by this PR loose its momentum.

gragusa commented 1 year ago

Mostly time 🤣

I think there few things to fix (addressing all the comments of @nalimilan) and making few decisions …

I could Find some time in the next week to finish it if @nalimilan has some time to support me.

Sent from Outlook for iOShttps://aka.ms/o0ukef


From: jeremiedb @.> Sent: Wednesday, March 1, 2023 5:21:44 AM To: JuliaStats/GLM.jl @.> Cc: Giuseppe Ragusa @.>; Mention @.> Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)

@nalimilanhttps://github.com/nalimilan were there remaining fixes to have this PR completed? I was worried to have the important work brought by this PR loose its momentum.

— Reply to this email directly, view it on GitHubhttps://github.com/JuliaStats/GLM.jl/pull/487#issuecomment-1449318346, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAD5DAXJBMS64DMYJSZ573TWZ3FFRANCNFSM53WBWMMQ. You are receiving this because you were mentioned.Message ID: @.***>

nalimilan commented 1 year ago

Sure !

bkamins commented 1 year ago

While we are at weights my question is if we should not update the ftest implementation also? The issue is that currently ftest assumes that nobs and dof are integer (which does not have to hold with weights).

ParadaCarleton commented 1 year ago

While we are at weights my question is if we should not update the ftest implementation also? The issue is that currently ftest assumes that nobs and dof are integer (which does not have to hold with weights).

ftest, in its original form, won’t work here, I think. Mentioning @smishr who might know more about the specifics here.

bkamins commented 1 year ago

ftest, in its original form, won’t work here

I agree that we would need to carefully consider all cases of weights. I have not thought about probabilistic weights. However, for frequency weights and analytical weights, assuming we produce correct deviance and dof_residual things should be correct.

I have just checked against examples in Wooldridge, chapter 8 and properly scaled analytical weights produce correct result.

Also, in general, I think we should ensure that every function in GLM.jl that accepts model estimated with weighting should either:

(it does not have to be in this PR, but if we are taking weighting seriously, I think we should ensure this property when we make a release)

Thank you for working on this!

smishr commented 1 year ago

In survey datasets weights are commonly calibrated to sum upto an (integral) population size.

While we are at weights my question is if we should not update the ftest implementation also? The issue is that currently ftest assumes that nobs and dof are integer (which does not have to hold with weights).

While most applications of F-test have integral dof, the F distribution is continuous and well-defined for non-integral values of dof.

In R:

> df(1.2, df1 = 10, df2 = 20)
[1] 0.5626125
> df(1.2, df1 = 10, df2 = 20.1)
[1] 0.5630353

Julia and R agree

julia> using Distributions

julia> d = FDist(10, 20)
FDist{Float64}(ν1=10.0, ν2=20.0)

julia> pdf(d, 1.2)
0.5626124566227022

julia> d = FDist(10, 20.1)
FDist{Float64}(ν1=10.0, ν2=20.1)

julia> pdf(d, 1.2)
0.5630352744353205

There is this StackExchange post discussing non-integral dof for t-tests and for GAMs in this post.

ftest, in its original form, won’t work here, I think. Mentioning @smishr who might know more about the specifics here.

The F-test is essentially the ratio of two variances. For the weighted GLM case, variances based on the weighted Least Squares could be used to calculate test statistic.

Note: whether doing an (adjusted) F-Test for comparing weighted GLM models is the right approach, that is up for debate...

ParadaCarleton commented 1 year ago

Hmm, did any of the people who worked on Survey.jl leave comments here? @iuliadmtru @aviks

gragusa commented 1 year ago

I finally found the time to rebase this PR against the latest main repository. Tests pass locally; let's see whether they pass on the CI.

I have a few days of "free" time and would like to finish this. @nalimilan It is difficult to track the comments and which ones were addressed by the various commit. On my side, the primary decision is about weight scaling. But before engaging in a conversation, I will add documentation so that whoever will contribute to the discussion can do it coherently.

Test passed!

nalimilan commented 1 year ago

Cool. Do you need any input from my side?

SamuelMathieu-code commented 7 months ago

Hi there! I wonder what will happen to this PR? As I understand, one review from a person with write access is needed?

gragusa commented 6 months ago

Just wanted to give a quick update on the PR.

The PR was almost ready to go, but now, with more PR being merged, there are a few things that need to be straightened out. I should be able to work on it again next week to make sure everything's in good shape. Then I hope somebody will help get this merged.

From: Samuel Mathieu @.> Date: Wednesday, January 31, 2024 at 17:37 To: JuliaStats/GLM.jl @.> Cc: Giuseppe Ragusa @.>, Mention @.> Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)

Hi there! I wonder what will happen to this PR? As I understand, one review from a person with write access is needed?

— Reply to this email directly, view it on GitHubhttps://github.com/JuliaStats/GLM.jl/pull/487#issuecomment-1919479838, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAD5DAVWZZIQDXOS5AQHBSLYRJXN5AVCNFSM53WBWMM2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJRHE2DOOJYGM4A. You are receiving this because you were mentioned.Message ID: @.***>