JuliaStats / GLM.jl

Generalized linear models in Julia

Taking weighting seriously #487

Open gragusa opened 2 years ago

gragusa commented 2 years ago

This PR addresses several problems with the current GLM implementation.

Current status

In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR, FrequencyWeights, AnalyticWeights, and ProbabilityWeights are all possible. The API is the following:

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts))
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts))
## Probability Weights
lm(@formula(y~x), df; wts=pweights(df.wts))

The old behavior, passing a plain vector via wts=df.wts, is deprecated; for the moment, the array df.wts is coerced to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a type parameter T<:AbstractWeights. Unweighted LM/GLM models use UnitWeights as this type parameter.
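For readers unfamiliar with the pattern, here is a minimal sketch (hypothetical type and function names, not the PR's actual code) of how a weight-type parameter enables dispatch:

using StatsBase  # AbstractWeights, UnitWeights, FrequencyWeights, ProbabilityWeights

# Stripped-down stand-in for a predictor carrying the weights in its type.
struct PredSketch{T<:AbstractWeights}
    X::Matrix{Float64}
    wts::T
end

# Methods can now be selected by weight kind, with no runtime checks:
vartype(::PredSketch{<:UnitWeights}) = "unweighted"
vartype(::PredSketch{<:FrequencyWeights}) = "frequency-weighted"
vartype(::PredSketch{<:ProbabilityWeights}) = "probability-weighted (sandwich variance)"

vartype(PredSketch(randn(5, 2), uweights(5)))  # "unweighted"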

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsApi.
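A hedged usage sketch of the proposed keyword, assuming a fitted weighted model m; the interpretation of the weighted variants as √wᵢ-scaled quantities follows the modelmatrix implementation quoted later in this thread:

r  = residuals(m)                    # unweighted residuals
rw = residuals(m; weighted=true)     # residuals scaled by sqrt.(wts)
Xw = modelmatrix(m; weighted=true)   # Diagonal(sqrt.(wts)) * X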

There are many changes I had to make to get everything working. Tests are passing, but some of the new features need new tests. Before implementing them, I wanted to ensure there was agreement on the approach taken.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I arrived at the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend https://github.com/JuliaStats/StatsAPI.jl/pull/16 to remove such a keyword from the signature.
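As a concrete illustration (not the PR's code): for OLS the estimating function evaluated at the solution is the n×p matrix of score contributions xᵢuᵢ, whose columns sum to zero by the normal equations. A self-contained sketch:

using GLM, DataFrames

df = DataFrame(x = randn(100))
df.y = 1 .+ 2 .* df.x .+ randn(100)
m = lm(@formula(y ~ x), df)

# Estimating function for OLS: row i is xᵢ * uᵢ
M = modelmatrix(m) .* residuals(m)
sum(M; dims=1)  # ≈ zeros(1, 2) at the least-squares solution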

Update

I think I have covered all the suggestions/comments, with one exception that I still have to think about; maybe it can be addressed later. The new standard errors (the ones for ProbabilityWeights) also work in the rank-deficient case (and so does cooksdistance).

Tests are passing, and I think they cover everything I have implemented. I also added a section to the documentation about using weights and updated the jldoctests with the new signature of CholeskyPivoted.

To do:

Closes https://github.com/JuliaStats/GLM.jl/issues/186.

codecov-commenter commented 2 years ago

Codecov Report

Patch coverage: 84.08%; project coverage change: -2.67% (warning)

Comparison is base (c13577e) 90.48% compared to head (fa63a9a) 87.82%.

Note: the current head fa63a9a differs from the pull request's most recent head 807731a. Consider uploading reports for commit 807731a to get more accurate results.

@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.48%   87.82%   -2.67%
==========================================
  Files           8        8
  Lines        1125     1191      +66
==========================================
+ Hits         1018     1046      +28
- Misses        107      145      +38

Impacted files (coverage; Δ):

src/GLM.jl       50.00%          (-10.00%)
src/glmfit.jl    81.41% <79.80%> (-0.51%)
src/lm.jl        89.06% <82.35%> (-5.35%)
src/glmtools.jl  93.49% <85.71%> (-0.96%)
src/linpred.jl   91.41% <90.32%> (-6.92%)

... and 1 file with indirect coverage changes.


lrnv commented 2 years ago

Hey,

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

I think the interface should allow passing the weights as a DataFrame column, which would take care of such things (as it does for the other variables).
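Until something like that is supported, a hedged workaround (hypothetical column names) is to drop the incomplete rows yourself before fitting, so the weights stay aligned:

using DataFrames, GLM

# Keep only rows complete for the variables in the formula,
# so df2.weights lines up with the rows GLM actually uses.
df2 = dropmissing(df, [:y, :x])
lm(@formula(y ~ x), df2; wts = df2.weights)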

gragusa commented 2 years ago

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

Not really, but it would be easy to add as a feature. Before digging further into this, though, I would like to know whether there is consensus on the approach of this PR.

alecloudenback commented 1 year ago

FYI this appears to fix #420; a PR was started in #432, and the author closed it for lack of time to investigate CI failures.

Here's the test case pulled from #432, which passes with the changes in #487.

using DataFrames, GLM, LinearAlgebra, StableRNGs, Test

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(rng, 100)
    x1_2 = 3 * x1              # collinear with x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2           # collinear with x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2,
                   weights = repeat([1, 0.5], 50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)  # dropcollinear = true is the default
    # Naive weighted least squares on the linearly independent columns
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X) \ X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa? It would be great to get this merged if it's good to go.

nalimilan commented 1 year ago

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

bkamins commented 1 year ago

A very nice PR. In the tests, can we have a test set that compares the results of aweights, fweights, and pweights on the same data (coefficients, predictions, covariance matrix of the estimates, p-values, etc.)?
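A hedged sketch of what such a comparison could look like with the API proposed in this PR (hypothetical data; the reference values for the per-type quantities would come from R/Stata):

using GLM, DataFrames, StatsBase, Test

df = DataFrame(x = randn(50))
df.y = 1 .+ 2 .* df.x .+ randn(50)
w = float.(rand(1:3, 50))

m_f = lm(@formula(y ~ x), df; wts = fweights(w))
m_a = lm(@formula(y ~ x), df; wts = aweights(w))
m_p = lm(@formula(y ~ x), df; wts = pweights(w))

# Point estimates must agree across weight types; standard errors, vcov,
# and p-values are what differ and should each be checked against
# reference values.
@test coef(m_f) ≈ coef(m_a) ≈ coef(m_p)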

gragusa commented 1 year ago

@nalimilan thanks for the careful review - I will make the suggestions part of the PR.

As for the testing, as suggested by @bkamins, we can probably use the same data to test everything with different weights. I will try to do this.

As for the QR: I want to add the pivoted QR; however, I am not sure how to do this. It is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.

Related to pivoting: something we should definitely add is the possibility for crossmodelmatrix and modelmatrix to return the linearly independent columns together with their indexes in the original full matrix. This would fix cooksdistance (which only works for full-rank designs) and also make momentmatrix useful in the rank-deficient case. Are there any thoughts about this?

gragusa commented 1 year ago

@nalimilan, or somebody with GitHub fu: I accidentally synced my fork gragusa/GLM.jl with JuliaStats/GLM:master. I have the commits in a branch, gragusa/GLM.jl/JuliaStats-master, but if I push changes to this branch, the PR is not updated. Any help?

nalimilan commented 1 year ago

Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbastractWeights).

nalimilan commented 1 year ago

As for the QR: I want to add the pivoted QR; however, I am not sure how to do this. It is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.

I haven't investigated this at all, but shouldn't this be similar to how it works for Cholesky decomposition?

Related to pivoting: something we should definitely add is the possibility for crossmodelmatrix and modelmatrix to return the linearly independent columns together with their indexes in the original full matrix. This would fix cooksdistance (which only works for full-rank designs) and also make momentmatrix useful in the rank-deficient case. Are there any thoughts about this?

That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)

gragusa commented 1 year ago

Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbastractWeights).

Yes. I can see the commit as well!

gragusa commented 1 year ago

That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)

Ok. I will work on it, once I tame this monster here!

gragusa commented 1 year ago

I updated the documentation, but while it builds correctly locally, I see failures in the checks here.

The reason is that the type signatures for GLM and LM differ from the ones I get locally. Locally I am using Julia 1.7; the doc GitHub action probably uses Julia 1.8. Is it possible that type signatures differ between versions? I will try later to build locally with 1.8 and update the docs in case that's the reason.

bkamins commented 1 year ago

Is it possible that type signatures differ between versions?

Yes, it is possible. Output changes are not considered breaking between Julia versions (though I have not checked whether that is the case here; another common cause of such problems is having a different set of packages loaded locally than on CI, in which case the prefixes of types might change).

gragusa commented 1 year ago

No, it is not the prefix; the CholPred signature output differs in CI from what I get locally, and by the way, the output seems to be wrong, messing up the curly brackets. But I am checking this carefully.


bkamins commented 1 year ago

Now I have looked at the output - it seems that the order of 2 type parameters is reversed.

gragusa commented 1 year ago

Now I have looked at the output - it seems that the order of 2 type parameters is reversed.

What happened is that CholeskyPivoted changed its type signature between Julia versions.

gragusa commented 1 year ago

@nalimilan what do you think about this?

gragusa commented 1 year ago

@nalimilan I committed your suggested changes. I see you bumped several review comments that I had previously addressed (or answered). I will reply to those here:

1) https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297147 Why did dof_residual change type? Because UnitWeights has eltype Int (previously, no weighting meant an empty float array).

2) https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297465 What is the question you are referring to?

3) https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297566

and others, I think.

As for the general PR: I am very, very puzzled by loglikelihood, deviance, and nulldeviance. I tried R and Stata, and with weights they give different, conflicting results. I have to think about this... I cannot look at the code, since R's is GPL and Stata's is proprietary.

nalimilan commented 1 year ago

Ah sorry the GitHub UI doesn't show the threads in which I added the comments. These are: https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978063134

  1. Why did dof_residual change type? Because UnitWeights has eltype Int (previously, no weighting meant an empty float array).

Ah, I see, I think I originally defined nobs to always return a float to make the function type-stable, as the presence/absence of weights was not reflected in the model type. Now this is OK thanks to weight vector types.

As for the general PR: I am very, very puzzled by loglikelihood, deviance, and nulldeviance. I tried R and Stata, and with weights they give different, conflicting results. I have to think about this... I cannot look at the code, since R's is GPL and Stata's is proprietary.

Can you point me to an example of a problematic model? I can have a look, including at the R code, since I'm not the one writing the implementation.

gragusa commented 1 year ago

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032

Isn't this equivalent?

Suggested change:

-    X = modelmatrix(obj; weighted=isweighted(obj))
-    if k == size(X, 2)
-        XtX = crossmodelmatrix(obj; weighted=isweighted(obj))
+    X = modelmatrix(obj; weighted=true)
+    if k == size(X, 2)
+        XtX = crossmodelmatrix(obj; weighted=true)

modelmatrix is implemented as

function modelmatrix(obj::LinPredModel; weighted::Bool=false)
    if isweighted(obj)
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end

so the keyword argument does not do much... Should I define it as

function modelmatrix(obj::LinPredModel; weighted::Bool=isweighted(obj))
    if weighted
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end

After this change they will be equivalent... however, for the unweighted model, it is better to specify weighted=false to avoid a matrix multiplication.

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032

For the standard case (homoskedastic errors), invchol returns a (p × p) matrix with NaN for coefficients corresponding to the collinear rows. The formula is $\hat{\sigma}^2 (X'X)^{-1}$, so we get NaN for the variance/standard errors of these coefficients. For the weighted case, the formula for the variance is $(X'X)^{-1} \sum_{i=1}^n X_i'u_iu_i'X_i (X'X)^{-1}$; without the machinery, NaN would propagate through the matrix products, resulting in an all-NaN matrix. The machinery removes the NaN elements, computes the matrix products, and then expands the matrix back to its original size. I hope it makes sense.
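A small illustration of the propagation problem (toy matrices, not the PR's internals): one NaN row/column in the bread matrix contaminates every entry of the sandwich product, so the collinear indices must be dropped before multiplying and re-inserted afterwards.

using LinearAlgebra

A = [1.0 NaN; NaN NaN]   # stand-in for invchol: NaN row/column for the dropped coefficient
B = [2.0 0.1; 0.1 3.0]   # stand-in for the "meat" matrix
A * B * A                # every entry is NaN

keep = [1]                                          # linearly independent columns
V = A[keep, keep] * B[keep, keep] * A[keep, keep]   # finite block, re-expanded later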

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 ...about performance for the unweighted case.

Committed a change

https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835

Will take care of this... needs testing.

gragusa commented 1 year ago

Likelihood issue

The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.

The problem is that one can define post-estimation likelihoods that (somewhat ad hoc) try to take the nature of the weights into account. Being ad hoc, they are arbitrary. See the document I made here (html). You can see how Stata's numbers match neither R's nor this PR's.

Personally, I would define the likelihood in the same way for all weights; it is more transparent. If we want to define different likelihoods, then we have to carefully document what they are (at the moment it is not clear what they are in Stata; I know what they are in R).

nalimilan commented 1 year ago

After this change they will be equivalent... however, for the unweighted model, it is better to specify weighted=false to avoid a matrix multiplication.

I'd use if weighted && isweighted(x) to avoid the unnecessary multiplication.
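That combination would look like this (a sketch based on the implementation quoted above, not a drop-in patch):

function modelmatrix(obj::LinPredModel; weighted::Bool=isweighted(obj))
    # Only pay for the multiplication when the model is actually weighted.
    if weighted && isweighted(obj)
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end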

For the standard case (homoskedastic errors), invchol returns a (p x p) matrix with NaN for coefficients corresponding to the collinear rows. The formula is...

OK, thanks for the details.


The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.

The problem is that we can sometimes define post-estimation likelihood that (a little bit ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this document I made here html. You can see how Stata numbers are not the same with those of R and those of this PR.

Personally, I would define the likelihood in the same way for all weights - more transparent. If we want to define different likelihoods then we have to carefully document what they are (at the moment is not clear what they are in Stata - I know what they are in R).

AFAICT we can't use the same likelihood definition for all weights types, as they need to be scale-free for probability weights, right? One criterion I can see to choose how to define the likelihood is that the McFadden pseudo-R² for normal family with identity link should be equal to the R² of the equivalent linear model. Does that make sense?

gragusa commented 1 year ago

AFAICT we can't use the same likelihood definition for all weights types, as they need to be scale-free for probability weights, right? One criterion I can see in choosing how to define the likelihood is that the McFadden pseudo-R² for a normal family with an identity link should be equal to the R² of the equivalent linear model. Does that make sense?

They all use the same likelihood up to a normalizing constant (which, as you correctly pointed out, depends on the scale of the weights). Also, the deviance is the same regardless of the weight type, which is what we have (consistent with R). The deviance is twice the difference between the loglik and the null loglik. Thus, the differences between these likelihoods are due to a constant scaling, which does not matter when using the likelihood for comparisons.

For instance, the R survey package seems to calculate the likelihood as (minus) half the ~likelihood~ deviance (with a deviance that equals that of GLM.jl and of R with analytic weights).

nalimilan commented 1 year ago

For instance, R survey package seems to calculate the likelihood as (minus) half the likelihood (with a deviance that is equal to GLM.jl and R with analytic weights).

You mean "half the deviance", right? Indeed I just checked that it's the case with svyglm.

But yeah now I get that the constant factor goes away when computing the ratio of likelihoods so requiring the McFadden pseudo-R² to equal the R² isn't enough to choose a definition.

However, the definition used by svyglm doesn't seem to work well with other pseudo-R² measures: for example, the Cox-Snell pseudo-R², 1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model)), always seems to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?

BTW, the R survey package is intended mainly for complex designs (strata, etc.), which means that most standard statistics don't work anyway. This may have led the package to diverge from standard functions (e.g. logLik prints a warning). Since our scope is more limited here (no complex sampling), we might better use different definitions. Not sure.

gragusa commented 1 year ago

However, the definition used by svyglm doesn't seem to work well with other pseudo-R², as for example the Cox-Snell pseudo-R²1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model)) always seem to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?

If the deviance is given by $$D = -2\,(logLik(model) - logLik(nullmodel)),$$

then $$\text{pseudo-}R^{2} = 1 - \exp(D/nobs(model)),$$ which is again equal regardless of the weighting type. I believe the differences between software are due to different implementations and choices of the normalization constant. For instance, Stata with aweights reports the deviance using rescaled weights $\bar{w}_i = w_i \cdot \frac{n}{\sum_{i=1}^n w_i}$ (i.e., weights rescaled to sum to $n$).
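A quick numeric check of the scaling argument (hypothetical numbers): any constant added to both log-likelihoods cancels in the Cox-Snell formula, so definitions that differ only by a normalization constant give the same pseudo-R².

coxsnell(ll0, ll1, n) = 1 - exp(2 * (ll0 - ll1) / n)

ll0, ll1, n = -120.0, -95.0, 100   # hypothetical null and model log-likelihoods
c = 7.3                             # arbitrary additive normalization constant
coxsnell(ll0, ll1, n) ≈ coxsnell(ll0 + c, ll1 + c, n)   # true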

nalimilan commented 1 year ago

Note that the relation $$D = -2\,(logLik(model) - logLik(nullmodel))$$ does not actually hold even for unweighted models (see https://github.com/JuliaStats/GLM.jl/pull/115#discussion_r45355268). And the pseudo-R² is defined in terms of the log-likelihood, not in terms of the deviance.

In R, with svyglm, the deviance is the same when fitting an unweighted model and when fitting a model with all weights equal. But the log-likelihood is completely different, since R's glm (and GLM.jl) doesn't define the log-likelihood as $-D/2$. So I don't think we can adopt the definition from svyglm.

Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.
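The normalization suggested here would be a one-liner (sketch):

# Rescale probability weights so they sum to the number of observations.
normalize_to_n(w) = w .* (length(w) / sum(w))

w = [0.2, 0.5, 1.3, 2.0]
sum(normalize_to_n(w)) ≈ length(w)   # true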

For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see https://github.com/JuliaStats/StatsBase.jl/issues/758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton

ParadaCarleton commented 1 year ago

For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see JuliaStats/StatsBase.jl#758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton

I outlined a response there--sorry I forgot about it. Similar comments apply to GLM.jl: I don't think it's necessarily possible to create a single method for glm that deals with analytic weights, unless lots and lots of keywords are added to supply additional information. The problem with analytic weights is that there are a lot of assumptions that go into a good meta-analysis: questions like whether the means and variances can be considered homogeneous, whether there's possibly publication bias, and which estimators are best (as the choice is highly nontrivial when the data has been pre-aggregated).

At some point, someone will probably have to create a Julia package for meta-analysis, but that will require lots of hard work and research by a professional statistician--probably more than we can fit into a single pull request for GLM.jl.

nalimilan commented 1 year ago

Thanks for commenting. Though in other software (R, Stata...) it's standard to use weighted least squares estimation with analytic weights. For example, in R, ?lm says:

Non-‘NULL’ ‘weights’ can be used to indicate that different
observations have different variances (with the values in
‘weights’ being inversely proportional to the variances); or
equivalently, when the elements of ‘weights’ are positive integers
w_i, that each response y_i is the mean of w_i unit-weight
observations (including the case that there are w_i observations
equal to y_i and the data have been summarized). However, in the
latter case, notice that within-group variation is not used.
Therefore, the sigma estimate and residual degrees of freedom may
be suboptimal; in the case of replication weights, even wrong.
Hence, standard errors and analysis of variance tables should be
treated with care.

Do you think we should provide a way to get the same estimates as R, or is that definition too broken to be useful? If so, with which weights type?

If that's too tricky we could support only UnitWeights, FrequencyWeights and ProbabilityWeights for now, as the PR is already large.

gragusa commented 1 year ago

The issue in StatsBase.jl and this PR are pretty unrelated. No matter what weights are used, the estimated coefficients (the point estimates) will be the same. They all maximize $$\sum_{i=1}^n w_i \ell_i(\theta),$$ where $\ell_i(\theta)$ is the log-likelihood contribution of observation $i$ under the model that one would maximize if weights were not provided.

With ProbabilityWeights the variance is different. There is also the issue of what likelihood to report. Again, under ProbabilityWeights the log-likelihood being maximized is a quasi-likelihood, and different software reports different log-likelihoods.

~AnaliticWeights~ ImportanceWeights are the most commonly used because they have less structure and are helpful in those cases where the weights are not related to sampling. For instance, in specific applications in labor economics, one weights the observations by the hours worked by workers. Or in propensity-score-based causal inference, the weights are the probability of being treated.

So we can keep splitting hairs about the weights and their definitions, but I think it would only add complexity. R admits only ~AnaliticWeights~ ImportanceWeights for lm and glm. Other, more specialized packages then use glm and lm with ~AnaliticWeights~ ImportanceWeights: the survey package uses glm underneath for svyglm with ~analytic~ importance weights.

nalimilan commented 1 year ago

So we can keep splitting hairs about the weights and their definitions, but I think it would only add complexity.

Ensuring we have correct definitions isn't "splitting hairs". :-) There's no question that point estimates are the same. The issue is whether standard errors and log-likelihood are correct for the definition of weights we declare in StatsBase.

Anyway, if we focus on probability weights first as it's the clearest case, what do you think about my previous question?

Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.

gragusa commented 1 year ago

Ensuring we have correct definitions isn't "splitting hairs". :-) There's no question that point estimates are the same. The issue is whether standard errors and log-likelihood are correct for the definition of weights we declare in StatsBase.

Sorry, I did not mean to be dismissive. From a regression point of view, getting the right definition does not really change much here.

Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.

The issue here is complicated. For probability weights, the estimates are NOT the MLE, but simply the quasi-MLE. So there isn't a right definition of log-likelihood to return in this case. Think about the simple case of regression: the likelihood we maximize is normal with constant mean, but the true variance is not constant (it actually depends on the estimated parameters). So what we can do is: (1) return a likelihood with sigma taken as constant, or (2) return a likelihood with the estimated variance. Stata does (2); R takes the deviance and divides it by -2, which in practice amounts to (1). Both are justifiable in the sense that, for nested models, the log-likelihood difference converges under the null, as $n\to\infty$, to a $\chi^2$ distribution. Normalizing the weights to sum to the number of observations can be done, I guess, but probability weights have a clear interpretation as (inverse) probabilities; re-normalizing them seems unnatural to me. Personally, I would try to match what Stata is doing, but doing what R (in the survey package) does is the easiest to implement. But again, I have no strong feelings about this.
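To make options (1) and (2) concrete, here is a hedged sketch for the weighted normal case (my own formulas following the description above, not any package's code):

# Option (1): deviance-based, as R's survey package reportedly does.
loglik_from_deviance(dev) = -dev / 2

# Option (2): plug in the weighted variance estimate (Stata-style, per the
# description above). u are the residuals, w the weights.
function loglik_plugin(u, w)
    n  = sum(w)
    σ² = sum(w .* u .^ 2) / n
    -n / 2 * (log(2π * σ²) + 1)
end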

nalimilan commented 1 year ago

OK, thanks. Does the Stata definition ensure that the log-likelihood with all weights equal is equivalent to no weighting? That's really the major property to ensure, IMO.

ParadaCarleton commented 1 year ago

I think the issue in StatsBase.jl and this PR are quite unrelated. No matter what weights are used, the estimated coefficients (the point estimates) are going to be the same. They all maximize $$\sum_{i=1}^n w_i \ell_i(\theta),$$ where $\ell_i(\theta)$ is the log-likelihood contribution of observation $i$ under the model that one would maximize if weights were not provided.

That's true, but the main concern is that likelihoods, confidence intervals, and standard errors aren't clearly defined. For instance, if weights represent hours worked, then the standard errors should be defined differently from the way they are for meta-analyses. (In addition, the weights should be represented as generic Weights rather than AnalyticWeights.)

One solution is to provide point estimates with AnalyticWeights, but not confidence intervals or standard errors. Later on we can try reporting confidence intervals or likelihoods, as long as we're able to figure out a way to make these unambiguous with keyword arguments. For now, users who understand what their data actually mean can bootstrap their own intervals.

bkamins commented 1 year ago

but not confidence intervals or standard errors.

This was my initial instinct also. To move forward with this PR, I would propose adding what we are sure is OK. The disputable things should be opened as separate issues, so we can first discuss what we want and then make a PR adding the missing pieces.

gragusa commented 1 year ago

That's true, but the main concern is that likelihoods, confidence intervals, and standard errors aren't clearly defined. For instance, if weights represent hours worked, then the standard errors should be defined differently from the way they are for meta-analyses. (In addition, the weights should be represented as generic Weights rather than AnalyticWeights.)

I don't have an opinion about the naming convention, but confidence intervals, standard errors, and likelihood are all well defined in this case. I don't really know what standard errors meta-analyses use, but here the only definitional problem is with probability weights, since they change the log-likelihood. Having GLM.jl not give standard errors with certain weights would be very weird and against what all the other statistical software do. For sure I would stop using it immediately.

bkamins commented 1 year ago

Having GLM.jl not give standard error with certain weights would be very weird and against what all the other statistical software do.

My proposal was to add it in a separate PR if this is a disputable thing. This PR is large so it is usually more practical to break things into multiple PRs.

For sure I would stop to use it immediately.

I assume a GLM.jl release would not be made before this issue is resolved, so you would not see it as a user. If a user decides to work with the master branch, then it is normal that not all things are release-ready.

nalimilan commented 1 year ago

@gragusa Do all other implementations give the same results with analytic weights regarding standard errors and log-likelihood? If that's the case, then at least we can settle on that and defer the discussion regarding the definition of AnalyticWeights for later. Also, maybe leave the implementation of ProbabilityWeights for later so that we can merge this PR, which is already a clear improvement.

ParadaCarleton commented 1 year ago

That's true, but the main concern is that likelihoods, confidence intervals, and standard errors aren't clearly defined. For instance, if weights represent hours worked, then the standard errors should be defined differently from the way they are for meta-analyses. (In addition, the weights should be represented as generic Weights rather than AnalyticWeights.)

I don't have an opinion about the naming convention, but confidence intervals, standard errors, and likelihood are all well defined in this case. I don't really know what standard errors meta-analyses use, but here the only definitional problem is with probability weights, since they change the log-likelihood. Having GLM.jl not give standard errors with certain weights would be very weird and against what all the other statistical software do. For sure I would stop using it immediately.

The standard errors are very different depending on whether the weights represent importance (e.g., when different workers have different numbers of hours) or sample sizes. When working with meta-analyses, the total precision (inverse variance) should equal the sum of the precisions of the different studies, and the user should be providing a vector or matrix of sufficient statistics. Then the likelihood is exactly equal to the sum of the likelihoods for each variable--assuming that all effects are the same across studies, which is typically not the case.

If you believe this doesn't describe the kinds of weights you need, feel free to implement a new set of weights in StatsBase. That being said, AnalyticWeights are meant to represent meta-analyses, and their treatment is complex enough that I feel they should be left for another PR.

It's definitely possible to calculate a standard error, but this requires additional information other than the weights themselves, such as the within-study variances.
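A small illustration of the precision-pooling point (hypothetical study variances): under a fixed-effect meta-analysis, the pooled estimate's precision is the sum of the study precisions.

variances  = [0.04, 0.09, 0.25]      # hypothetical per-study variances
precisions = 1 ./ variances
pooled_var = 1 / sum(precisions)     # pooled (inverse-variance weighted) variance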

I believe Stata also doesn't calculate confidence intervals or standard errors for some weight types, when the calculation is ambiguous.

gragusa commented 1 year ago

Standard errors are the same. Given that you feel so competent, I suggest that you, @ParadaCarleton, draft a different PR. For the moment I am going to close this, given that I am not expert enough on the subject.

nalimilan commented 1 year ago

@gragusa Please don't give up now! That would be too bad given that the PR is mostly ready. May I suggest a plan to get this merged?

  1. Let's throw an error when calling loglikelihood for analytic and probability weights for now. Anyway, these are not needed very often, so what matters is getting the rest of the features merged. We can discuss that later.
  2. Can you confirm again that we give the same standard errors as R and Stata for analytic and probability weights? I think that's the case looking at your HTML doc, but better to be sure. If so, I think everything is fine (we can discuss another definition of weights later, but it's useful to be able to replicate what other software give).
  3. Tests are needed (which was the last blocking point before this debate started).

gragusa commented 1 year ago

  1. Let's throw an error when calling loglikelihood for analytic and probability weights for now.

Sounds good.

  2. Can you confirm again that we give the same standard errors as R and Stata for analytic and probability weights?

I can confirm. I will make sure to add more tests (some are already in the PR).

  3. Tests are needed (which was the last blocking point before this debate started).

I will add them.

@nalimilan Could you please merge StatsAPI PR #16?

nalimilan commented 1 year ago

Note that we already have Weights in StatsBase for cases where weights have no precise meaning, so I don't think we need ImportanceWeights. Anyway for such weights, standard errors and all other quantities that depend on the kind of weights should be NaN, right?

gragusa commented 1 year ago

1) I did not use Weights because it is too generic a name. I can rename them.

2) No -- these are the weights for which we know how to build standard errors and likelihoods. These are the default weights of R, which are called iweights in Stata. The ones for which we do not know how to calculate standard errors are AnalyticWeights.

Let me be clear about this:

1) FrequencyWeights -> everything works (std errors and likelihoods); matches both Stata and R
2) ImportanceWeights -> everything works (std errors and likelihoods); matches the standard errors of R and Stata and the R likelihood (the Stata likelihood adds a constant)
3) ProbabilityWeights -> standard errors match R and Stata; the likelihood is not implemented

AnalyticWeights are not implemented.

gragusa commented 1 year ago

@nalimilan is there a plan to require a newer version of Julia?

nalimilan commented 1 year ago

Wow, I hadn't realized that Stata provides standard errors when importance weights are used. I'm surprised, as I thought the equivalent of R's glm with weights was analytic weights in Stata. That seems problematic to me given that there's no clear definition of these weights. Stata defines importance weights as:

iweights, or importance weights, indicate the relative “importance” of the observation. They have no formal statistical definition; this is a catch-all category. Any command that supports iweights will define how they are treated. They are usually intended for use by programmers who want to produce a certain computation.

This post also makes it clear that they are not reliable.

Do you know what the meaning of standard errors with importance weights is in Stata (and in R's glm)? I could find this post about R, but it says that degrees of freedom need to be adjusted to get proper standard errors when observations are averages of multiple observations. So that doesn't seem very useful -- more like a trap.

@nalimilan is there a plan to require a newer version of Julia?

I guess we can start requiring 1.6 if that's useful; we just did that in DataFrames.

nalimilan commented 1 year ago

Another possibly interesting reference is Python's statsmodels, which implements frequency and "variance" (analytical) weights.

gragusa commented 1 year ago

Another possibly interesting reference is Python's statsmodels, which implements frequency and "variance" (analytical) weights.

What they call var_weights are our ImportanceWeights. Their description of the weights is just wrong.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from rdatasets import data
engel = data("quantreg", "engel")

endog = engel.income
exogg = engel.foodexp
w = np.log(3*engel.income)

mod2 = smf.glm(formula="foodexp~income", data=engel, var_weights=w, family=sm.families.Gaussian()).fit()

print(mod2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                foodexp   No. Observations:                  235
Model:                            GLM   Df Residuals:                      233
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                      1.1151e+05
Method:                          IRLS   Log-Likelihood:                -1455.6
Date:                dom, 23 ott 2022   Deviance:                   2.5982e+07
Time:                        18:29:01   Pearson chi2:                 2.60e+07
No. Iterations:                     3   Pseudo R-squ. (CS):             0.9910
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    156.2304     16.432      9.508      0.000     124.025     188.436
income         0.4774      0.014     33.267      0.000       0.449       0.506
==============================================================================

In Julia

using GLM
using RDatasets

engel = RDatasets.dataset("quantreg", "engel")
engel.w = log.(3.0 .* engel.Income)

julia> a = GLM.glm(@formula(FoodExp ~ Income), engel, Normal(); wts=iweights(engel.w))
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink, ImportanceWeights{Float64, Float64, Vector{Float64}}}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}, ImportanceWeights{Float64, Float64, Vector{Float64}}}}, Matrix{Float64}}

FoodExp ~ 1 + Income

Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)   Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  156.23     16.4316      9.51    <1e-20  124.025     188.436
Income         0.47739   0.0143501  33.27    <1e-99    0.449264    0.505516
───────────────────────────────────────────────────────────────────────────

and

julia> loglikelihood(a)
-1455.5651152380574

ParadaCarleton commented 1 year ago

I think something like importance weights could be useful, but they'd probably need more documentation or specifics. I'm not sure what they're supposed to mean or represent.

Something interesting I noticed is that the example of weighting, where we want to weight by number of hours worked, is actually a set of probability weights, although quite a weird one. You can think of it as wanting to sample one hour of work uniformly at random from the population of all hours worked; because we have a uniform random sample of workers instead, each work hour's probability of inclusion is one over the total number of hours worked by the employee.

nalimilan commented 1 year ago

CI failures on Julia 1.0 can be fixed by requiring Julia 1.6 (more and more packages have started doing that).