Open gragusa opened 2 years ago
Patch coverage: 84.08% and project coverage change: -2.67% :warning:
Comparison is base (c13577e) 90.48% compared to head (fa63a9a) 87.82%.
:exclamation: Current head fa63a9a differs from pull request most recent head 807731a. Consider uploading reports for the commit 807731a to get more accurate results.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Hey,
Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?
I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables).
> Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?
Not really. But it would be easy to make this a feature. Before digging further on this, I would like to know whether there is consensus on the approach of this PR.
FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures.
Here's the test case pulled from #432, which passes with the changes in #487.
@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(rng, 100)          # use the seeded rng for reproducibility
    x1_2 = 3 * x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5], 50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)  #, dropcollinear = true
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X) \ (X'W*y)
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end
Can this test set be added?
Is there any other feedback for @gragusa ? It would be great to get this merged if good to go.
Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.
A very nice PR. In the tests, can we have a test set that compares the results of aweights, fweights, and pweights for the same set of data (coeffs, predictions, covariance matrix of the estimates, p-values, etc.)?
@nalimilan thanks for the careful review - I will make the suggestions part of the PR.
As for the testing - as suggested by @bkamins - probably we can use the same data to test everything with different weights. I will try to do this.
As for the QR: I want to add the pivoted QR. However, I am not sure how to do this … it is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.
Related to pivoting: something we should definitely add is the possibility for crossmatrix and modelmatrix to return the linearly independent columns together with their indexes in the original full matrix. This would fix cookdistance (which only works for full-rank designs) and also make momentmatrix useful for rank-deficient designs. Are there any thoughts about this?
@nalimilan or somebody with GitHub fu: I accidentally synced my fork gragusa/GLM.jl with JuliaStats/GLM:master. I have the commits in a branch gragusa/GLM.jl/JuliaStats-master. But if I push changes to this branch, the PR is not updated. Any help?
Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbstractWeights).
> As for the QR: I want to add the pivoted QR. However, I am not sure how to do this … it is easy to write a rank-revealing pivoted QR decomposition, but I am not sure how to use it to obtain the solution for the linearly independent columns of the system.
I haven't investigated this at all, but shouldn't this be similar to how it works for Cholesky decomposition?
> Related to pivoting: something we should definitely add is the possibility for crossmatrix and modelmatrix to return the linearly independent columns together with their indexes in the original full matrix. This would fix cookdistance (which only works for full-rank designs) and also make momentmatrix useful for rank-deficient designs. Are there any thoughts about this?
That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)
> Are you sure? I see the same commits here and on the branch (Throw error if GlmResp are not AbstractWeights).
Yes. I can see the commit as well!
> That's tricky. Let's discuss this in a separate issue as we have enough on our plate here. ;-)
Ok. I will work on it, once I tame this monster here!
I updated the documentation, but while it builds correctly locally, I see failures in the checks here.
The reason is that the type signatures for GLM and LM differ from the ones I get locally. Locally I am using Julia 1.7; the doc GitHub action probably uses Julia 1.8. Is it possible that type signatures differ between versions? I will try later to build locally with 1.8 and update the doc in case that's the reason.
> Is it possible that type signatures differ between versions?
Yes, it is possible. Output changes are not considered breaking between Julia versions (but I have not checked if this is the case here; another common reason for such problems is that locally you have a different set of packages loaded than on CI, in which case the prefixes of types might change).
No, it is not the prefix; it is that the CholPred signature output is different in the CI than locally. And by the way, the output seems to be wrong, messing up the curly brackets. But I am checking this carefully.
Now I have looked at the output - it seems that the order of 2 type parameters is reversed.
> Now I have looked at the output - it seems that the order of 2 type parameters is reversed.
What happened is that CholeskyPivoted changed its signature between versions:
1.7: CholeskyPivoted{Float64, Matrix{Float64}}
1.8: CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}
@nalimilan what do you think about this?
@nalimilan I committed your suggested changes. I see you bumped several review comments that I had previously addressed (or answered). I am going to report these here:
1) https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297147 Why did dof_residual change type? Because UnitWeights has eltype Int (previously no weighting meant an empty float array).
2) https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297465 What is the question you are referring to?
3) https://github.com/JuliaStats/GLM.jl/pull/487#discussion_r985297566 and others, I think.
As for the general PR: I am very, very puzzled by loglikelihood, deviance, and nulldeviance. I tried R and Stata, and for weights they all give different, conflicting results. I have to think about this..... I cannot see the code, since R is GPL and Stata is proprietary.
Ah sorry the GitHub UI doesn't show the threads in which I added the comments. These are: https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r958885835 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978083032 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978063134
> - Why did dof_residual change type? Because UnitWeights has eltype Int (previously no weighting meant an empty float array).
Ah, I see. I think I originally defined nobs to always return a float to make the function type-stable, as the presence/absence of weights was not reflected in the model type. Now this is OK thanks to the weight vector types.
> As for the general PR: I am very, very puzzled by loglikelihood, deviance, and nulldeviance. I tried R and Stata, and for weights they all give different, conflicting results. I have to think about this..... I cannot see the code, since R is GPL and Stata is proprietary.
Can you point me to an example of a problematic model? I can have a look, including at the R code, since I'm not the one writing the implementation.
Isn't this equivalent?
Suggested change:
- X = modelmatrix(obj; weighted=isweighted(obj))
- if k == size(X,2)
- XtX = crossmodelmatrix(obj; weighted=isweighted(obj))
+ X = modelmatrix(obj; weighted=true)
+ if k == size(X, 2)
+ XtX = crossmodelmatrix(obj; weighted=true)
modelmatrix is implemented as
function modelmatrix(obj::LinPredModel; weighted::Bool=false)
    if isweighted(obj)
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end
so the keyword argument does not do much.... should I define it as
function modelmatrix(obj::LinPredModel; weighted::Bool=isweighted(obj))
    if weighted
        mul!(obj.pp.scratchm1, Diagonal(sqrt.(obj.pp.wts)), obj.pp.X)
    else
        obj.pp.X
    end
end
After this change they will be equivalent. However, for the unweighted model it is better to specify weighted=false to avoid a matrix multiplication.
For the standard case (homoskedastic errors), invchol returns a (p x p) matrix with NaN for coefficients corresponding to the collinear rows. The formula is $\hat{\sigma}^2 (X'X)^{-1}$, and we will get NaN for the variance/stderrors of these coefficients. For the weighted case, the formula for the variance is $(X'X)^{-1} \sum_{i=1}^n X_i'u_iu_i'X_i (X'X)^{-1}$; without the machinery, NaN will propagate through the matrix products, resulting in a NaN matrix. The machinery removes the NaN elements, computes the matrix products, and then expands the matrix back to its original size. I hope it makes sense.
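The masking step can be sketched with plain linear algebra. This is a toy illustration with made-up matrices, not GLM.jl internals:

```julia
using LinearAlgebra

# Toy (3 x 3) pieces: NaN marks the row/column of a dropped collinear column.
invXtX = [0.5 0.1 NaN;
          0.1 0.2 NaN;
          NaN NaN NaN]
meat = [1.0 0.3 0.0;
        0.3 2.0 0.0;
        0.0 0.0 1.0]

# Without the machinery, NaN propagates through the sandwich product:
all(isnan, invXtX * meat * invXtX)  # true: every entry becomes NaN

# The machinery: drop the NaN indices, compute, then expand back.
keep = [i for i in axes(invXtX, 1) if !isnan(invXtX[i, i])]
V = fill(NaN, size(invXtX))
V[keep, keep] = invXtX[keep, keep] * meat[keep, keep] * invXtX[keep, keep]
V  # finite for estimable coefficients, NaN only where the column was dropped
```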
https://github.com/JuliaStats/GLM.jl/pull/487/files/dbc9ae999d0dd43af023e5b182014fdef7afa41f#r978061661 ...about performance for the unweighted case.
Committed a change
Will take care of this.....needs testing
The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.
The problem is that we can sometimes define post-estimation likelihoods that (a little ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this document I made here html. You can see how the Stata numbers do not agree with those of R or those of this PR.
Personally, I would define the likelihood in the same way for all weights - more transparent. If we want to define different likelihoods, then we have to carefully document what they are (at the moment it is not clear what they are in Stata - I know what they are in R).
> After this change they will be equivalent. However, for the unweighted model it is better to specify weighted=false to avoid a matrix multiplication.
I'd use if weighted && isweighted(x) to avoid the unnecessary multiplication.
For the standard case (homoskedastic errors),
invchol
returns a (p x p) matrix withNaN
for coefficients corresponding to the collinear rows. The formula is...
OK, thanks for the details.
> The likelihood (or quasi-likelihood) used for fitting in all three cases is the same - that's why we get the same point estimates.
> The problem is that we can sometimes define post-estimation likelihoods that (a little ad hoc) try to take into account the nature of the weights. Being ad hoc, they are arbitrary. See this document I made here html. You can see how the Stata numbers do not agree with those of R or those of this PR.
> Personally, I would define the likelihood in the same way for all weights - more transparent. If we want to define different likelihoods, then we have to carefully document what they are (at the moment it is not clear what they are in Stata - I know what they are in R).
AFAICT we can't use the same likelihood definition for all weights types, as they need to be scale-free for probability weights, right? One criterion I can see to choose how to define the likelihood is that the McFadden pseudo-R² for normal family with identity link should be equal to the R² of the equivalent linear model. Does that make sense?
> AFAICT we can't use the same likelihood definition for all weights types, as they need to be scale-free for probability weights, right? One criterion I can see to choose how to define the likelihood is that the McFadden pseudo-R² for normal family with identity link should be equal to the R² of the equivalent linear model. Does that make sense?
They all use the same likelihood up to a normalizing constant (which, as you correctly pointed out, depends on the scale of the weights). Also, the deviance is the same regardless of the weight type, which is what we have (consistent with R). The deviance is twice the difference between the loglik and the null loglik. Thus, the differences between these likelihoods are due to constant scaling, which does not matter when using the likelihood for comparisons.
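The cancellation of the constant is easy to check numerically. A minimal sketch (the weighted normal log-likelihood helper below is mine, not GLM.jl API):

```julia
using Statistics

# Weighted Gaussian log-likelihood at a fixed variance s2.
wloglik(y, μ, s2, w) = sum(@. w * (-0.5 * log(2π * s2) - (y - μ)^2 / (2 * s2)))

y = [1.2, 0.7, 2.9, 1.4, 2.1]
w = [1.0, 0.5, 2.0, 1.0, 1.5]
μ1, μ0 = 1.8, mean(y)   # toy "fitted" and null means
s2 = 1.0
c = 3.1                 # an arbitrary additive normalizing constant

# Two conventions differing by the constant c give the same deviance-style
# difference, so likelihood comparisons are unaffected:
d1 = 2 * (wloglik(y, μ1, s2, w) - wloglik(y, μ0, s2, w))
d2 = 2 * ((wloglik(y, μ1, s2, w) + c) - (wloglik(y, μ0, s2, w) + c))
d1 ≈ d2  # true
```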
For instance, R's survey package seems to calculate the likelihood as (minus) half the ~~likelihood~~ deviance (with a deviance that is equal to GLM.jl's and to R's with analytic weights).
> For instance, R's survey package seems to calculate the likelihood as (minus) half the likelihood (with a deviance that is equal to GLM.jl's and to R's with analytic weights).
You mean "half the deviance", right? Indeed I just checked that it's the case with svyglm.
But yeah now I get that the constant factor goes away when computing the ratio of likelihoods so requiring the McFadden pseudo-R² to equal the R² isn't enough to choose a definition.
However, the definition used by svyglm doesn't seem to work well with other pseudo-R² measures: for example, the Cox-Snell pseudo-R², 1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model)), always seems to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?
BTW, the R survey package is intended mainly for complex designs (strata, etc.), which means that most standard statistics don't work anyway. This may have led the package to diverge from standard functions (e.g. logLik prints a warning). Since our scope is more limited here (no complex sampling), we might better use different definitions. Not sure.
> However, the definition used by svyglm doesn't seem to work well with other pseudo-R² measures: for example, the Cox-Snell pseudo-R², 1 - exp(2 * (logLik(nullmodel) - logLik(model)) / nobs(model)), always seems to return 1 in my tests. This means that fitting a model with probability weights all equal to one doesn't give the same Cox-Snell pseudo-R² as fitting an unweighted model. That's problematic, right?
If the deviance is given by $$D = -2(logLik(model) - logLik(nullmodel))$$
then
$$\text{pseudo-}R^{2} = 1 - \exp(D / nobs(model)),$$
which is again equal regardless of the weighting type. I believe the differences between software are due to different implementations and choices of the normalization constant. For instance, Stata for aweights reports the deviance using rescaled weights $\bar{w}_i = w_i \cdot \frac{n}{\sum_{j=1}^n w_j}$.
Note that the relation $$D = -2(logLik(model) - logLik(nullmodel))$$ does not actually hold even for unweighted models (see https://github.com/JuliaStats/GLM.jl/pull/115#discussion_r45355268). And the pseudo-R² is defined in terms of the log-likelihood, not in terms of the deviance.
In R, with svyglm, the deviance is the same when fitting an unweighted model and when fitting a model with all weights equal. But the log-likelihood is completely different, since R's glm (and GLM.jl) doesn't define the log-likelihood as $-D/2$. So I don't think we can adopt the definition from svyglm.
Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.
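The proposed normalization is a one-liner. A sketch of it (this is the proposal under discussion, not current GLM.jl behavior):

```julia
# Rescale weights so that they sum to the number of observations.
normalize_to_n(w) = w .* (length(w) / sum(w))

w = [2.0, 2.0, 4.0, 8.0]
wn = normalize_to_n(w)
sum(wn) == length(w)  # true: normalized weights sum to n

# With all probability weights equal, normalization gives unit weights,
# so the fit would coincide with the unweighted model:
normalize_to_n(fill(3.5, 5)) ≈ ones(5)  # true
```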
For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see https://github.com/JuliaStats/StatsBase.jl/issues/758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton
> For analytic weights the situation is more tricky as the definition isn't completely clear in Stata nor in StatsBase (see JuliaStats/StatsBase.jl#758). But it seems that at least when all weights are equal to one we should get the same results as an unweighted model. Cc: @ParadaCarleton
I outlined a response there (sorry, I forgot about it). Similar comments apply to GLM.jl: I don't think it's necessarily possible to create a single method for glm that deals with analytic weights, unless lots and lots of keywords are added to supply additional information. The problem with analytic weights is that there are a lot of assumptions that go into a good meta-analysis: questions like whether the means and variances can be considered homogeneous, whether there's possibly publication bias, and which estimators are best (as the choice is highly nontrivial when the data has been pre-aggregated).
At some point, someone will probably have to create a Julia package for meta-analysis, but that will require lots of hard work and research by a professional statistician, probably more than we can fit into a single pull request for GLM.jl.
Thanks for commenting. Though in other software (R, Stata...) it's standard to use weighted least squares estimation with analytic weights. For example, in R, ?lm says:
Non-‘NULL’ ‘weights’ can be used to indicate that different observations have different variances (with the values in ‘weights’ being inversely proportional to the variances); or equivalently, when the elements of ‘weights’ are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized). However, in the latter case, notice that within-group variation is not used. Therefore, the sigma estimate and residual degrees of freedom may be suboptimal; in the case of replication weights, even wrong. Hence, standard errors and analysis of variance tables should be treated with care.
Do you think we should provide a way to get the same estimates as R, or is that definition too broken to be useful? If so, with which weights type?
If that's too tricky, we could support only UnitWeights, FrequencyWeights and ProbabilityWeights for now, as the PR is already large.
The issue in StatsBase.jl and this PR are pretty unrelated. No matter what weights are used, the estimated coefficients (the point estimates) will be the same. They all maximize $$\sum_{i=1}^n w_i \ell_i(\theta),$$ where $\ell_i(\theta)$ is the log-likelihood contribution of observation $i$ in the model one would maximize if weights were not provided.
With ProbabilityWeights the variance is different. There is also the issue of what likelihood to report: under ProbabilityWeights the log-likelihood being maximized is a quasi-likelihood, and different software reports different log-likelihoods.
~~AnalyticWeights~~ ImportanceWeights are the most commonly used, because they have less structure and are helpful in cases where the weights are not related to sampling. For instance, in specific applications in labor economics one weights the observations by the hours worked by workers. Or in propensity-score-based causal inference, the weights are the probabilities of being treated.
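The point-estimate claim is easy to verify with plain linear algebra (a sketch, independent of GLM.jl): for a Gaussian model, maximizing the weighted log-likelihood is weighted least squares, and a common weight scale cancels out of the solution.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(42)
X = [ones(50) randn(rng, 50)]      # design: intercept + one regressor
y = X * [1.0, 2.0] + randn(rng, 50)

β_unw = (X'X) \ (X'y)              # unweighted OLS
W = Diagonal(fill(3.7, 50))        # all weights equal, arbitrary scale
β_w = (X' * W * X) \ (X' * W * y)  # weighted least squares

β_unw ≈ β_w  # true: a common weight scale does not move the point estimates
```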
So we can keep splitting hairs about the weights and their definitions, but I think it would only add complexity. R admits only ~~AnalyticWeights~~ ImportanceWeights for lm and glm. Other, more specialized packages then use glm and lm with ~~AnalyticWeights~~ ImportanceWeights: the survey package underneath uses glm for svyglm with ~~analytic~~ importance weights.
> So we can keep splitting hairs about the weights and their definitions, but I think it would only add complexity.
Ensuring we have correct definitions isn't "splitting hairs". :-) There's no question that point estimates are the same. The issue is whether standard errors and log-likelihood are correct for the definition of weights we declare in StatsBase.
Anyway, if we focus on probability weights first as it's the clearest case, what do you think about my previous question?
> Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.
> Ensuring we have correct definitions isn't "splitting hairs". :-) There's no question that point estimates are the same. The issue is whether standard errors and log-likelihood are correct for the definition of weights we declare in StatsBase.
Sorry, I did not want to be dismissive. From a regression point of view, getting the right definition does not really change much here.
> Don't you agree that the log-likelihood should be the same when fitting a model without weights and when fitting a model with probability weights being all equal? That seems to imply we should choose the right normalization of weights to ensure that property. A simple solution would be to normalize them to sum to the number of observations. Then we could use the same formula to compute the likelihood as for frequency weights.
The issue here is complicated. For probability weights, the estimates are NOT the MLE, but simply the quasi-MLE. So there isn't a right definition of log-likelihood to return in this case. Think about the simple case of regression: the likelihood we maximize is normal with constant mean, but the true variance is not constant (it actually depends on the estimated parameters). So what we can do is: 1) return a likelihood with sigma taken as constant; 2) return a likelihood with the estimated variance. Stata does (2); R takes the deviance and divides it by (-2), which in practice amounts to doing (1). Both are justifiable in the sense that for nested models, the log-likelihood difference converges as $n\to\infty$, under the null, to a $\chi^2$ distribution. Normalizing the weights to sum to the number of observations can be done, I guess, but probability weights have a clear interpretation as (inverse) probabilities, so re-normalizing them seems unnatural to me. Personally, I would try to match what Stata is doing, but doing what R (in the survey package) does is the easiest to implement. Again, I have no strong feelings about this.
OK, thanks. Does the Stata definition ensure that the log-likelihood with all weights equal is equivalent to no weighting? That's really the major property to ensure IMO.
> I think the issue in StatsBase.jl and this PR are quite unrelated. No matter what weights are used, the estimated coefficients (the point estimates) are going to be the same. They all maximize $$\sum_{i=1}^n w_i \ell_i(\theta),$$ where $\ell_i(\theta)$ is the log-likelihood contribution one would maximize if weights were not provided.
That's true, but the main concern is that likelihoods, confidence intervals, and standard errors aren't clearly defined. For instance, if weights represent hours worked, then the standard errors should be defined differently from the way they are for meta-analyses. (In addition, the weights should be represented as generic Weights rather than AnalyticWeights.)
One solution is to provide point estimates with AnalyticWeights, but not confidence intervals or standard errors. Later on we can try reporting confidence intervals or likelihoods, as long as we're able to figure out a way to make these unambiguous with keyword arguments. For now, users who understand what their data actually mean can bootstrap their own intervals.
> but not confidence intervals or standard errors.
This was my initial instinct also. To go forward with this PR I would propose to add what we are sure is OK. The disputable things should be opened as separate issues, to first discuss what we want and then make a PR adding the missing pieces.
> That's true, but the main concern is that likelihoods, confidence intervals, and standard errors aren't clearly defined. For instance, if weights represent hours worked, then the standard errors should be defined differently from the way they are for meta-analyses. (In addition, the weights should be represented as generic Weights rather than AnalyticWeights.)
I don't have an opinion about the naming convention, but confidence intervals, standard errors, and likelihood are all well defined in this case. I don't really know what standard errors meta-analyses use, but here the only definitional problem is with probability weights, since they change the log-likelihood. Having GLM.jl not give standard errors with certain weights would be very weird and against what all the other statistical software do. For sure I would stop using it immediately.
> Having GLM.jl not give standard errors with certain weights would be very weird and against what all the other statistical software do.
My proposal was to add it in a separate PR if this is a disputable thing. This PR is large so it is usually more practical to break things into multiple PRs.
> For sure I would stop using it immediately.
I do not assume a GLM.jl release would be made before this issue is resolved, so you would not see it as a user. If a user decides to work with the master branch, then it is normal that not all things are release-ready.
@gragusa Do all other implementations give the same results with analytic weights regarding standard errors and log-likelihood? If that's the case, then at least we can settle on that and defer the discussion regarding the definition of AnalyticWeights for later. Also, maybe leave the implementation of ProbabilityWeights for later so that we can merge this PR, which is already a clear improvement.
> That's true, but the main concern is that likelihoods, confidence intervals, and standard errors aren't clearly defined. For instance, if weights represent hours worked, then the standard errors should be defined differently from the way they are for meta-analyses. (In addition, the weights should be represented as generic Weights rather than AnalyticWeights.)

> I don't have an opinion about the naming convention, but confidence intervals, standard errors, and likelihood are all well defined in this case. I don't really know what standard errors meta-analyses use, but here the only definitional problem is with probability weights, since they change the log-likelihood. Having GLM.jl not give standard errors with certain weights would be very weird and against what all the other statistical software do. For sure I would stop using it immediately.
The standard errors are very different depending on whether they represent importance (e.g. when different workers have different numbers of hours) or sample sizes. When working with meta-analyses, the total precision (inverse variance) should equal the sum of the precision for all different studies, and the user should be providing a vector or matrix of sufficient statistics. Then, the likelihood is exactly equal to the sum of the likelihoods for each variable--assuming that all effects are the same across studies, which is typically not the case.
If you believe this doesn't describe the kinds of weights you need, feel free to implement a new set of weights in StatsBase. That being said, AnalyticWeights are meant to represent meta-analyses, and their treatment is complex enough that I feel they should be left for another PR.
It's definitely possible to calculate a standard error, but this requires additional information other than the weights themselves, such as the within-study variances.
I believe Stata also doesn't calculate confidence intervals or standard errors for some weight types, when the calculation is ambiguous.
Standard errors are the same. Given that you feel so competent, I suggest that you, @ParadaCarleton, draft a different PR. For the moment I am going to close this, given that I am not expert enough on the subject.
@gragusa Please don't give up now! That would be too bad given that the PR is mostly ready. May I suggest a plan to get this merged?
- Let's throw an error when calling loglikelihood for analytic and probability weights for now. Anyway these are not needed so often, so what matters is getting the rest of the features merged. We can discuss that later.
> - Let's throw an error when calling loglikelihood for analytic and probability weights for now. Anyway these are not needed so often, so what matters is getting the rest of the features merged. We can discuss that later.

Sounds good.

> - Can you confirm again that we give the same standard errors as R and Stata for analytic and probability weights? I think that's the case looking at your HTML doc but better be sure of that. If so I think everything is fine (we can discuss another definition of weights later but it's useful to be able to replicate what other software give).

I can confirm. I will make sure to add more tests (some are already in the PR).

> - Tests are needed (which was the last blocking point before this debate started).

I will add them.
@nalimilan Could you please merge the StatsAPI PR#16?
Note that we already have Weights in StatsBase for cases where weights have no precise meaning, so I don't think we need ImportanceWeights. Anyway, for such weights, standard errors and all other quantities that depend on the kind of weights should be NaN, right?
1) I did not use Weights because it is too generic a name. I can rename them.
2) No -- these are the weights for which we know how to build standard errors and likelihoods. These are the default weights of R, which are called iweight in Stata.
The ones for which we do not know how to calculate standard errors are the AnalyticWeights.
Let me be clear about this:
1) FrequencyWeights -> everything works (std errors and likelihoods): matches both Stata and R.
2) ImportanceWeights -> everything works (std errors and likelihoods): matches the standard errors of R and Stata and the R likelihood (the Stata likelihood adds a constant).
3) ProbabilityWeights -> standard errors match R and Stata. The likelihood is not implemented.
AnalyticWeights are not implemented.
@nalimilan is there a plan to require a higher version of julia?
Wow, I hadn't realized that Stata provides standard errors when importance weights are used. I'm surprised, as I thought the equivalent of R's glm with weights was analytic weights in Stata. That seems problematic to me given that there's no clear definition of these weights. Stata defines importance weights as:
iweights, or importance weights, indicate the relative “importance” of the observation. They have no formal statistical definition; this is a catch-all category. Any command that supports iweights will define how they are treated. They are usually intended for use by programmers who want to produce a certain computation.
This post also makes it clear that they are not reliable.
Do you know what's the meaning of standard errors with importance weights in Stata (and in R's glm
)? I could find this post about R, but it says that degrees of freedom need to be adjusted to get proper standard errors when observations are the average of multiple observations. So that doesn't seem very useful -- more like a trap.
> @nalimilan is there a plan to require a higher version of Julia?
I guess we can start requiring 1.6 if that's useful; we just did that in DataFrames.
Another possibly interesting reference is Python's statsmodels, which implements frequency and "variance" (analytical) weights.
> Another possibly interesting reference is Python's statsmodels, which implements frequency and "variance" (analytical) weights.
What they call `var_weights` are our `ImportanceWeights`. Their description of the weights is just wrong.
```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from rdatasets import data

engel = data("quantreg", "engel")
w = np.log(3 * engel.income)
mod2 = smf.glm(formula="foodexp ~ income", data=engel,
               var_weights=w, family=sm.families.Gaussian()).fit()
print(mod2.summary())
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                foodexp   No. Observations:                  235
Model:                            GLM   Df Residuals:                      233
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                      1.1151e+05
Method:                          IRLS   Log-Likelihood:                -1455.6
Date:                dom, 23 ott 2022   Deviance:                   2.5982e+07
Time:                        18:29:01   Pearson chi2:                 2.60e+07
No. Iterations:                     3   Pseudo R-squ. (CS):             0.9910
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    156.2304     16.432      9.508      0.000     124.025     188.436
income         0.4774      0.014     33.267      0.000       0.449       0.506
==============================================================================
```
In Julia:

```julia
using GLM
using RDatasets

engel = RDatasets.dataset("quantreg", "engel")
engel.w = log.(3.0 .* engel.Income)

a = glm(@formula(FoodExp ~ Income), engel, Normal(); wts=ImportanceWeights(engel.w))
```

which gives

```
FoodExp ~ 1 + Income

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  156.23      16.4316      9.51    <1e-20  124.025    188.436
Income         0.47739    0.0143501  33.27    <1e-99    0.449264   0.505516
───────────────────────────────────────────────────────────────────────────
```

and

```julia
julia> loglikelihood(a)
-1455.5651152380574
```
I think something like importance weights could be useful, but they'd probably need more documentation or specifics. I'm not sure what they're supposed to mean or represent.
Something interesting I noticed is that the example of weighting, where we want to weight by number of hours worked, is actually a set of probability weights, although quite a weird one. You can think of it as wanting to sample one hour of work uniformly at random from the population of all hours worked; because we have a uniform random sample of workers instead, each work hour's probability of inclusion is one over the total number of hours worked by the employee.
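The hours-worked construction above can be spelled out numerically. In this sketch (my own illustration with made-up hours, not code from the PR), sampling one work hour uniformly would include worker `i` with probability proportional to their hours, so with a uniform sample of workers the probability weight is the inverse of that inclusion probability:

```python
import numpy as np

hours = np.array([10.0, 40.0, 25.0, 5.0])  # hypothetical hours worked per sampled worker

# Probability that a uniformly drawn work hour belongs to worker i
p_incl = hours / hours.sum()

# Probability weights: inverse inclusion probabilities, so workers with
# few hours (under-represented in the hour-level population) get more weight
pweights = 1.0 / p_incl
```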
CI failures on Julia 1.0 can be fixed by requiring Julia 1.6 (more and more packages have started doing that).
This PR addresses several problems with the current GLM implementation.
Current status: in master, GLM/LM only accepts weights through the keyword `wts`. These weights are implicitly frequency weights.

With this PR, `FrequencyWeights`, `AnalyticWeights`, and `ProbabilityWeights` are possible. The API is the following.

The old behavior -- passing a vector `wts=df.wts` -- is deprecated, and for the moment the array `df.wts` is coerced to `FrequencyWeights`.

To allow dispatching on the weights, `CholPred` takes a parameter `T<:AbstractWeights`. The unweighted LM/GLM has `UnitWeights` as the parameter for the type.

This PR also implements `residuals(r::RegressionModel; weighted::Bool=false)` and `modelmatrix(r::RegressionModel; weighted::Bool=false)`. The new signature for these two methods is pending in StatsAPI.

There were many changes that I had to make to get everything working. Tests are passing, but some new features need new tests. Before implementing them, I wanted to make sure that the approach taken was liked.
I have also implemented `momentmatrix`, which returns the estimating function of the estimator. I arrived at the conclusion that it does not make sense to have a keyword argument `weighted` there. Thus I will amend https://github.com/JuliaStats/StatsAPI.jl/pull/16 to remove such a keyword from the signature.

Update: I think I covered all the suggestions/comments, with one exception that I still have to think about. Maybe it can be addressed later. The new standard errors (the ones for `ProbabilityWeights`) also work in the rank-deficient case (and so does `cooksdistance`).

Tests are passing and I think they cover everything that I have implemented. I also added a section to the documentation about using weights and updated the jldoc with the new signature of `CholeskyPivoted`.

To do:
Closes https://github.com/JuliaStats/GLM.jl/issues/186.