Additional Methods for RegressionModel

Nosferican commented 6 years ago

I would like to Feature Request the following methods:

StatsBase.model_matrix(obj::RegressionModel)
StatsBase.coefnames(obj::RegressionModel)
StatsBase.wald_test(obj::RegressionModel;
    restrictions::Vector{Symbol} = setdiff(StatsBase.coefnames(obj), ["(Intercept)"]),
    Vcov::AbstractMatrix = vcov(obj),
    rdf::Integer = StatsBase.dof_residual(obj))

model_matrix would return a the design matrix as a Matrix{Float64}.

This would allow various packages to access the field with a standard API instead of hunting down the structure of the structs.
coefnames would return the names of the coefficients as Vector{Symbol}.

It mirrors DataFrames.coefnames(obj::DataFrames.ModelFrame)
wald_test would return the Wald statistic for those restrictions as a Dict{Symbol,Any} with keys: Statistic (Value of test), p_value (p-value based on the F-Distribution with number of restrictions and residual degrees of freedom), and possibly some way to display the restrictions.

It would allow for a standard and automated default method rather than have each package define it (redundant code). Currently the most useful syntax is to especially which coefficients are jointly significant (different from zero) with the default being all, but the intercept if included. However, for being comprehensive a different syntax would have to be worked in order to allow for any valid linear combination restrictions, (e.g., coef(obj)[1] == coef(obj)[2] and coef(obj)[3] == 2 * coef(obj)[4]). The rdf keyword parameter allows for degrees of freedom adjustments commonly used with cluster-robust variance-covariance estimators.

For example,

output = Dict{Symbol,Float64}()
varlist = coefnames(obj)
β = StatsBase.coef(obj)
R = diagm(map(var -> var ∈ restrictions, varlist))
R = R[find(mapslices(col -> col[max(1, findfirst(col))] .== 1,
        RowEchelon.rref(round.(R, 12)), 1)),:]
k = size(R, 1)
Bread = R * β
Meat = inv(R * Vcov * R')
output[:Statistic] = (Bread' * Meat * Bread) / k
F_Dist = Distributions.FDist(k, rdf)
output[:p_value] = Distributions.ccdf(F_Dist, output[:Statistic])
output

nalimilan commented 6 years ago

model_matrix seems fine. coefnames makes sense, but is harder to implement right now as I don't think we have an standard API to pass these names when fitting a model from a formula: currently they are stored in DataFrameRegressionModel only. I think we need to change this so that the model knows about the coefficient names, even when fitting directly from a model matrix. One solution would be to pass names via an argument, another would be to use an AxisArray to represent the model matrix, with column names for coefficient names. Cc: @kleinschmidt.

Regarding wald_test, could you give examples of how the function would be used? Maybe we should give it a more general name so that tests other than Wald can be used to test restrictions (chosen via an argument)?

kleinschmidt commented 6 years ago

What would the desired behavior of coefnames be when there's no named data source backing the model? I think R does something like "X1, X2, ..." but I'm not sure.

Nosferican commented 6 years ago

Coefnames

I think each package can define the coefnames per their structs. For example,

StatsBase.coefnames(obj::DataFrames.DataFrameRegressionMode) =
    DataFrames.coefnames(getfield(obj, :mf))
abstract type PkgAbstractType <: StatsBase.RegressionModel end
StatsBase.coefnames(obj::PkgAbstractType) = getfield(obj, :varlist)

In case the model does not have the method I would prefer to have it not implemented. If a function wants to use a proxy such as "X1, X2, ..." then it can run it in the function rather than have it as an implemented method for objects without the information. Alternatively, it could be the default for StatsBase.RegressionModel and override that outcome when it is implemented.

Wald Test

I use it during coeftable as the F-test when using the conventional covariance matrix or Wald-test when using a robust variance-covariance estimator. See these example. It allows for testing joint-significance (e.g., commonly used with categorical variables with more than two levels).

nalimilan commented 6 years ago

Actually, I think it would be better to return an AxisArray from coef, with coefficient names as the dimension name (and a NamedTuple when fitting distributions with a fixed number of coefficients, which currently return a tuple). That would be much more natural and practical, e.g. if you want to extract one coefficient using its name. That's how it works in R. Currently coef is almost useless since you don't know which value corresponds to which coefficient.

A model type could still return a plain Array if it doesn't have names (which indicates the user doesn't care and handles names manually).

I use it during coeftable as the F-test when using the conventional covariance matrix or Wald-test when using a robust variance-covariance estimator. See these example. It allows for testing joint-significance (e.g., commonly used with categorical variables with more than two levels).

I meant, could you show some examples of what the Dict API you described would look like?

Nosferican commented 6 years ago

The simplest case could be something like:

Dict{Symbol,Float64} with 2 entries:
  :Statistic => 23.03

Given the rdf argument it can select the appropriate F-distribution and calculate the p-value,

Dict{Symbol,Float64} with 2 entries:
  :Statistic => 23.03
  :p_value   => 0.0008

More advanced iterations could have additional items such as the restrictions:

Dict{Symbol,Float64} with 2 entries:
  :Statistic => 23.03
  :p_value   => 0.0008
  :Restrictions => ["X1 = 0", "X2 = 1", "X3 = 2.5 * X4"]

It needs not to be a dictionary as the results could be a tuple for example.

The tricky part is developing a nice syntax to construct the restriction matrix based on some parsable syntax such as "X3 = 2.5 * X4 - X2" yielding (assuming X1 - X4 variables)

R = [ 0 1 1 -2.5]

and checking that it is a valid construct (no redundant restrictions, etc.). For now I think it would be best to just have the joint-significance specified as including the variable and alternatively passing a restriction matrix rather than a list of coefficients for more elaborated linear combinations.

nalimilan commented 6 years ago

Sorry, I had misread your description, I thought the Dict was supposed to be used to specify the restrictions. Rather than returning a Dict, I think it would make sense to return a special struct, just like we recently did for the F test (JuliaStats/GLM.jl#182).

I think it makes sense to start with a simple implementation, as long as it can be extended later without signature conflicts (e.g. formulas would be natural to use here). Can you have a look at what APIs other statistical software use for this test?

Regarding the question of how to store coef names, I've just filed https://github.com/JuliaStats/StatsModels.jl/issues/32, which is related in that it would allow accessing coefficient names from a Wald test function defined in StatsBase.

Nosferican commented 6 years ago

Sounds good. I believe a WaldTest struct could be a fine solution.

We could probably have an abstract type for tests (I am working on a submodule for Econometrics.jl with tests for heteroscedasticity, multicollinearity, consistency of estimator, etc.). A sub-abstract could manage the Wald test, the likelihood-ratio test and the Lagrange multiplier tests as these three are asymptotically equivalent and tell different aspects of the same question.

Stata, for example, implements these methods using the following syntax test: linear combinations,

test (tests all coefficients are equal to zero except the intercept)
test varlist (test all coefficients in varlist are equal to zero - joint significance)
test lincomb (test the specified linear combination(s))

Likelihood tests are done through lrtest.

Matlab computes the Wald test by passing the arguments as matrices (see documentation)

Not entirely sure how would the formulae work the most intuitively... I think the easiest would be to have (1) all, but intercept [default], (2) provide a Vector{Symbol}, or (3) pass a restriction matrix directly.

pdeffebach commented 5 years ago

Adding to this is two requests stemming from https://github.com/JuliaStats/GLM.jl/issues/259

We should separate the number of rows in the matrix used for estimation from the sum of the weights in the weight vector.

Current Julia behavior returns nobs(model) which gives the sum of the weights. Getting the number of observations in the regression requires inspecting the model frame.

In Stata:

reg y x [pw = w] displays the sum of weights, but does not store it in e(). e(N) gives the number of rows.
svy: reg y x where svyset [pw = w] does indeed store the weights in e(). It uses e(N) for the number of rows and e(N_pop) for the sum of weights.

In R the felm function in the package lfe returns a model where you can do

m$M: number of rows in the matrix
m$weights: the vector of weights used in the regression.

Nosferican commented 5 years ago

How about

nobs(obj::StatisticalModel, weights::Bool = true)

pdeffebach commented 5 years ago

Thats what milan suggested earlier, and I think it is the right solution. But would have to go into an agreed upon model api. I cross-posted here because this seems like the relevant StatsBase issue.

I feel like this needs buy-in from various people who create RegressionModel objects. Hope its helpful.

Thanks.

JuliaStats / StatsBase.jl

Additional Methods for RegressionModel #300

Coefnames

Wald Test