JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia

Multipart Formula #21

Open matthieugomez opened 7 years ago

matthieugomez commented 7 years ago

Several arguments of a fit function typically refer to dataframe variables that are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high-dimensional fixed effects, mixed models, etc.

It would be nice to think about the best syntax to refer to these variables. I have thought about three potential syntaxes:

  1. Define a macro for each argument that needs to capture variable names
    fit(df, @formula(y ~ x1), @weight(x2), @vcov(cluster(x3+x4)), @where(x5 >= 0), maxiter = 100)
  2. Define a model macro that accepts multiple arguments, i.e. @model(expr, args...).
    fit(df, @model(y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0)), maxiter = 100)
  3. Define a fit macro that accepts multiple arguments (syntax closest to reg in Stata and to @with in DataFramesMeta.jl), i.e. @fit(expr1, expr2, args...)
    @fit df y ~ x1 weight = x2 vcov = cluster(x3+x4) where = (x5 >= 0) maxiter = 100
    # or (either would work)
    @fit(df, y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0), maxiter = 100)

An additional benefit is that agreeing on a syntax would help standardize the names of commonly used arguments like "weights", "vcov", and "where" across different packages that do statistical estimation. Enforcing these keyword arguments across different statistical estimations, like in Stata, could do a lot to improve the user experience.

ararslan commented 6 years ago

I believe @kleinschmidt has implemented one-sided formulas, e.g. @formula(~ x + t), in a branch somewhere.

ararslan commented 6 years ago

I hesitate to get too far into macro purgatory where it becomes unclear what things are doing.

kleinschmidt commented 6 years ago

I think it's the only way to deal with this though, right?

Nosferican commented 6 years ago

Basically,

abstract type ModelBuilder end
macro model(data::DataFrames.AbstractDataFrame, mb::ModelBuilder)
    # Packages will create a model builder to use with @model in order to parse
    # arguments and return their PkgStruct <: StatsBase.RegressionModel, for instance,
    # to be used in StatsBase.fit!(model::PkgStruct)
end

Formulas will allow one-sided forms for those that require them, such as absorbing variables. Maybe expand ModelFrame / ModelMatrix to take multiple formulas? That way it could fix the issues of missing values and return a list of ModelFrames / ModelMatrixes, or something?

Brainstorm,

mypart = mymodelbuilder(formula = y ~ x1 + x2 & x3, # edited
                        iv = z1 ~ z2 + z3,
                        absorb = ~ fe1 + fe2)
model = StatsModels.@model df mypart
StatsBase.fit!(model)
StatsBase.coeftable(model)

nalimilan commented 6 years ago

The point of using a macro like @model is that it would transform all formula arguments (identified by their ~) into Formula objects, which can then be treated like normal objects. So you wouldn't need to specify @formula for each argument (and you don't need a second macro like @model if you do that).
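
A minimal sketch of that transformation (hypothetical macro, assuming the Julia 0.6-era Expr layout where y ~ x parses as a :call to ~):

macro model(args...)
    # wrap any argument that is a call to ~ in @formula; pass the rest through
    isformula(ex) = isa(ex, Expr) && ex.head == :call && ex.args[1] == :~
    wrapped = [isformula(ex) ? :(@formula($ex)) : esc(ex) for ex in args]
    # delegate to an ordinary fit method with the formulas already parsed
    :(fit($(wrapped...)))
end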

Nosferican commented 6 years ago

For correctly parsing the Expr with ex.head == :(~) it would need to work with:

  1. Standard (lhs::Symbol, rhs::Union{Expr,Symbol,Integer,Void}) response ~ exogenous
  2. Instrumental (lhs::Union{Symbol,Expr}, rhs::Union{Symbol,Expr}) endogenous ~ additional instruments ... maybe include here those à la MixedModels?
  3. One sided such as fixed effects (lhs::Void, rhs::Union{Symbol,Expr})

Would arguments without ex.head == :(~) be passed as Expr?

nalimilan commented 6 years ago

Anything with ex.head == :(~) would be treated as if it were inside a @formula call. Then we can discuss what kinds of formulas should be supported. Anything else would be treated as a standard argument to a standard function.

kleinschmidt commented 6 years ago

My half-baked proposal is that the @model macro is called with a data frame (or other tabular source) and a model constructor expression, and lowers to a container type kind of like DataFrameStatisticalModel, but which extracts all the arguments of the constructor that involve ~, replacing them in the constructor call with model matrices (and storing the parsed formula data structures themselves). To be concrete, that would look like

@model df GLM(y~x, family=Binomial)
# becomes
let formulas = (Formula(:y, :x),)
    TableModel{GLM}(df, formulas, GLM(modelmatrix(formulas[1], df), family=Binomial))
end

The @model macro and the TableModel type are defined in StatsModels, as are generic fit! methods for TableModel{<:Any} that just call fit! on the model field. If you want to define special fit! behavior that does something other than just use the model matrices generated by the formulas, then you can depend on StatsModels and define methods for fit!(::TableModel{MyModelType}) (and other methods from StatsBase, etc.).
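
A rough sketch of that wrapper (illustrative only; none of these definitions exist yet):

struct TableModel{M} <: StatsBase.StatisticalModel
    data
    formulas::Tuple
    model::M
end
# generic fallback: fitting the wrapper just fits the wrapped model
StatsBase.fit!(tm::TableModel) = (StatsBase.fit!(tm.model); tm)
# packages override for their own model type when they need more control:
# StatsBase.fit!(tm::TableModel{MyModelType}) = ...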

It might be necessary to add an intermediate call to some kind of builder function to keep from eagerly evaluating the model matrices here in case packages want to do something weird with the formulas, again using a default fallback defined in StatsModels that does that step.

This would all require a bit of re-structuring of the whole API, but I think that's for the best. I think the most disruptive thing is making sure everyone follows a two-step model construction and fitting API.

kleinschmidt commented 6 years ago

This might also require storing kw arguments with formulas in a dict in the TableModel for this to work for use cases where the kw arguments need special behavior.

Nosferican commented 6 years ago

Considering the current implementation I would actually favor establishing the different components of the model universally. For example:

abstract type Formulae end
mutable struct EnhancedFormula <: Formulae
    response
    exogenous
    endogenous
    instruments
    fixedeffects
    weights
    panelid
    temporalid
    EnhancedFormula() = new()
end

Then ModelFrame / ModelMatrix, for example, can generate the model arrays (y, X, z, Z, D, etc.), and coefnames could be a Dict{Symbol,Vector{String}}. The basic macro would parse an enhanced-syntax formula into an EnhancedFormula.

y ~ x1 * x2 + x3 + (z1 + z2 ~ Z1 + Z2) + absorb(fe1 + fe2) +
    weights(weight) + cluster(PID + TID) + PID(County) + TID(Year)

One can overload Base.show to display it in a more conventional format, for example. If anyone wants to give it a try, I could implement it.

lbittarello commented 6 years ago

I find that EnhancedFormula is similar to Microdata, but it is less flexible.

kleinschmidt commented 6 years ago

I'm really opposed to that, and I think it actually shows why the current API should be changed: it's too inflexible. I don't want to have to specify every single possible additional part of the formula in StatsModels. I'd much rather tear things down and rebuild them to be flexible enough that people can extend them as they like.

Nosferican commented 6 years ago

How is it restrictive? Right now I can accomplish the same, with virtually anything I want, by hacking away at the current implementation... For example, for fixed effects I just deepcopy and setfield! the left-hand side to the response in order to use them in ModelFrame / ModelMatrix... For IV I construct my formula and switch sides for the endogenous variables, and give it a symbol on the left for the instruments. The idea with the enhanced formula is to make this easier, rather than each package having to hack away to build what it wants. The struct is modeled as incomplete, so if one doesn't need a field one can leave it undefined.
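
To illustrate the incomplete-initialization point with the EnhancedFormula sketch above:

ef = EnhancedFormula()
ef.response  = :y
ef.exogenous = :(x1 * x2 + x3)
isdefined(ef, :weights)  # false: unused fields simply stay undefined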

kleinschmidt commented 6 years ago

I see the formula DSL as a way of describing how columns in a tabular data store should be converted to numeric columns for modeling. How those vectors/matrices are interpreted is up to the particular model and should (IMO) be left out of the specification of formulae themselves.

An important thing that is maybe missing from this discussion is that we're planning to change the formula DSL to allow for arbitrary functions to be passed through; this would allow for special functions to be defined in packages that need them as well, which would obviate the need for adding them to the formula DSL.

nalimilan commented 6 years ago

It's restrictive because you need to list in advance all possible fields a model might need. You included temporalid there, but if you weren't in this discussion we would never have guessed somebody might need it one day. What will we do once somebody tries to implement a new kind of model? We don't want to make a new StatsModels release every time somebody needs an additional field.

Nosferican commented 6 years ago

An alternative is to have just the various archetypes with few assumptions. The current model assumes only one type of formula (lhs::Symbol, rhs::Union{Expr,Symbol,Integer}). EnhancedFormula is a child of the abstract Formulae, and ModelFrame / ModelMatrix would be defined for Formulae, so if a package has needs outside that scope or is better served by another structure, it just needs to create a struct with its own fields as a child of Formulae. The two alternatives are to do the minimum (the current Formula, letting people hack it to achieve what they need) or to allow for more advanced use while giving up some flexibility. For example, instead of a struct it could be a dictionary of Formula objects (of different types: standard, one-sided, multiple left) and ModelFrame would just "correctly parse" them, meaning the dataframe would keep all variables that appear in the union of the formulas and create the appropriate Vector{ModelFrame}.

Nosferican commented 6 years ago

@kleinschmidt, regarding your approach, what about in StatsModels

abstract type ModelSpecs end
modelbuilder(data::DataFrames.AbstractDataFrame, modelspecs::ModelSpecs) = 
        error("modelbuilder is not defined for $(typeof(modelspecs)).")

In the different packages,

mutable struct MyPkgModel <: StatsBase.RegressionModel
    # Some code
end
function StatsBase.fit!(obj::MyPkgModel)
    # Some code that fits the model
end
function StatsBase.coeftable(obj::MyPkgModel)
    # Some code that returns the summary of the model
end
mutable struct MyPkgModelSpecs <: StatsModels.ModelSpecs
    # Some code
end
function modelbuilder(data::DataFrames.AbstractDataFrame, modelspecs::MyPkgModelSpecs)
    # Some code that uses the data and MyPkgModelSpecs to return a MyPkgModel
end
macro myparser(args...)
    # Code to parse my model specifications
end

API (WLOG let @myparser in this case be @econmodel)

modelspecs = @econmodel(formula = y ~ x + (z1 ~ Z),
        estimator = FD, vcov = cluster(PID + TID))
model = modelbuilder(df, modelspecs)
fit!(model)
coeftable(model)

kleinschmidt commented 6 years ago

Yeah, that's something like what I meant by an "intermediate model building stage". The @model macro might lower to a constructor for ModelBuilder{MyModel}(...), which would have a default for ModelBuilder{<:Any} in StatsModels but could be overloaded if package authors wanted to do something fancy with the formulas captured by the macro.

One improvement to the formula language is something I've started to play around with in StreamModels, using Term types to represent each term in the formula. Then a formula itself could be a FormulaTerm, meaning you could (in principle) nest "formulas" within each other. So something like y ~ x + (z1 ~ Z) would lower to FormulaTerm(Placeholder(:y), MatrixTerm(Placeholder(:x), FormulaTerm(Placeholder(:z1), Placeholder(:Z)))). Then you can do what you want to with this intermediate representation at the model builder stage. The default would be to plug the data frame columns into the placeholders and spit out a response and model matrix for something like a GLM (in which case there would be an error when you have an embedded formula).
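
A self-contained sketch of that representation (illustrative types, not the actual StreamModels code):

abstract type Term end
struct Placeholder <: Term
    name::Symbol
end
struct MatrixTerm <: Term
    terms::Tuple
end
MatrixTerm(terms::Term...) = MatrixTerm(terms)
struct FormulaTerm <: Term
    lhs::Term
    rhs::Term
end

# y ~ x + (z1 ~ Z) as a tree with a FormulaTerm at the root
f = FormulaTerm(Placeholder(:y),
                MatrixTerm(Placeholder(:x),
                           FormulaTerm(Placeholder(:z1), Placeholder(:Z))))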

nalimilan commented 6 years ago

Though it would be nice to have a user-friendly API which would directly take a dataset and fit the model on it, in addition to a lower-level API which would allow constructing a ModelSpecs/ModelBuilder first and then calling fit on it separately. Maybe the user-friendly version could be provided by each package, e.g. with @econmodel. The lower-level version would only use generic functions like @model, modelbuilder and fit, but passing a type inheriting from StatsBase.StatisticalModel for dispatch.
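
Concretely, the two layers could look like this (all names hypothetical, following the discussion above):

# lower-level, generic: build first, then fit separately
spec  = @model(MyModel, y ~ x, weights = ~ w)
model = modelbuilder(df, spec)
fit!(model)

# user-friendly layer a package could provide on top of the same pieces
econmodel(df, spec) = fit!(modelbuilder(df, spec))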

Nosferican commented 6 years ago

My impressions after my initial play with DataFrames / StatsModels for Julia 0.7 are very positive. StatsModels.Formula seems a lot more flexible. For example, now I can have

myfunc(@formula(y ~ x1 + x2 & x3 + (z1 + z2 ~ Z1 + Z2) ~ fe1 + fe2), df, ...)

I no longer have to provide the parser, and I can easily get each element from the StatsModels.Formula. Not sure if it's possible, but I would like to have | work similarly to ~ à la R's multipart formulas; ~ works well for now, though. As for capturing model specifications, I would favor having a StatsModels.@modelspecs or StatsModels.@modeloptions. That way all packages can have their non-formula options captured by a common API. StatsModels.@modelspecs could be defined like

function parsearg(obj::Expr)
    args = obj.args
    @assert (obj.head == :(=)) & (length(args) == 2) "Model options should have form: Argument = Value"
    (args[1], args[2])
end
macro modelspecs(args...)
    Dict(parsearg.(args))
end

This way each package can have their function collect their formula, data, and model options using a common API. For example,

reg(@formula(y ~ x1 + (z1 ~ Z1)), df, @modelspecs(estimator = FirstDifference, vcov = HC1))

This also discourages packages from exporting structs that might overlap.

kleinschmidt commented 6 years ago

@Nosferican I've started to work on a prototype that's even more flexible in StreamModels. The first major difference that is relevant to this discussion is that the internal representation of formulae is as a tree of Term sub-types, with a FormulaTerm as the root (thus allowing sub-formulae). I'm still working out the details but this should allow for greater extensibility of the syntax (e.g., a package could add special parsing rules for whatever calls they'd like).

The second major difference (which is even more preliminary) is that it introduces a @model macro that captures all the arguments to a normal model constructor call, converts formulae to Formula objects, and stores all the positional and keyword arguments in a ModelBuilder struct. The ModelBuilder type is parametrized by the type of the model (or more generally the call), which allows for overloading default behavior by specializing on the model/call type. For instance, constructing an actual model happens by calling build(::ModelBuilder, datasource), but you can easily add a specialized method build(::ModelBuilder{MyModel}, datasource) to implement whatever transformation of the captured arguments (including formulae) that you want. The default (as I imagine it) would be to just evaluate all the formulae into numerical matrices/vectors (along the lines of what ModelMatrix(ModelFrame(f, df)) currently does), and plug them into the constructor in place of the formulae, along with the rest of the captured arguments. These "built" models could then be fit as usual (possibly wrapped in a ModelFrame-like type that holds onto the model builder, in order to do things like predicting from new data which require the formulae).
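
A compressed sketch of that architecture (types and signatures as imagined here, not a shipped API):

struct ModelBuilder{M}
    formulas::Vector{Formula}
    args::Tuple
    kwargs::Dict{Symbol,Any}
end

# default: evaluate each formula into numeric arrays and plug them into the constructor
build(b::ModelBuilder{M}, data) where {M} =
    M(map(f -> ModelMatrix(ModelFrame(f, data)).m, b.formulas)..., b.args...; b.kwargs...)

# a package that needs the raw formulae can intercept the same step:
# build(b::ModelBuilder{MyModel}, data) = MyModel(b.formulas, data, b.kwargs)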

I'm really keen to know what you think of that architecture, and whether it would allow for the kind of customization you need!

Nosferican commented 6 years ago

The rooted-tree structure seems similar to the native Expr, which I think works pretty well. I think we all agree we need a better terms representation moving forward. The Julia 0.6 version of my package uses that approach of buildmodel(df, @model(formula = y ~ x, estimator = OLS)), which uses an intermediate ModelBuilder struct that I then use to build MyStruct. I wasn't sure how the ModelBuilder adapts to each package (i.e., the type of the model by the call) and envisioned it as basically just a Dict{Symbol,Symbol} to pass to the function.

A few comments:

That said, with the new, more flexible StatsModels.@formula / StatsModels.Formula, I use a single function to parse the components and another macro to capture keyword arguments (e.g., this is the approach in Optim.options()).

Representation, Exogenous, Endogenous, Instruments, FixedEffects = formulaparser(formula)
function formulaparser(formula::StatsModels.Formula)
    formula = copy(formula)
    representation = string(formula)
    response = formula.lhs
    formula = formula.rhs
    if formula.args[1] == :(~)
        fixedeffects = StatsModels.Formula(response, formula.args[3])
        formula = formula.args[2]
    else
        fixedeffects = StatsModels.Formula(response, zero(Int64))
    end
    instrumental = false
    for idx ∈ eachindex(formula.args)
        if isa(formula.args[idx], Expr)
            if formula.args[idx].args[1] == :(~)
                endogenous = StatsModels.Formula(response, formula.args[idx].args[2])
                instruments = StatsModels.Formula(response, formula.args[idx].args[3])
                instrumental = true
                deleteat!(formula.args, idx)
                break
            end
        end
    end
    exogenous = StatsModels.Formula(response, formula)
    if !instrumental
        endogenous = StatsModels.Formula(response, zero(Int64))
        instruments = StatsModels.Formula(response, zero(Int64))
    end
    return (representation, exogenous, endogenous, instruments, fixedeffects)
end

A common issue with either approach is that it would need to call ModelFrame on each of the formulas... A ModelFrame that accepted multiple formulas and contrasts would be nice to have. For example, one that returns a Vector{ModelMatrix}.
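
A hedged sketch of such an entry point (allvars is a stand-in for whatever collects the variable names of a formula):

function modelmatrices(formulas::Vector{Formula}, df::AbstractDataFrame)
    # keep only rows that are complete across the union of all formula variables
    vars = union(map(allvars, formulas)...)
    sub  = df[completecases(df[vars]), :]
    # one ModelMatrix per formula, all built on the same rows
    [ModelMatrix(ModelFrame(f, sub)) for f in formulas]
end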

Nosferican commented 6 years ago

As for API, I agree with @matthieugomez that standardizing the way to use statistical models is the way to go. As I am unclear on how to dispatch the ModelBuilder, I am using this format for now...

modelspecs = MyPkg.@model(kw = value, kw = value,...)
model = buildmodel(df::DataFrames.AbstractDataFrame,
    modelspecs::StatsModels.AbstractModelSpecs)
coeftable(model::StatsBase.RegressionModel)

In StatsModels the only changes needed are:

abstract type AbstractModelSpecs end
modelbuilder(df::DataFrames.AbstractDataFrame,
    modelspecs::AbstractModelSpecs) =
    error("modelbuilder is not defined for $(typeof(modelspecs)).")

With this syntax, MyPkg.@model acts like the main command in Stata. It specifies what the regression model is and allows the arguments to match the command. For example, Hausman-Taylor would accept response + exogenous + (endogenous ~ ) vs. 2SLS, which would require response + exogenous + (endogenous ~ instruments). It is flexible, as the actual modelbuilder can be dispatched on captured arguments (e.g., estimator = logit).

Nosferican commented 6 years ago

After some talk a few days back on Slack, there is apparently an undesirable trend of using macros rather than direct builders. After that discussion I tried a different approach, which seems to be an improvement.

model = MyPkgModel(formula::StatsModels.Formula, data::DataFrames.AbstractDataFrame;
                   options::Dict{Symbol,Any} = Dict{Symbol,Any}())

where I defined

function parsearg(obj::Expr)
    args = obj.args
    @assert (obj.head == :(=)) & (length(args) == 2) "Model options should be specified using the following syntax: Argument = Value"
    return (args[1], args[2])
end
macro options(args...)
    Dict{Symbol,Any}(parsearg.(args))
end

An example is,

model = EconometricsModel(@formula(y ~ x1 + x2 + (z1 ~ z2) + (absorb = fe1) + (cluster = fe1)),
                          data,
                          options = @options(estimator = FD, λ = 0.1))
coeftable(model)

kleinschmidt commented 6 years ago

The options macro isn't necessary here; you can just define a function that splats keyword arguments into a dict: options(; kw...) = Dict(kw).

I'm afraid this discussion has gotten pretty far from where it started. Standardizing on a way of specifying (eagerly-evaluated) options for building a model might be useful, but that's a separate issue from how best to handle extracting elements from a tabular data source based on multiple arguments to a model constructor.

Also, what I'm imagining is using a macro as a convenience so you don't have to type @formula for every single argument (which would get tedious if you want to specify a lot of tiny one-sided formulae, which is my preferred solution for what @matthieugomez requested for things like weights, standard errors, etc.). That would just lower to a constructor for a ModelBuilder{ModelType} object which has default behavior that can be overloaded by packages that need to intercept the formulae before they're interpreted. There would be nothing stopping someone from writing out the more verbose literal form.

Nosferican commented 6 years ago

The idea of the macro is so people can use val instead of :val for symbol values.

For the many-component formula I just use response ~ exogenous + (endogenous ~ instruments) + (keyword = onesidedformula), where keyword can be absorb, cluster, weights, etc. This wasn't possible before, but it is feasible now. The default formula parser could just return a Dict{Symbol, StatsModels.Formula} for the keyword = onesidedformula components. The ModelBuilder could return a Dict{Symbol,Matrix{Float64}} with model matrices for each of the components. An alternative procedure could be the default for specific keywords; for example, absorb or cluster could use #37.
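
A sketch of that default parser step (assuming the Julia 0.6-era Expr layout, where a parenthesized keyword = value term arrives as Expr(:(=), key, value)):

function keywordparts!(rhs::Expr)
    parts = Dict{Symbol,Any}()
    filter!(rhs.args) do t
        if isa(t, Expr) && t.head == :(=)
            parts[t.args[1]] = t.args[2]  # e.g. :absorb => :(fe1 + fe2)
            return false                  # strip the term from the formula
        end
        return true
    end
    return parts
end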

kleinschmidt commented 6 years ago

julia> f(; kw...) = Dict(kw)
f (generic function with 1 method)

julia> f(x=1)
Dict{Symbol,Int64} with 1 entry:
  :x => 1

Nosferican commented 6 years ago

julia> f(estimator = FD, vce = OLS)
ERROR: UndefVarError: FD not defined

julia> f(estimator = :FD, vce = :OLS)
Dict{Symbol,Symbol} with 2 entries:
  :estimator => :FD
  :vce       => :OLS

julia> @options(estimator = FD, vce = OLS)
Dict{Symbol,Any} with 2 entries:
  :estimator => :FD
  :vce       => :OLS

kleinschmidt commented 6 years ago

Ah I misunderstood what you meant by symbols. This actually strikes me as a better example of "macro-creep" than having a multi-formula @model macro (it's pretty idiomatic to use symbols as options like this; or else to define an Enum type...)
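
For reference, the Enum alternative mentioned above avoids both the extra macro and the quoting issue, since enum members are ordinary values:

@enum Estimator FD OLS

f(; kw...) = Dict(kw)
f(estimator = FD)  # Dict(:estimator => FD); no quoting needed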

lbittarello commented 6 years ago

Microdata uses this keyword approach. The user inputs strings, which the builder internally combines into a formula. No new macros are necessary. The only difference with respect to response ~ exogenous + (endogenous ~ instruments) + (keyword = onesidedformula) is that you specify every element as a keyword, including the response and co.

kleinschmidt commented 6 years ago

@lbittarello do you have an example? I'm having trouble picturing this. That sounds like a case where a macro might be appropriate

lbittarello commented 6 years ago

Suppose that you want to run a linear instrumental regression. @Nosferican would write:

ModelFrame(@formula(y ~ (t ~ z) + x), data)

I write:

Microdata(data, response = "y", treatment = "t", instrument = "z", control = "x")

Microdata creates a model matrix and a dictionary. The dictionary maps each variable set into the columns of the model matrix. The regression goes:

Y = getvector(microdata, :response)
X = getmatrix(microdata, :treatment, :control)
Z = getmatrix(microdata, :instrument, :control)
β = (Z' * X) \ (Z' * Y)

I guess that Microdata is more verbose. But it's flexible. Suppose that you're implementing a semiparametric single-index model. You would ask the user for a response, discrete controls and continuous controls. No special syntax would be necessary.

lbittarello commented 6 years ago

It's useful too when models interpret components differently. For example, linear regression in Stata goes regress y t x, while inverse probability weighting is teffects ipw (y) (t x). Microdata would use the same syntax for both.

Nosferican commented 6 years ago

Would exogenous = "x1 * x2" be transformed to an Expr and then a StatsModels.Formula?

That can be managed as well: à la Stata, @formula(y ~ x1 + x2 + (z1 ~ 0)) would create the same Dict{Symbol,Matrix{Float64}}, but if the estimator uses only endogenous variables and no additional instruments, it can just use the model matrices differently to construct the modelmatrix and any transformations.

An additional thought is that sometimes those components are better served by a different representation than a Matrix{Float64}. For example, clusters are better as Vector{Vector{Vector{Int64}}} in the form Clusters[Dimension[Group[Observations]]] (see #37). Maybe two outputs, Dict{Symbol,Matrix{Float64}} and Dict{Symbol,Vector{Vector{Vector{Int64}}}}?
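
A sketch of building that structure from id columns (hypothetical helper; one inner vector of observation indices per group, one middle vector per clustering dimension):

function clustergroups(ids::AbstractVector)
    groups = Dict{eltype(ids),Vector{Int64}}()
    for (obs, id) in enumerate(ids)
        push!(get!(groups, id, Int64[]), obs)
    end
    collect(values(groups))  # Dimension[Group[Observations]]
end

clusters = [clustergroups(df[:PID]), clustergroups(df[:TID])]  # Clusters[Dimension[...]]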

lbittarello commented 6 years ago

I agree that clusters are better in a separate matrix. Covariance options currently receive a different treatment. It would be easy to designate other special keywords.

Nosferican commented 6 years ago

A few tools I have been developing, such as VCE and various routines in EconUtils, use a vector for the response and matrices for all the others, except for the fixed effects to be absorbed and the clusters for variance-covariance estimation; the last two use the Vector{Vector{Vector{Int64}}} form.

pdeffebach commented 5 years ago

There was some discussion about this in a GLM issue here. Linking it here because I think that @formula(y ~ x, cluster(z)), or something similar, with a ModelFrame object recording that, would be very helpful.