matthieugomez opened 7 years ago

I believe @kleinschmidt has implemented one-sided formulas, e.g. `@formula(~ x + t)`, in a branch somewhere.
I hesitate to get too far into macro purgatory where it becomes unclear what things are doing.
I think it's the only way to deal with this though, right?
Basically,
```julia
abstract type ModelBuilder end

macro model(data::DataFrames.AbstractDataFrame, mb::ModelBuilder)
    # Packages will create a model builder to use with @model in order to parse
    # arguments and return their PkgStruct <: StatsBase.RegressionModel,
    # to be used in StatsBase.fit!(model::PkgStruct)
end
```
Formulas will allow one-sided versions for those that require them, such as absorbing variables. Maybe expand `ModelFrame` / `ModelMatrix` to take multiple formulas? That way it could fix the issues of missing values and return a list of `ModelFrame`s / `ModelMatrix`es or something?
Brainstorm:

```julia
mypart = mymodelbuilder(formula = y ~ x1 + x2 & x3,  # edited
                        iv = z1 ~ z2 + z3,
                        absorb = ~ fe1 + fe2)
model = StatsModels.@model df mypart
StatsBase.fit!(model)
StatsBase.coeftable(model)
```
The point of using a macro like `@model` is that it would transform all formula arguments (identified by their `~`) into `Formula` objects, which can then be treated like normal objects. So you wouldn't need to specify `@formula` for each argument (you don't need a second macro like `@model` if you do that).
For correctly parsing the `Expr` with `ex.head == :(~)`, it would need to work with:

- Arguments without `ex.head == :(~)` would be passed as `Expr`?
- Anything with `ex.head == :(~)` would be treated as if it were inside a `@formula` call.

Then we can discuss what kind of formulas should be supported. Anything else would be treated as a standard argument to a standard function.
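As a rough sketch of that detection step: note that current Julia parses `y ~ x` as `Expr(:call, :~, :y, :x)`, so in practice the check is on the call head and first argument rather than `ex.head == :(~)`. The helper name below is illustrative only:

```julia
# Detect "formula-like" macro arguments: expressions that are a call to ~.
# In current Julia, :(y ~ x) parses as Expr(:call, :~, :y, :x).
isformula(ex) = ex isa Expr && ex.head == :call &&
                !isempty(ex.args) && ex.args[1] == :~

isformula(:(y ~ x))              # formula-like argument
isformula(:(family = Binomial))  # plain keyword argument
```

A macro could then map over its captured arguments, wrapping the formula-like ones and passing the rest through untouched.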
My half-baked proposal is that the `@model` macro is called with a data frame (or other tabular source) and a model constructor expression, and lowers to a container type kind of like `DataFrameStatisticalModel`, but which extracts all the arguments of the constructor that involve `~`, replacing them in the constructor call with model matrices (and storing the parsed formula data structures themselves). To be concrete, that would look like

```julia
@model df GLM(y ~ x, family = Binomial)
# becomes
let formulas = (Formula(:y, :x),)
    TableModel{GLM}(df, formulas,
                    GLM(modelmatrix(formulas[1], df), family = Binomial))
end
```
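A minimal self-contained sketch of that container idea (all names are hypothetical, mirroring the proposal above; a local stub stands in for `StatsBase.fit!` so the snippet runs standalone):

```julia
# Hypothetical container holding the data source, the parsed formulas,
# and the constructed inner model.
struct TableModel{T}
    data
    formulas::Tuple
    model::T
end

# Stand-in for StatsBase.fit! so the sketch has no dependencies.
fit!(model) = model

# Generic fallback: fitting a TableModel just fits the wrapped model.
fit!(m::TableModel) = (fit!(m.model); m)
```

A package could then specialize `fit!(::TableModel{MyModelType})` without touching the generic fallback.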
The `@model` macro and the `TableModel` type are defined in StatsModels, as are generic `fit!` methods for `TableModel{<:Any}` that just call `fit!` on the model field. If you want to define special `fit!` behavior that does something other than just use the model matrices generated by the formulas, then you can depend on StatsModels and define methods for `fit!(::TableModel{MyModelType})` (and other methods from StatsBase etc.).
It might be necessary to add an intermediate call to some kind of builder function to keep from eagerly evaluating the model matrices here in case packages want to do something weird with the formulas, again using a default fallback defined in StatsModels that does that step.
This would all require a bit of re-structuring of the whole API, but I think that's for the best. I think the most disruptive thing is making sure everyone follows a two-step model construction and fitting API.
This might also require storing kw arguments with formulas in a dict in the `TableModel` for this to work for use cases where the kw arguments need special behavior.
Considering the current implementation I would actually favor establishing the different components of the model universally. For example:
```julia
abstract type Formulae end

mutable struct EnhancedFormula <: Formulae
    response
    exogenous
    endogenous
    instruments
    fixedeffects
    weights
    panelid
    temporalid
    EnhancedFormula() = new()
end
```
Then `ModelFrame` / `ModelMatrix`, for example, can generate the model arrays (`y`, `X`, `z`, `Z`, `D`, etc.). For example, `coefnames` could be a `Dict{Symbol,Vector{String}}`. The basic macro would parse an enhanced-syntax formula into an `EnhancedFormula`.
```julia
y ~ x1 * x2 + x3 + (z1 + z2 ~ Z1 + Z2) + absorb(fe1 + fe2) +
    weights(weight) + cluster(PID + TID) + PID(County) + TID(Year)
```
One can overload `Base.show` to display it in a more conventional format, for example.
If anyone wants to give it a try I could implement it.
I find that `EnhancedFormula` is similar to `Microdata`, but it is less flexible.
I'm really opposed to that, and I think it actually shows why the current API should be changed: it's too inflexible. I don't want to have to specify every single possible additional part of the formula in StatsModels. I'd much rather tear things down and rebuild them to be flexible enough that people can extend them as they like.
How is it restrictive? Right now I can accomplish the same with virtually anything I want by hacking away at the current implementation... For example, for fixed effects I just `deepcopy` and `setfield!` the left-hand side to the response in order to use them in `ModelFrame` / `ModelMatrix`... For IV I construct my formula and switch sides for the endogenous variables and give it a symbol on the left for the instruments. The idea with the enhanced formula is to make it easier, rather than each package having to hack away to build what they want. The struct is modeled as incomplete, so if one doesn't need a field one can leave it undefined.
I see the formula DSL as a way of describing how columns in a tabular data store should be converted to numeric columns for modeling. How those vectors/matrices are interpreted is up to the particular model and should (IMO) be left out of the specification of formulae themselves.
An important thing that is maybe missing from this discussion is that we're planning to change the formula DSL to allow for arbitrary functions to be passed through; this would allow for special functions to be defined in packages that need them as well, which would obviate the need for adding them to the formula DSL.
It's restrictive because you need to list in advance all possible fields a model might need. You included `temporalid` there, but if you weren't in this discussion we would never have guessed somebody might need it one day. What will we do once somebody tries to implement a new kind of model? We don't want to make a new StatsModels release every time somebody needs an additional field.
An alternative is to have just the various archetypes with few assumptions. The current model assumes only one type of formula (`lhs::Symbol, rhs::Union{Expr,Symbol,Integer}`). The enhanced formula is a child of the abstract `Formulae`, and `ModelFrame` / `ModelMatrix` would be defined for `Formulae`, so if a package has some needs outside that scope or is better served by another structure, it just needs to create a struct with its fields as a child of `Formulae`. The two alternatives are to do the minimum (the current `Formula`, and allow hacking it to achieve what someone needs) or to allow for more advanced use while giving up some flexibility. For example, instead of a struct it could be a dictionary of `Formula`s (of different types: standard, one-sided, multiple left) and just have `ModelFrame` "correctly parse" them. Meaning the data frame will keep all variables that appear in the union of the formulas and create the appropriate `Vector{ModelFrame}`.
@kleinschmidt, regarding your approach, what about, in StatsModels:

```julia
abstract type ModelSpecs end

modelbuilder(data::DataFrames.AbstractDataFrame, modelspecs::ModelSpecs) =
    error("modelbuilder is not defined for $(typeof(modelspecs)).")
```
In the different packages:

```julia
mutable struct MyPkgModel <: StatsBase.RegressionModel
    # Some code
end

function StatsBase.fit!(obj::MyPkgModel)
    # Some code that fits the model
end

function StatsBase.coeftable(obj::MyPkgModel)
    # Some code that returns the summary of the model
end

mutable struct MyPkgModelSpecs <: StatsModels.ModelSpecs
    # Some code
end

function modelbuilder(data::DataFrames.AbstractDataFrame, modelspecs::MyPkgModelSpecs)
    # Some code that uses the data and MyPkgModelSpecs to return a MyPkgModel
end

macro myparser(args...)
    # Code to parse my model specifications
end
```
API (WLOG let `@myparser` in this case be `@econmodel`):

```julia
modelspecs = @econmodel(formula = y ~ x + (z1 ~ Z),
                        estimator = FD, vcov = cluster(PID + TID))
model = modelbuilder(df, modelspecs)
fit!(model)
coeftable(model)
```
Yeah, that's something like what I meant by an "intermediate model building stage". The `@model` macro might lower to a constructor for `ModelBuilder{MyModel}(...)`, which would have a default for `ModelBuilder{<:Any}` in StatsModels but could be overloaded if package authors wanted to do something fancy with the formulas captured by the macro.
One improvement to the formula language is something I've started to play around with in StreamModels, using `Term` types to represent each term in the formula. Then a formula itself could be a `FormulaTerm`, meaning you could (in principle) nest "formulas" within each other. So something like `y ~ x + (z1 ~ Z)` would lower to `FormulaTerm(Placeholder(:y), MatrixTerm(Placeholder(:x), FormulaTerm(Placeholder(:z1), Placeholder(:Z))))`. Then you can do what you want with this intermediate representation at the model builder stage. The default would be to plug the data frame columns into the placeholders and spit out a response and model matrix for something like a GLM (in which case there would be an error when you have an embedded formula).
Though it would be nice to have a user-friendly API which would directly take a dataset and fit the model on it, in addition to a lower-level API which would allow constructing a `ModelSpecs`/`ModelBuilder` first and then calling `fit` on it separately. Maybe the user-friendly version could be provided by each package, e.g. with `@econmodel`. The lower-level version would only use generic functions like `@model`, `modelbuilder` and `fit`, but passing a type inheriting from `StatsBase.StatisticalModel` for dispatch.
My impressions after my initial play with DataFrames / StatsModels for Julia 0.7 are very positive. `StatsModels.Formula` seems a lot more flexible. For example, now I can have

```julia
myfunc(@formula(y ~ x1 + x2 & x3 + (z1 + z2 ~ Z1 + Z2) ~ fe1 + fe2), df, ...)
```

I no longer have to provide the parser, and I can easily get each element from the `StatsModels.Formula`. Not sure if possible, but I would like to have `|` work similarly to `~` à la R's multipart formulas; `~` works well for now. As for capturing model specifications, I would favor having a `StatsModels.@modelspecs` or `StatsModels.@modeloptions`. That way all packages can have their non-formula options captured by a common API. `StatsModels.@modelspecs` could be defined like:
```julia
function parsearg(obj::Expr)
    @assert (obj.head == :(=)) && (length(obj.args) == 2) "Model options should have form: Argument = Value"
    (obj.args[1], obj.args[2])
end

macro modelspecs(args...)
    Dict(parsearg.(args))
end
```
This way each package can have their function collect their formula, data, and model options using a common API. For example,

```julia
reg(@formula(y ~ x1 + (z1 ~ Z1)), df, @modelspecs(estimator = FirstDifference, vcov = HC1))
```

This also discourages packages from exporting structs that might overlap.
@Nosferican I've started to work on a prototype that's even more flexible in StreamModels. The first major difference that is relevant to this discussion is that the internal representation of formulae is a tree of `Term` sub-types, with a `FormulaTerm` as the root (thus allowing sub-formulae). I'm still working out the details, but this should allow for greater extensibility of the syntax (e.g., a package could add special parsing rules for whatever calls they'd like).
The second major difference (which is even more preliminary) is that it introduces a `@model` macro that captures all the arguments to a normal model constructor call, converts formulae to `Formula` objects, and stores all the positional and keyword arguments in a `ModelBuilder` struct. The `ModelBuilder` type is parametrized by the type of the model (or more generally the call), which allows for overloading default behavior by specializing on the model/call type. For instance, constructing an actual model happens by calling `build(::ModelBuilder, datasource)`, but you can easily add a specialized method `build(::ModelBuilder{MyModel}, datasource)` to implement whatever transformation of the captured arguments (including formulae) that you want. The default (as I imagine it) would be to just evaluate all the formulae into numerical matrices/vectors (along the lines of what `ModelMatrix(ModelFrame(f, df))` currently does), and plug them into the constructor in place of the formulae, along with the rest of the captured arguments. These "built" models could then be `fit` as usual (possibly wrapped in a `ModelFrame`-like type that holds onto the model builder, in order to do things like predicting from new data, which require the formulae).
The treelike structure seems similar to the native `Expr`, which I think works pretty well. I think we all agree we need a better term representation moving forward. The Julia 0.6 version of my package uses that approach of a `buildmodel(df, @model(formula = y ~ x, estimator = OLS))`, which uses an intermediate `ModelBuilder` struct which I then use to build `MyStruct`. I wasn't sure how the `ModelBuilder` adapts to each package (i.e., the type of the model by the call) and envisioned it as basically just a `Dict{Symbol,Symbol}` to pass to the function.
A few comments:

- The formula representation is used in output such as `StatsBase.coeftable`. Thus, keeping a copy of the formula is sometimes beneficial instead of just replacing it.
- Some components work better with `AbstractDataFrame`s than with matrices.
- Clusters are better as `Vector{Vector{Int64}}`, which are used for operations by clusters/groups, rather than getting a `Matrix`. When there are interactions I still need to make the model matrix and then get the indices for the levels... If there were an internal contrasts option for that output, it would be appreciated.

That said, with the new more flexible `StatsModels.@formula` / `StatsModels.Formula` I use a single function to parse the components and another macro to capture keyword arguments (e.g., this is the approach in `Optim.options()`).
```julia
function formulaparser(formula::StatsModels.Formula)
    formula = copy(formula)
    representation = string(formula)
    response = formula.lhs
    formula = formula.rhs
    if formula.args[1] == :(~)
        fixedeffects = StatsModels.Formula(response, formula.args[3])
        formula = formula.args[2]
    else
        fixedeffects = StatsModels.Formula(response, zero(Int64))
    end
    # Defaults, hoisted before the loop so the assignments below are in
    # function scope rather than local to the loop body
    endogenous = StatsModels.Formula(response, zero(Int64))
    instruments = StatsModels.Formula(response, zero(Int64))
    for idx ∈ eachindex(formula.args)
        if isa(formula.args[idx], Expr) && formula.args[idx].args[1] == :(~)
            endogenous = StatsModels.Formula(response, formula.args[idx].args[2])
            instruments = StatsModels.Formula(response, formula.args[idx].args[3])
            deleteat!(formula.args, idx)
            break
        end
    end
    exogenous = StatsModels.Formula(response, formula)
    return (representation, exogenous, endogenous, instruments, fixedeffects)
end

Representation, Exogenous, Endogenous, Instruments, FixedEffects = formulaparser(formula)
```
A common issue with either approach is that it would need to call `ModelFrame` on each of the formulas... A `ModelFrame` that accepted multiple formulas and contrasts would be nice to have. For example, one that returns a `Vector{ModelMatrix}`.
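One building block for such a multi-formula `ModelFrame` would be collecting the union of variables across all formulas before dropping rows with missings. A dependency-free sketch over raw expressions (the helper name is made up for illustration):

```julia
# Collect every data variable mentioned in a formula expression,
# skipping call heads like :+, :&, :~.
variables(x) = Set{Symbol}()   # literals contribute nothing
variables(s::Symbol) = Set([s])
function variables(ex::Expr)
    args = ex.head == :call ? ex.args[2:end] : ex.args
    isempty(args) ? Set{Symbol}() : reduce(union, variables.(args))
end

variables(:(y ~ x1 + x2 & x3))  # union of variables in one formula
```

The union over several formulas then tells the frame which columns to keep and which rows to drop before building each `ModelMatrix`.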
As for the API, I agree with @matthieugomez that standardizing the way to use statistical models is the way to go. As I am unclear on how to dispatch the `ModelBuilder`, I am using this format for now...

```julia
modelspecs = MyPkg.@model(kw = value, kw = value, ...)
model = buildmodel(df::DataFrames.AbstractDataFrame,
                   modelspecs::StatsModels.AbstractModelSpecs)
coeftable(model::StatsBase.RegressionModel)
```
In StatsModels the only changes needed are:

```julia
abstract type AbstractModelSpecs end

modelbuilder(df::DataFrames.AbstractDataFrame,
             modelspecs::AbstractModelSpecs) =
    error("modelbuilder is not defined for $(typeof(modelspecs)).")
```
With this syntax, `MyPkg.@model` acts like the main command in Stata. It specifies what the regression model is and allows the arguments to match the command. For example, Hausman-Taylor would accept `response + exogenous + (endogenous ~ )` vs. 2SLS, which would require `response + exogenous + (endogenous ~ instruments)`. It is flexible, as the actual `modelbuilder` can be dispatched on captured arguments (e.g., `estimator = logit`).
After some talk a few days back on Slack, apparently there is an undesirable trend of using macros rather than direct builders. After that discussion I tried a different approach which seems to be an improvement.
```julia
model = MyPkgModel(formula::StatsModels.Formula, data::DataFrames.AbstractDataFrame;
                   options::Dict{Symbol,Any} = Dict{Symbol,Any}())
```
where I defined

```julia
function parsearg(obj::Expr)
    @assert (obj.head == :(=)) && (length(obj.args) == 2) "Model options should be specified using the following syntax: Argument = Value"
    return (obj.args[1], obj.args[2])
end

macro options(args...)
    Dict{Symbol,Any}(parsearg.(args))
end
```
An example is:

```julia
model = EconometricsModel(@formula(y ~ x1 + x2 + (z1 ~ z2) + (absorb = fe1) + (cluster = fe1)),
                          data,
                          options = @options(estimator = FD, λ = 0.1))
coeftable(model)
```
The options macro isn't necessary here; you can just define a function that splats keyword arguments into a dict: `options(kw...) = Dict(kw)`.
I'm afraid this discussion has gotten pretty far from where it started. Standardizing on a way of specifying (eagerly-evaluated) options for building a model might be useful, but that's a separate issue from how best to handle extracting elements from a tabular data source based on multiple arguments to a model constructor.
Also, what I'm imagining is using a macro as a convenience so you don't have to type `@formula` for every single argument (which would get tedious if you want to specify a lot of tiny one-sided formulae, which is my preferred solution for what @matthieugomez requested for things like weights, standard errors, etc.). That would just lower to a constructor for a `ModelBuilder{ModelType}` object which has default behavior that can be overloaded by packages that need to intercept the formulae before they're interpreted. There would be nothing stopping someone from writing out the more verbose literal form.
The idea of the macro is so people can use `val` instead of `:val` for symbol values.
For the many-components formula I just use `response ~ exogenous + (endogenous ~ instruments) + (keyword = onesidedformula)`, where keyword can be `absorb`, `cluster`, `weights`, etc. This wasn't possible before, but is feasible now. The default formula parser could just return a `Dict{Symbol, StatsModels.Formula}` for the `keyword = onesidedformula` components. The `ModelBuilder` could return a `Dict{Symbol,Matrix{Float64}}` of model matrices for each of the components. An alternative procedure could be the default for specific keywords; for example, `absorb` or `cluster` could use #37.
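A rough sketch of that default parse, as pure `Expr` manipulation with no StatsModels dependency (the helper name is made up):

```julia
# Extract (keyword = onesidedformula) components from the right-hand
# side of a captured formula expression, as keyword => rhs expression.
function keyword_parts(rhs::Expr)
    parts = Dict{Symbol,Any}()
    for arg in rhs.args
        if arg isa Expr && arg.head == :(=)
            parts[arg.args[1]] = arg.args[2]
        end
    end
    parts
end

rhs = :(x1 + x2 + (absorb = fe1 + fe2) + (cluster = pid))
keyword_parts(rhs)  # Dict(:absorb => :(fe1 + fe2), :cluster => :pid)
```

A real parser would additionally wrap each extracted expression in a one-sided formula and strip the `(keyword = ...)` terms from the remaining right-hand side.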
```julia
julia> f(; kw...) = Dict(kw)
f (generic function with 1 method)

julia> f(x = 1)
Dict{Symbol,Int64} with 1 entry:
  :x => 1

julia> f(estimator = FD, vce = OLS)
ERROR: UndefVarError: FD not defined

julia> f(estimator = :FD, vce = :OLS)
Dict{Symbol,Symbol} with 2 entries:
  :estimator => :FD
  :vce       => :OLS

julia> @options(estimator = FD, vce = OLS)
Dict{Symbol,Any} with 2 entries:
  :estimator => :FD
  :vce       => :OLS
```
Ah, I misunderstood what you meant by symbols. This actually strikes me as a better example of "macro creep" than having a multi-formula `@model` macro (it's pretty idiomatic to use symbols as options like this; or else to define an `Enum` type...).
`Microdata` uses this keyword approach. The user inputs strings, which the builder internally combines into a formula. No new macros are necessary. The only difference with respect to `response ~ exogenous + (endogenous ~ instruments) + (keyword = onesidedformula)` is that you specify every element as a keyword, including the response and co.
@lbittarello do you have an example? I'm having trouble picturing this. That sounds like a case where a macro might be appropriate
Suppose that you want to run a linear instrumental regression. @Nosferican would write:

```julia
ModelFrame(@formula(y ~ (t ~ z) + x), data)
```

I write:

```julia
Microdata(data, response = "y", treatment = "t", instrument = "z", control = "x")
```
`Microdata` creates a model matrix and a dictionary. The dictionary maps each variable set to the columns of the model matrix. The regression goes:

```julia
Y = getvector(microdata, :response)
X = getmatrix(microdata, :treatment, :control)
Z = getmatrix(microdata, :instrument, :control)
β = (Z' * X) \ (Z' * Y)
```
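To make the algebra concrete, here is a self-contained numeric sketch of that IV step, with synthetic data and plain arrays standing in for `getvector`/`getmatrix`:

```julia
using Random, LinearAlgebra

Random.seed!(1)
n = 10_000
z = randn(n)              # instrument
t = z .+ 0.5 .* randn(n)  # treatment, correlated with the instrument
y = 2.0 .* t .+ randn(n)  # outcome; true treatment coefficient is 2
X = hcat(t, ones(n))      # treatment + intercept
Z = hcat(z, ones(n))      # instrument + intercept
β = (Z' * X) \ (Z' * y)   # just-identified IV estimate
```

With a just-identified system `Z'X` is square, so the backslash solves the moment conditions exactly, and `β[1]` recovers the treatment effect (approximately 2 here).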
I guess that `Microdata` is more verbose. But it's flexible. Suppose that you're implementing a semiparametric single-index model. You would ask the user for a `response`, `discrete` controls and `continuous` controls. No special syntax would be necessary.
It's useful too when models interpret components differently. For example, linear regression in Stata goes `regress y t x`, while inverse probability weighting is `teffects ipw (y) (t x)`. `Microdata` would use the same syntax for both.
Would `exogenous = "x1 * x2"` then be transformed to an `Expr` and then a `StatsModels.Formula`?
That can be managed as well: à la Stata, `@formula(y ~ x1 + x2 + (z1 ~ 0))` would create the same `Dict{Symbol,Matrix{Float64}}`, but if the estimator uses only endogenous variables and no additional instruments, it can just use the model matrices differently to construct the model matrix and any transformations.
An additional thought is that sometimes those components are better served by a different representation than a `Matrix{Float64}`. For example, clusters are better as `Vector{Vector{Vector{Int64}}}` in the form of `Clusters[Dimension[Group[Observations]]]` (see #37). Maybe two outputs, `Dict{Symbol,Matrix{Float64}}` and `Dict{Symbol,Vector{Vector{Vector{Int64}}}}`?
I agree that clusters are better in a separate matrix. Covariance options currently receive a different treatment. It would be easy to designate other special keywords.
There was some discussion about this in a GLM issue here. Linking because I think that `@formula(y ~ x, cluster(z))` or something similar, with a `ModelFrame` object recording it, would be very helpful.
Several arguments of a `fit` function typically refer to dataframe variables which are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high-dimensional fixed effects, mixed models, etc. It would be nice to think about the best syntax to refer to these variables. I have thought about three potential syntaxes:

- one-sided formulas passed as arguments;
- a macro that wraps the whole call, i.e. `@model(expr, args...)`;
- a macro that evaluates expressions within the scope of the data frame (like `@with` in DataFramesMeta.jl), i.e. `@fit(expr1, expr2, args...)`.

An additional benefit is that agreeing on a syntax would help to standardize the names of commonly used arguments like "weights", "vcov", "where" across different packages that do statistical estimations. Enforcing these keyword arguments across different statistical estimations, like in Stata, could do a lot to improve the user experience.