Open matthieugomez opened 7 years ago
I don't understand the use case for capturing variable names anywhere but in the actual construction of the model terms. Also, macros don't support keyword arguments.
I have three different use cases in FixedEffectModels; you can simply read the README. Or look at all the R packages that rely on multipart formulas.
Macros can work with keyword arguments, as they are just particular expressions, as in the @test syntax:
using Test  # Base.Test before Julia 0.7
@test 1 ≈ 0.99 atol = 1e-2
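A minimal sketch of what a macro receives when it is given such an argument. The macro name describe_args is hypothetical and exists only for this illustration; it is not part of Test or StatsModels. The check accepts both the `=` and `kw` expression heads, so it does not rely on which one the parser produces.

```julia
# Hypothetical macro, for illustration only: it reports how each argument arrives.
macro describe_args(args...)
    described = map(args) do a
        if a isa Expr && a.head in (:(=), :kw)
            # `name = value` reaches the macro as an unevaluated expression
            (name = a.args[1], value = a.args[2])
        else
            (name = nothing, value = a)
        end
    end
    # Return the description as a literal value
    return QuoteNode(collect(described))
end

captured = @describe_args(1 ≈ 0.99, atol = 1e-2)
```

Here captured[2] holds the name atol and the literal 0.01, while captured[1] holds the unevaluated comparison. Nothing has been evaluated yet, and the macro is free to decide what to do with each piece.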
I always thought passing symbols for variable names would be enough. That doesn't work for where, but I'm not sure I like the idea of supporting where in model fitting functions, as it sounds simpler to perform this step separately.
Maybe I will be clearer if I delve into FixedEffectModels. To estimate these kinds of models, an estimation command requires the user to specify (i) a formula, (ii) a set of high-dimensional fixed effects, and (iii) a set of variables to cluster on. A syntax that uses symbols for everything except the formula, as proposed by @nalimilan, would look like this:
fit(@formula(y~x1+x2), fe = (:x3, :x4), vcov = cluster(:x5, :x6))
The syntax looks clumsy. Moreover, it does not allow expressions such as x3&x4 in fe or vcov. That is, the syntax allows regressing on x3&x4, but does not allow clustering on x3&x4.
To avoid these problems, package developers in R have turned to the Extended Model Formula R package. For instance, to estimate models with high-dimensional fixed effects, the R package lfe uses the syntax:
felm(y~x1+x2 | x3&x4 + x4 | x6 + x7)
Although this syntax improves on the first one, I don't like it either because it is not clear what each part of the formula refers to. This is why I am pushing for a keyword syntax, something similar to
@formula(y~x1+x2, fe = x3&x4 + x4, vcov = cluster(x6 + x7))
While these examples focus on models with high-dimensional fixed effects (the ones I am familiar with), these syntax issues are more common. A lot of R packages rely on the Extended Model Formulas R package.
I guess what we could do is that a general macro like @model or @fit would transform every expression starting with ~ into a formula, so that you don't need to repeat @formula. But automatically transforming fe = x3 + x4 into a formula sounds difficult/impossible to me: it doesn't generalize, and yet that macro would have to be defined only once in StatsModels.
So the solutions I can imagine are like:
@fit(y ~ x1+x2, fe = (:x3, :x4), vcov = cluster(:x5, :x6))
@fit(y ~ x1+x2, fe = ~ x3 + x4, vcov = cluster(~ x5 + x6))
We should just get better about allowing function calls in formulas. (Last time I checked, we either didn't support that at all or only kind of did.) Then you can do what R does, for example, with offset in Poisson regression: @formula(y ~ a + b + offset(c)). In your case it'd be @formula(y ~ x1 + x2 + fe(x3, x4) + cluster(x6, x7)).
@ararslan I don't think this overloading would work. It would completely abuse mathematical notation: clustered variables refer to methods to compute standard errors, not variables to regress on. Moreover, what if there are functions named fe or cluster in the namespace?
Also, note that it is already possible to have function calls in formulas in R, yet this did not stop people from requiring multipart formulas.
So, would one of my examples suit your needs?
fit wouldn't even need to be a macro for the former example.
fit(@formula(y ~ x1 + x2), fe=(:x3, :x4), vcov=cluster(:x5, :x6), ...)
You just need a Symbol varargs method for cluster.
Yes, that would work. But why would you need to only capture the arguments prefixed by ~? The macro could also capture everything, i.e. something like
macro fit(x, args...)
    # Wrap the first argument in @formula and pass every other argument as a quoted expression
    esc(Expr(:call, :fit, :(@formula($x)), (Base.Meta.quot(a) for a in args)...))
end
Each package would then define the fit function.
But not all arguments must be formulas. And in some cases you explicitly need to refer to local objects rather than to data frame columns.
What about just using several @formula arguments?
@nalimilan I agree, but this problem is common to all dplyr-like data manipulation packages (like DataFramesMeta).
@andreasnoack Something like this would work indeed.
fit(df, @formula(y ~ x1), fe = @formula(x3&x4 + x4), weight = @formula(x5), vcov = cluster(@formula(x3+x4)))
It is quite verbose. I think I prefer Syntax 1
fit(df, @formula(y ~ x1), @fe(x3&x4 + x4), @weight(x5), @vcov(cluster(x3+x4)))
but it requires polluting the user space with macro definitions for each keyword argument.
Maybe we can come back to this issue once there is a commonly accepted way to handle expressions that may or may not use variable names (i.e. something akin to a dplyr-like package). I just wanted to have this issue somewhere — it would be great to nail the syntax of estimation commands.
I don't think the two issues are completely related (and that's good news since it makes it easier to fix the present one). Query macros need to deal with identifiers which most frequently refer to data frame columns, with a way to escape this occasionally (e.g. using $x, which was evoked at some point). fit/@fit is quite different, since most arguments do not refer to columns: data frame object, distribution family, number of iterations, etc. Arguments which have to refer to columns are more the exception than the rule. What we need is a way to make them not too painful to use, but not the default either. I'd say my proposals are reasonable compromises, though there are other solutions.
Right, I agree with this.
I am working on an econometrics package, and this would be nice for implementing estimators for panel data (cross-sectional unit, time unit, cluster, instruments, etc.).
@Nosferican Can you give some illustrations of your needs like @matthieugomez did above?
It would be quite similar.
fit(@formula(y ~ x1 + x2 + x3), df, @xt(x, t), family, link, estimator,
@instruments(x3 ~ z1 + z2), @cluster(x4), @weights(x5))
Another example of a package using expanded syntax in its formulas, besides FixedEffectModels.jl, is MixedModels.jl:
m = fit!(lmm(Yield ~ 1 + (1 | Batch), ds))
m2 = fit!(lmm(y ~ 1 + dept*service + (1|s) + (1|d), inst))
fm3 = fit!(lmm(Reaction ~ 1 + Days + (1+Days|Subject), slp))
OK, looks similar indeed.
I think the design should be as generic as possible, i.e. it shouldn't require particular models to define macros like @cluster or @weights, since these are going to clash between packages. So arguments referring to variables inside the data frame should either be formulas, or (a tuple of) symbols when a formula wouldn't be natural (e.g. @xt(x, t) can be xt=(:x, :t), or even x=:x, t=:t).
To avoid the tedious requirement to repeat @formula, the simplest and most systematic solution would be to implement a @fit macro which would replace all arguments starting with ~ with a formula. People would still be able to use @formula explicitly with the fit function if they prefer. Any objections to that approach?
(I would say the case of MixedModels is slightly different, since I don't think @dmbates has any complaints about the (1|x) syntax: these terms are really part of the same equation, contrary to clusters or instruments.)
@nalimilan Your solution would work but I don't really like this syntax. Using ~ as a prefix to capture an expression is an R quirk and I would be glad to have it go away. That was the whole point of switching to @formula.
What about Syntax 2? In this syntax, @formula accepts multiple arguments, which are all captured. The captured arguments are then passed to a function _formula, which can be extended by package developers.
# Code in StatsModels
# @formula(ex, kw1 = arg1, kw2 = arg2) is transformed into
# _formula(:(ex), kw1 = :(arg1), kw2 = :(arg2))
using StatsModels
macro formula(args...)
    Expr(:call, :_formula, (transform_expr(a) for a in args)...)
end
function transform_expr(ex)
    if isa(ex, Expr) && ex.head == :(=)
        # Keyword argument: keep the name, quote the right-hand side
        return Expr(:kw, ex.args[1], Base.Meta.quot(ex.args[2]))
    else
        return Base.Meta.quot(ex)
    end
end
# StatsModels defines _formula with 1 argument
function _formula(ex)
    StatsModels.Formula(ex.args[1], ex.args[2])
end
# Code in, say, FixedEffectModels
# Each package can redefine _formula with specific keyword arguments
function _formula(ex::Union{Expr, Symbol}; fe::Union{Expr, Symbol} = :nothing, weight::Union{Expr, Symbol} = :nothing, vcov::Union{Expr, Symbol} = :nothing)
    println("it works")
end
@formula(y ~ x1 + x2 + x3, fe = x1 + x2, weight = x3, vcov = cluster(x4 + x5))
The full syntax with fit is then something like:
fit(@formula(y ~ x1 + x2 + x3, fe = x1 + x2, weight = x3, vcov = cluster(x4 + x5)), maxiter = 100)
The problem with that syntax is that it puts inside a common @formula call arguments which are unrelated: weights has nothing to do with vcov (or at least not more than with maxiter). I agree that ~ is a bit annoying when there is no LHS, but that's a small issue in comparison. Also, that symbol is not only used to escape an expression (which can be done using :() in Julia), it also indicates that the argument uses the formula syntax, i.e. terms and interactions can be specified using +, & and *. To me, that's a good reason to use the same symbol everywhere. If an argument doesn't need this syntax, it should just be a (list of) symbol(s).
I believe the difference would most likely be whether we want @formula to be specific to a model specification of the form response ~ variables, with variables allowed to perform contrasts specified through +, *, and &, or whether we want to generalize it. Stata is a purer formula system which operates with variable contrasts, versus R, which has enhanced formulas for on-the-fly feature generation and multipart formulas (i.e., it has a single lhs and the rhs is an array divided by the | operator). Examples include Stata's i., c., #, ##, L, and D operators and R's log, poly, I, *, and multipart features.
Stata: reg y X1 i.X2 c.X3 X4#X5 X6##X7 L.X8 D.X9, cl(X10)
R: log(y) ~ poly(X1, 2) + I(X2 >= 2) + X3 + X4 * X5 | X3 ~ Z1 + Z2
While I personally prefer to compute the features outside the formula, in certain cases, such as when computing the margins of a logistic regression (logit, multinomial, or ordinal), it is essential to keep track of which variables are related and how, which is done by passing the transformation in the formula. Easier cases such as instrumental variables can be traced, but for harder cases it is almost impossible unless the feature engineering occurs in the formula. Other times it is just for convenience (e.g., weights as a function of the residuals).
While R uses the | operator to create multi-part rhs, I am a fan of keywords (I pray every night for Julia to get keyword argument mapping like in R). I believe we should follow the Stata approach: a core expanded formula followed by a comma and keyword arguments.
I guess my vote is in favor of keeping the formula to the bare minimum (indispensable and related terms) and passing additional arguments as keywords. For example, StatsModels.ModelFrame uses the formula, but takes the contrast dictionary as an additional argument.
@nalimilan Syntax 2 is just a better version of R multipart formulas. In these multipart formulas, all arguments specifying variable names are put together and separated by |. I just propose to add keyword arguments to identify each part more clearly. All arguments inside @formula (i.e. all the arguments that refer to variable names in Syntax 2) correspond to the model specification (i.e. weight, how to compute standard errors, etc). In some sense the macro @formula could be called @model.
One issue with your proposal is that it defines two different syntaxes to refer to variable names: symbols for arguments that always refer to only one variable (like weight) vs formulas for arguments that may refer to multiple variable names (like fe). I think the resulting syntax looks clumsy:
@fit(y~x1, fe = ~x2, weight = :w)
Or it would require forcing people to use ~varname rather than :varname.
Now, if you really don't like the idea of splitting the arguments between @formula and fit (i.e. Syntax 2), we can still use something like Syntax 3, i.e. a macro @fit that captures all arguments. Each developer can then deal with evaluating each argument separately depending on the keyword argument.
Or it would require to impose people to use ~varname rather than :varname.
I guess we could also allow both if it turns out to be confusing.
Your arguments against Syntax 2 are well taken, although I tend to disagree when you say that weights has nothing to do with vcov (compared to maxiter). All arguments inside @formula (i.e. all the arguments that refer to variable names in Syntax 2), tend to correspond to the model specification (i.e. weight, how to compute standard errors, etc). In some sense the macro @formula could be called @model.
Maybe they "tend to", but that's not always the case. For example, the model family for GLMs is clearly related to the model specification, but it's not a formula. And we haven't considered more exotic models yet.
Now, if you really don't like the idea of splitting the arguments between @formula and fit (i.e. Syntax 2), we can still use something like Syntax 3, i.e. a macro @fit that captures all arguments. Each developer can then deal with evaluating each argument separately depending on the keyword argument
I don't think that works either. The macro needs to decide whether to transform arguments before it knows the type of the model, so it cannot use dispatch to let each package choose a behavior.
The macro would just capture all the expressions and pass them to the function. The function would then specify which argument should be evaluated. Is there a problem with that?
Yes, the function would receive expressions which refer to a different scope. For example, maxiter=n would be transformed into maxiter=:n, but n wouldn't exist in the function's scope (actually, the scope of the module), or it could be shadowed by a local variable, by another keyword argument... More generally, calling eval inside functions isn't recommended in Julia.
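A small sketch of the scoping problem described here. The function run_with is a hypothetical stand-in for a fitting function that receives a captured expression and evaluates it; the variable name is chosen so that it does not clash with any global.

```julia
# Hypothetical stand-in for a fitting function that evaluates a captured expression.
# eval always works at module (global) scope, never in the caller's local scope.
run_with(maxiter_expr) = Core.eval(@__MODULE__, maxiter_expr)

function caller()
    local_iterations_demo = 10           # exists only inside `caller`
    try
        # The captured symbol is looked up as a global, which does not exist
        return run_with(:local_iterations_demo)
    catch err
        return err
    end
end

outcome = caller()
```

Under these assumptions outcome is an UndefVarError rather than 10: the local variable is invisible to the function that evaluates the expression.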
Correct me if I'm wrong, but I think we can avoid this issue. @fit(fe = z, maxiter = n) would call _fit(fe = :z, maxiter = :n), which would return the expression :(fit(fe = :z, maxiter = n)). This expression would then be evaluated in the user environment.
Just to give a rough sketch:
# Defined in StatsModels
macro fit(args...)
    # Each argument arrives as a `name = value` expression; pass the raw pieces to _fit
    kwargs = [a.args[1] => a.args[2] for a in args]
    # _fit returns an expression, which is then evaluated in the caller's scope
    esc(_fit(; kwargs...))
end
# Defined in my package
function _fit(; fe = :nothing, maxiter = :nothing)
    # Quote fe (a variable name); leave maxiter unquoted so it is evaluated as usual
    Expr(:call, :fit, Expr(:kw, :fe, Base.Meta.quot(fe)), Expr(:kw, :maxiter, maxiter))
end
# @fit(fe = x, maxiter = n) is then transformed into fit(fe = :x, maxiter = n)
Again, that's the same problem: your _fit function will conflict with the one defined in another package since its signature does not have anything specific to your package.
The keyword arguments are specific to my package, no? And how is this problem the same as the one regarding the function scope?
AFAIK Julia doesn't care about the names of the function arguments, just their types. So unless you have a function argument that accepts only a type defined in your package, it will conflict.
...and keyword arguments do not even participate in dispatch.
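A minimal sketch of that conflict, with both definitions placed in one module to stand in for two packages extending the same shared function. The name _shared_fit is hypothetical.

```julia
# Stand-ins for two packages adding a method to the same shared function.
# Both methods have the same positional signature, so the second replaces the first.
_shared_fit(ex; fe = nothing) = "package A"
_shared_fit(ex; vcov = nothing) = "package B"

n_methods = length(methods(_shared_fit))
result = _shared_fit(:(y ~ x))

# After the second definition, the `fe` keyword is no longer accepted
fe_error = try
    _shared_fit(:(y ~ x); fe = :z)
    nothing
catch err
    err
end
```

Only one method survives, and a call that uses the keyword of the first definition fails. A positional argument of a package-specific type would be needed for dispatch to tell the two apart.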
Thanks for the precision. I guess I have to think a little bit more about it.
One possibility would be that each package defines its own macro, rather than relying on a unique @fit defined in StatsModels. So, for instance, in my package, I would define the macro @fefit:
@fefit(y~x, fe = z, maxiter = 10)
Another possibility, which I guess is the most general, would be to denote with a particular syntax the arguments that should be captured:
# => instead of =
@fit formula => y~x fe => z maxiter = 10
# [=] instead of =
@fit [y~x] [fe = z] maxiter = 10
# All non keyword arguments are captured
@fit y~x fe(z) maxiter = 10
# Like Syntax 2
fit(@model(y~x, fe = z), maxiter = 10)
A last possibility is simply to use symbols everywhere to refer to variable names.
fit(:y ~ :x, fe = :z + :v, maxiter = 10)
There would be no need for macros anymore. On the minus side, it would require to define stuff like
+(a::Symbol, b::Symbol) = :($a + $b)
&(a::Symbol, b::Symbol) = :($a & $b)
*(a::Symbol, b::Symbol) = :($a * $b)
It would also preclude the use of functions in formulas (i.e. :y ~ f(:x)).
To avoid these two points, a related possibility would be to use symbols to refer to variable names but still use a macro,
@fit(:y~:x, fe = :z + :v, maxiter = 10)
A big advantage of the macro approach over the symbols-only one is that it allows for operations such as contrasts (&, &&) and eventually feature engineering directly in the formula.
I would be happy to have @model which would parse a @formula and manage keyword arguments with a => assign/capture operator. This approach might even help packages that rely heavily on generating a model which then is passed to a fit function.
Each package can of course provide its own macro, but it would be nice to be able to harmonise the model fitting interfaces. Regarding propositions 2 and 3, I really don't see the point of reinventing a syntax for formulas.
Perhaps @dmbates could help unify the tribes.
Gee, thanks, @ararslan
I was trying to keep my head down.
@nalimilan out of curiosity, what kind of problem would it create to capture every argument (except for, say, the dataframe), and require users to directly insert values in the AST, i.e.
n = 10
@fit(df, y~x, maxiter = $n)
The main problem would be that it's not standard so it would be surprising/confusing. Note that even df would have to be written as $df for that syntax to be fully consistent, and the same would apply e.g. to the model family ($(Binomial())). So you're trading ~ in formulas with $ for other arguments, which is probably not a win.
Thanks. To avoid the consistency problem you mention, what about a fit(df, ModelType, @model(...)) syntax, where every argument in model is captured?
fit(GeneralizedLinearModel, df, @model(y~x, Poisson(), maxiter = 10))
Note that $ is required only when using @model within a function, i.e. during non-interactive use (vs using ~ every time):
function f(modelfamily, n)
    fit(GeneralizedLinearModel, df, @model(y~x, $(modelfamily), $(n)))
end
I don't like the idea of a syntax that only works in the global scope (and I suspect I'm not the only one).
The syntax does not only work in the global scope. Exactly like @formula, @model requires expression interpolation when modifying one of its arguments.
I have now defined the macro @model in FixedEffectModels here. It accepts a formula and a set of keyword arguments, and simply returns a Model which contains the captured expressions:
using FixedEffectModels
@model y ~ x fe = v1 + v2 weights = w
# Formula: y ~ x
# fe: v1 + v2
# weights: w
I wish StatsModels had a similar macro.
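A minimal sketch of how such a capturing macro could be written. It assumes nothing about the actual FixedEffectModels implementation: the names CapturedModel and @capture_model are hypothetical, and the macro only stores expressions without building a formula.

```julia
# Hypothetical container for the captured (unevaluated) expressions
struct CapturedModel
    formula::Any
    options::Dict{Symbol,Any}
end

# Hypothetical macro: the first argument is the formula, the rest are `name = expression`
macro capture_model(formula, kwargs...)
    opts = Dict{Symbol,Any}()
    for kw in kwargs
        if !(kw isa Expr && kw.head in (:(=), :kw))
            error("expected `name = expression`, got $kw")
        end
        opts[kw.args[1]] = kw.args[2]   # store the expression, do not evaluate it
    end
    # Embed the captured pieces as literal values in the returned call
    return :(CapturedModel($(QuoteNode(formula)), copy($opts)))
end

m = @capture_model(y ~ x, fe = v1 + v2, weights = w)
```

Here m.formula holds the expression y ~ x, and m.options maps fe to v1 + v2 and weights to w. None of the names y, x, v1, v2 or w need to exist in the calling scope, since nothing is evaluated.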
Latecomer to the party. I've been having the same issues, but I wasn't aware of this discussion. My solution consisted of separating data processing from estimation and using keywords. For example:
data = Microdata(DF, response = "y", control = "1 + x1 + x2", treatment = "t", instrument = "z", weight = "w")
results = fit(IV, data, method = "TSLS")
It seems similar to Syntax 2 above.
By the way, I don't like the y ~ x structure. I understand that it's standard, but it's awkward. Consider linear IV and IV weighting, for instance. In the first case, y is regressed on x and t, so the formula makes sense. In the second case, z is regressed on x to construct weights for a regression of y on t, so the formula doesn't make as much sense. They shouldn't need different inputs though...
It seems similar to Syntax 2 above.
Well, not completely, since variables and formulas are passed as strings, so that Microdata is a standard function, not a macro. Since quotes make it possible to distinguish between formulas and standard arguments (like DF), this solution is more similar to my proposal, which is to use a macro and indicate that an argument is a formula via ~ (or possibly : for single variables).
By the way, I don't like the y ~ x structure. I understand that it's standard, but it's awkward. Consider linear IV and IV weighting, for instance. In the first case, y is regressed on x and t, so the formula makes sense. In the second case, z is regressed on x to construct weights for a regression of y on t, so the formula doesn't make as much sense. They shouldn't need different inputs though...
Sorry, I'm not familiar with the example you are using, so I don't understand the problem. Could you elaborate? Note that you don't need to use y ~ x if that doesn't make sense for your use case: formulas are relatively flexible.
In my latest iteration I have decided to follow @matthieugomez's scheme for the core formula:
response ~ exogenous + (endogenous ~ instruments)
and, for fixed effects, a one-sided formula
absorb = fe1 + fe2 * fe3
which is similar to Stata's reghdfe.
For the "model" I am using a @model macro which also includes the data,
model = @model(data = df, formula = y ~ x1 + (z1 ~ Z1), absorb = x + t, weights = w)
@model captures the values and puts them in a dictionary, which is passed to my struct generator; it parses the dictionary and populates the package struct <: StatsBase.RegressionModel. StatsBase.fit! is extended with a method for the package struct.
I'm still not convinced that a macro capturing all arguments is a good idea. That means @model will only work for your package, and if any other package wants a similar mechanism it will have to use another name or experience conflicts. Also, as I noted, it doesn't allow passing references to local variables (except via ugly hacks for the global scope only). Really, requiring people to write ~ x + t and ~w or :w doesn't sound like a big deal and it would make the macro much more standard.
A possible fix for not having to have a qualified @model is to have it be part of StatsModels, or DataFrames for the moment, as:
macro model(args...)
    args = Dict(StatsModels.parsemodelargs.(args))
    model = StatsModels.modelbuilder(args)
    return model
end
function StatsModels.modelbuilder(args::Dict{Symbol,Any})
    # Some code that dispatches the module's modelbuilder based on
    # get(args, :module, :Default)
end
StatsBase.fit!(model::ModuleStruct)
and dispatch on the module the user wants to use. The only argument I can think of that could or should be local is, I think, the data. I use,
function parsemodelarg(obj::Expr)
    args = getfield(obj, :args)
    field = first(args)
    value = last(args)
    if field == :data
        tmp = Expr(:call, :(x -> return x), :data)
        tmp2 = getfield(tmp, :args)
        tmp2[2] = value
        setfield!(tmp, :args, tmp2)
        value = eval(Main, tmp)
    end
    ...
    return (field, value)
end
which could potentially allow for accessing that local reference in other spaces besides the Main module space (global). Again, I think that would only be used in very limited instances, but I am unsure whether it can be used in all the applications you can think of. Otherwise, one can keep the data apart and do something like FixedEffectModels.@model
@reg(df, @model(...))
A limitation of using ~ x + t is that @formula and ModelFrame expect a :lhs that is not nothing in order to work, which limits its use.
Anything that depends on eval is a no-go for any standard package like StatsModels (and it's not a great idea in general). And you can't know in advance which argument is going to be a formula, and which one should be a local variable -- except by hardcoding their list in StatsModels, but we don't want to restrict what packages can do.
A limitation for using ~ x + t is that @formula and ModelFrame expect a :lhs that is not nothing in order to work limiting its use.
Maybe we can work on finding a good solution with formulas? We could allow an empty LHS, or use :(...), or find another symbol...
How about a @model macro that takes a data frame and a model constructor and replaces all the ~-quoted arguments with model matrices constructed from the data frame?
Something like @model df GLM(z ~ 1+x*y, family=:binomial)
Several arguments of a function fit typically refer to dataframe variables which are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high-dimensional fixed effects, mixed models, etc. It would be nice to think about the best syntax to refer to these variables. I have thought about three potential syntaxes:
@model(expr, args...)
.@with
in DataFramesMeta.jl), i.e.@fit(expr1, expr2, args...)
An additional benefit is that agreeing on a syntax would help to standardize the names of commonly used arguments like "weights", "vcov" and "where" across different packages that do statistical estimation. Enforcing these keyword arguments across different statistical estimations, like in Stata, could do a lot to improve the user experience.