JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia

Multipart Formula #21

Open matthieugomez opened 7 years ago

matthieugomez commented 7 years ago

Several arguments of a function fit typically refer to dataframe variables which are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high dimensional fixed effects, mixed models, etc.

It would be nice to think about the best syntax to refer to these variables. I have thought about three potential syntaxes:

  1. Define a macro for each argument that needs to capture variable names
    fit(df, @formula(y ~ x1), @weight(x2), @vcov(cluster(x3+x4)), @where(x5 >= 0), maxiter = 100)
  2. Define a model macro that accepts multiple arguments, i.e. @model(expr, args...).
    fit(df, @model(y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0)), maxiter = 100)
  3. Define a fit macro that accepts multiple arguments (syntax closest to reg in Stata and to @with in DataFramesMeta.jl), i.e. @fit(expr1, expr2, args...)
    @fit df y ~ x1 weight = x2 vcov = cluster(x3+x4) where = (x5 >= 0) maxiter = 100
    # or (either would work)
    @fit(df, y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0), maxiter = 100)

An additional benefit is that agreeing on a syntax would help to standardize the names of commonly used arguments like "weights", "vcov", and "where" across different packages that do statistical estimation. Enforcing these keyword arguments across different estimation commands, as in Stata, could do a lot to improve the user experience.

ararslan commented 7 years ago

I don't understand the use case for capturing variable names anywhere but in the actual construction of the model terms. Also, macros don't support keyword arguments.

matthieugomez commented 7 years ago

I have three different use cases in FixedEffectModels; you can simply read the README. Or look at all the R packages that rely on multipart formulas. Macros can work with keyword arguments, as they are just particular expressions, as in the @test syntax:

using Base.Test  # on Julia 1.0 and later: using Test
@test 1 ≈ 0.99 atol = 1e-2
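A minimal sketch of the mechanism, assuming a hypothetical macro named `@capture_kwargs` (it is not part of StatsModels or of any package mentioned in this thread): each `key = value` argument reaches the macro as an ordinary unevaluated expression, so the macro can collect the right-hand sides itself.

```julia
# Hypothetical macro for illustration only; the name is an assumption.
# Each `key = value` argument arrives as an unevaluated Expr, so the
# right-hand side can be stored without being evaluated.
macro capture_kwargs(args...)
    entries = Expr[]
    for a in args
        # Accept both heads so the sketch does not depend on how the
        # parser tags `key = value` inside a macro call.
        if a isa Expr && a.head in (:(=), :kw)
            key, value = a.args[1], a.args[2]
            push!(entries, :($(QuoteNode(key)) => $(QuoteNode(value))))
        end
    end
    return :(Dict{Symbol,Any}($(entries...)))
end

# Positional arguments (here the formula) are ignored by this sketch.
captured = @capture_kwargs(y ~ x1, weight = x2, vcov = cluster(x3 + x4), maxiter = 100)
```

Here `captured[:vcov]` holds the expression `cluster(x3 + x4)` rather than the result of calling `cluster`, which is what lets a package decide later how to interpret it.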

nalimilan commented 7 years ago

I always thought passing symbols for variable names would be enough. That doesn't work for where, but I'm not sure I like the idea of supporting where in model fitting functions, as it sounds simpler to perform this step separately.

matthieugomez commented 7 years ago

Maybe I will be clearer if I delve into FixedEffectModels. To estimate these kinds of models, an estimation command requires the user to specify (i) a formula, (ii) a set of high dimensional fixed effects, and (iii) a set of variables to cluster on. A syntax that uses symbols for everything except the formula, as proposed by @nalimilan, would look like this:

fit(@formula(y~x1+x2), fe = (:x3, :x4), vcov = cluster(:x5, :x6))

The syntax looks clumsy. Moreover, this syntax does not allow expressions such as x3&x4 in fe or vcov. That is, the syntax allows regressing on x3&x4, but does not allow clustering on x3&x4.

To avoid these problems, package developers in R have turned to the Extended Model Formula R package. For instance, to estimate models with high dimensional fixed effects, the R package lfe uses the syntax:

felm(y~x1+x2 | x3&x4 + x4 | x6 + x7)

Although this syntax improves on the first one, I don't like it either because it is not clear what each part of the formula refers to. This is why I am pushing for a keyword syntax, something similar to

@formula(y~x1+x2, fe = x3&x4 + x4, vcov = cluster(x6 + x7))

While these examples focus on models with high dimensional fixed effects (the ones I am familiar with), these syntax issues are more general. A lot of R packages rely on the Extended Model Formulas R package.

nalimilan commented 7 years ago

I guess what we could do is that a general macro like @model or @fit would transform every expression starting with ~ into a formula, so that you don't need to repeat @formula. But automatically transforming fe = x3 + x4 into a formula sounds difficult/impossible to me: it doesn't generalize, and yet that macro would have to be defined only once in StatsModels.

So the solutions I can imagine are like:

@fit(y ~ x1+x2, fe = (:x3, :x4), vcov = cluster(:x5, :x6))
@fit(y ~ x1+x2, fe = ~ x3 + x4, vcov = cluster(~ x5 + x6))
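A rough sketch of the second form, under several assumptions: `SketchFormula`, `wrap_formulas`, `@fit_sketch` and `fit_sketch` are hypothetical names, not StatsModels API, and the one-sided formula is written with explicit parentheses, `~(x3 + x4)`, so the sketch does not depend on how the parser groups a bare leading `~`. Arguments containing a `~` call are captured; everything else is evaluated normally in the caller's scope.

```julia
# Hypothetical stand-in for a formula type; not the StatsModels type.
struct SketchFormula
    lhs::Any   # `nothing` for a one-sided formula
    rhs::Any
end

# True for a call to `~` with one (one-sided) or two (two-sided) operands.
is_tilde(ex) = ex isa Expr && ex.head == :call && ex.args[1] == :~ && length(ex.args) in (2, 3)

# Recursively replace every `~` call with a SketchFormula constructor call.
function wrap_formulas(ex)
    if is_tilde(ex)
        lhs = length(ex.args) == 3 ? ex.args[2] : nothing
        return :(SketchFormula($(QuoteNode(lhs)), $(QuoteNode(ex.args[end]))))
    elseif ex isa Expr
        return Expr(ex.head, map(wrap_formulas, ex.args)...)
    else
        return ex
    end
end

macro fit_sketch(args...)
    call = Expr(:call, :fit_sketch)
    for a in args
        if a isa Expr && a.head in (:(=), :kw)
            push!(call.args, Expr(:kw, a.args[1], wrap_formulas(a.args[2])))
        else
            push!(call.args, wrap_formulas(a))
        end
    end
    return esc(call)   # non-formula arguments are resolved in the caller's scope
end

# Hypothetical fit function that simply returns what it received.
fit_sketch(f::SketchFormula; fe = nothing, maxiter = 10) = (formula = f, fe = fe, maxiter = maxiter)

n = 50
result = @fit_sketch(y ~ x1 + x2, fe = ~(x3 + x4), maxiter = n)
```

In this sketch `maxiter = n` is an ordinary keyword argument, so `n` is looked up where the macro is called, while `fe` arrives as a captured one-sided formula.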

ararslan commented 7 years ago

We should just get better about allowing function calls in formulas. (Last time I checked, we either didn't support that at all or only kind of did.) Then you can do what R does, for example with offset in Poisson regression: @formula(y ~ a + b + offset(c)). In your case it'd be @formula(y ~ x1 + x2 + fe(x3, x4) + cluster(x6, x7)).

matthieugomez commented 7 years ago

@ararslan I don't think this overloading would work. It would completely abuse mathematical notation: clustered variables refer to methods to compute standard errors, not variables to regress on. Moreover, what if there are functions named fe or cluster in the namespace? Also, note that it is already possible to have function calls in formulas in R, yet this did not prevent people from requiring multipart formulas.

nalimilan commented 7 years ago

So, would one of my examples suit your needs?

ararslan commented 7 years ago

fit wouldn't even need to be a macro for the former example.

fit(@formula(y ~ x1 + x2), fe=(:x3, :x4), vcov=cluster(:x5, :x6), ...)

You just need a Symbol varargs method for cluster.
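As a sketch of what such a method could look like (the `ClusterSpec` container is a hypothetical name used only for illustration; the actual type in any package may differ):

```julia
# Hypothetical container for the clustering variables.
struct ClusterSpec
    vars::Vector{Symbol}
end

# Varargs method: accepts any number of column names given as Symbols.
cluster(vars::Symbol...) = ClusterSpec(Symbol[vars...])

spec = cluster(:x5, :x6)
```

Because `cluster` is an ordinary function here, `fit` needs no macro: the column names are plain values that the fitting code can look up in the data frame.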

matthieugomez commented 7 years ago

Yes, that would work. But why would you need to only capture the arguments prefixed by ~? The macro could also capture everything, i.e. something like

macro fit(x, args...)
    Expr(:call, :fit, :(@formula($(esc(x)))), (Base.Meta.quot(args[i]) for i in 1:length(args))...)
end

Each package would then define the fit function.

nalimilan commented 7 years ago

But not all arguments must be formulas. And in some cases you explicitly need to refer to local objects rather than to data frame columns.

andreasnoack commented 7 years ago

What about just using several @formula arguments?

matthieugomez commented 7 years ago

@nalimilan I agree, but this problem is common to all dplyr-like data manipulation (like DataFramesMeta).

@andreasnoack Something like this would work indeed.

fit(df, @formula(y ~ x1), fe = @formula(x3&x4 + x4), weight = @formula(x5), vcov = cluster(@formula(x3+x4)))

It is quite verbose. I think I prefer Syntax 1:

fit(df, @formula(y ~ x1), @fe(x3&x4 + x4), @weight(x5), @vcov(cluster(x3+x4)))

but it requires polluting the user's namespace with macro definitions for each keyword argument.

matthieugomez commented 7 years ago

Maybe we can come back to this issue once there is a commonly accepted way to handle expressions that may or may not use variable names (i.e. something akin to a dplyr-like package). I just wanted to have this issue somewhere — it would be great to nail the syntax of estimation commands.

nalimilan commented 7 years ago

I don't think the two issues are completely related (and that's good news since it makes it easier to fix the present one). Query macros need to deal with identifiers which most frequently refer to data frame columns, with a way to escape this occasionally (e.g. using $x, which was mentioned at some point). fit/@fit is quite different, since most arguments do not refer to columns: data frame object, distribution family, number of iterations, etc. Arguments which have to refer to columns are more the exception than the rule. What we need is a way to make them not too painful to use, but not the default either. I'd say my proposals are reasonable compromises, though there are other solutions.

matthieugomez commented 7 years ago

Right, I agree with this.

Nosferican commented 7 years ago

I am working on an econometrics package, and this would be nice for implementing estimators for panel data (cross sectional unit, time unit, cluster, instruments, etc.).

nalimilan commented 7 years ago

@Nosferican Can you give some illustrations of your needs like @matthieugomez did above?

Nosferican commented 7 years ago

It would be quite similar.

fit(@formula(y ~ x1 + x2 + x3), df, @xt(x, t), family, link, estimator,
    @instruments(x3 ~ z1 + z2), @cluster(x4), @weights(x5))

Another example of a package using expanded syntax in its formulas, besides FixedEffectModels.jl, is MixedModels.jl:

m = fit!(lmm(Yield ~ 1 + (1 | Batch), ds))
m2 = fit!(lmm(y ~ 1 + dept*service + (1|s) + (1|d), inst))
fm3 = fit!(lmm(Reaction ~ 1 + Days + (1+Days|Subject), slp))

nalimilan commented 7 years ago

OK, looks similar indeed.

I think the design should be as generic as possible, i.e. it shouldn't require particular models to define macros like @cluster or @weights, since these are going to clash between packages. So arguments referring to variables inside the data frame should either be formulas, or (a tuple of) symbols when a formula wouldn't be natural (e.g. @xt(x, t) can be xt=(:x, :t), or even x=:x, t=:t).

To avoid the tedious requirement to repeat @formula, the simplest and most systematic solution would be to implement a @fit macro which would replace all arguments starting with ~ with a formula. People would still be able to use @formula explicitly with the fit function if they prefer. Any objections to that approach?

(I would say the case of MixedModels is slightly different, since I don't think @dmbates has any complaints about the (1|x) syntax: these terms are really part of the same equation, contrary to clusters or instruments.)

matthieugomez commented 7 years ago

@nalimilan Your solution would work but I don't really like this syntax. Using ~ as a prefix to capture an expression is an R quirk and I would be glad to have it go away. That was the whole point of switching to @formula.

What about Syntax 2? In this syntax, @formula accepts multiple arguments, that are all captured. The captured arguments are then passed to a function _formula, which can be extended by package developers.

# Code in StatsModels
# @formula(ex, kw1 = arg1, kw2 = arg2) is transformed into
# _formula(:(ex), kw1 = :(arg1), kw2 = :(arg2))
using StatsModels
macro formula(args...)
    Expr(:call, :_formula, (transform_expr(a) for a in args)...)
end
# Quote each argument; `key = value` becomes a keyword argument with a quoted value
function transform_expr(ex)
    if isa(ex, Expr) && ex.head == :(=)
        return Expr(:kw, ex.args[1], Base.Meta.quot(ex.args[2]))
    else
        return Base.Meta.quot(ex)
    end
end
# StatsModels defines _formula with 1 argument
function _formula(ex)
    # For :(y ~ x), args[1] is the ~ operator, args[2] the LHS and args[3] the RHS
    StatsModels.Formula(ex.args[2], ex.args[3])
end

# Code in, say, FixedEffectModels
# Each package can redefine _formula with specific keyword arguments
function _formula(ex::Union{Expr, Symbol}; fe::Union{Expr, Symbol} = :nothing, weight::Union{Expr, Symbol} = :nothing, vcov::Union{Expr, Symbol} = :nothing)
    println("it works")
end
@formula(y ~ x1 + x2 + x3, fe = x1 + x2, weight = x3, vcov = cluster(x4 + x5))

The full syntax with fit is then something like:

fit(@formula(y ~ x1 + x2 + x3, fe = x1 + x2, weight = x3, vcov = cluster(x4 + x5)), maxiter = 100)

nalimilan commented 7 years ago

The problem with that syntax is that it puts inside a common @formula call arguments which are unrelated: weights has nothing to do with vcov (or at least not more than with maxiter). I agree that ~ is a bit annoying when there is no LHS, but that's a small issue in comparison. Also, that symbol is not only used to escape an expression (which can be done using :() in Julia), it also indicates that the argument uses the formula syntax, i.e. terms and interactions can be specified using +, & and *. To me, that's a good reason to use the same symbol everywhere. If an argument doesn't need this syntax, it should just be a (list of) symbol(s).

Nosferican commented 7 years ago

I believe the difference would most likely be whether we want @formula to be specific to a model specification of the form response ~ variables, with contrasts on the variables specified through +, *, and &, or whether we want to generalize it. Stata has a purer formula system which operates on variables and contrasts, whereas R has enhanced formulas for on-the-fly feature generation and multipart formulas (i.e., a single LHS, and an RHS that is an array divided by the | operator). Examples include Stata's i., c., #, ##, L, and D operators and R's log, poly, I, *, and multipart features.

Stata: reg y X1 i.X2 c.X3 X4#X5 X6##X7 L.X8 D.X9, cl(X10)
R: log(y) ~ poly(X1, 2) + I(X2 >= 2) + X3 + X4 * X5 | X3 ~ Z1 + Z2

While I personally prefer to compute the features outside the formula, in certain cases, such as when computing the margins of a logistic regression (logit, multinomial, or ordinal), it is essential to keep track of which variables are related and how, which is done by passing the transformation in the formula. Easier cases such as instrumental variables can be traced, but for harder cases it is almost impossible unless the feature engineering occurs in the formula. Other times it is just for convenience (e.g., weights as a function of the residuals).

While R uses the | operator to create a multipart RHS, I am a fan of keywords (I pray every night for Julia to get keyword argument mapping like in R). I believe in a Stata-like approach: a core expanded formula followed by a comma and keyword arguments.

Guess my vote is in favor of keeping the formula with the bare minimum (indispensable and related) and pass additional arguments as keywords. For example, StatsModels.ModelFrame uses the formula, but passes the contrast dictionary as an additional argument.

matthieugomez commented 7 years ago

@nalimilan Syntax 2 is just a better version of R multipart formula. In these multipart formulas, all arguments specifying variable names are put together and separated by |. I just propose to add keyword arguments to identify each part more clearly. All arguments inside @formula (i.e. all the arguments that refer to variable names in Syntax 2) correspond to the model specification (i.e. weight, how to compute standard errors, etc). In some sense the macro @formula could be called @model.

One issue with your proposal is that it defines two different syntaxes to refer to variable names: symbols for arguments that always refer to only one variable (like weight) vs formulas for arguments that may refer to multiple variable names (like fe). I think the resulting syntax looks clumsy:

@fit(y~x1, fe = ~x2, weight = :w)

Or it would require forcing people to use ~varname rather than :varname.

Now, if you really don't like the idea of splitting the arguments between @formula and fit (i.e. Syntax 2), we can still use something like Syntax 3, i.e. a macro @fit that captures all arguments. Each developer can then deal with evaluating each argument separately depending on the keyword argument.

nalimilan commented 7 years ago

> Or it would require forcing people to use ~varname rather than :varname.

I guess we could also allow both if it turns out to be confusing.

> Your arguments against Syntax 2 are well taken, although I tend to disagree when you say that weights has nothing to do with vcov (compared to maxiter). All arguments inside @formula (i.e. all the arguments that refer to variable names in Syntax 2), tend to correspond to the model specification (i.e. weight, how to compute standard errors, etc). In some sense the macro @formula could be called @model.

Maybe they "tend to", but that's not always the case. For example, the model family for GLMs is clearly related to the model specification, but it's not a formula. And we haven't considered more exotic models yet.

> Now, if you really don't like the idea of splitting the arguments between @formula and fit (i.e. Syntax 2), we can still use something like Syntax 3, i.e. a macro @fit that captures all arguments. Each developer can then deal with evaluating each argument separately depending on the keyword argument.

I don't think that works either. The macro needs to decide whether to transform arguments before it knows the type of the model, so it cannot use dispatch to let each package choose a behavior.

matthieugomez commented 7 years ago

The macro would just capture all the expressions and pass them to the function. The function would then specify which argument should be evaluated. Is there a problem with that?

nalimilan commented 7 years ago

Yes, the function would receive expressions which refer to a different scope. For example, maxiter=n would be transformed into maxiter=:n, but n wouldn't exist in the function's scope (actually, the scope of the module), or it could be shadowed by a local variable, by another keyword argument... More generally, calling eval inside functions isn't recommended in Julia.
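A small illustration of the scoping problem, as a sketch with hypothetical names (`evaluate_captured` stands in for a package function that receives a captured expression and evaluates it):

```julia
n = 1   # global binding in the current module

# Stand-in for a package function that evaluates a captured expression.
# eval always works at module (global) scope, never in a caller's local scope.
evaluate_captured(ex) = Core.eval(@__MODULE__, ex)

function caller()
    n = 100                         # local variable, invisible to eval
    return evaluate_captured(:n)    # looks up the global `n` instead
end

seen = caller()
```

The caller intended `n` to mean its local value, but the evaluated expression can only see the global binding, which is the kind of silent mismatch described above.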

matthieugomez commented 7 years ago

Correct me if I'm wrong, but I think we can avoid this issue. @fit(fe = z, maxiter = n) would call _fit(fe = :z, maxiter = :n) which would return the expression :(fit(fe = :z, maxiter = n)). This expression would then be evaluated in the user environment.

Just to give a rough sketch:

# Defined in StatsModels
macro fit(args...)
    _fit((transform_expr(a) for a in args)...)
end

function transform_expr(ex)
    if isa(ex, Expr) && ex.head == :(=)
        return Expr(:kw, ex.args[1], Base.Meta.quot(ex.args[2]))
    else
        return Base.Meta.quot(ex)
    end
end

# Defined in my package
function _fit(fe = :nothing, maxiter = :nothing)
    Expr(:call, :fit, Expr(:kw, :fe, Base.Meta.quot(fe)), Expr(:kw, :maxiter, maxiter))
end

# @fit(fe = x, maxiter = n) is then transformed into fit(fe = :x, maxiter = n)

nalimilan commented 7 years ago

Again, that's the same problem: your _fit function will conflict with the one defined in another package, since its signature does not have anything specific to your package.

matthieugomez commented 7 years ago

The keyword arguments are specific to my package, no? And how is this problem the same as the one regarding the function scope?

ararslan commented 7 years ago

AFAIK Julia doesn't care about the names of the function arguments, just their types. So unless you have a function argument that accepts only a type defined in your package, it will conflict.

nalimilan commented 7 years ago

...and keyword arguments do not even participate in dispatch.
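A minimal demonstration of the conflict, with hypothetical names throughout: two definitions that differ only in their keyword arguments are the same method, so the second replaces the first, whereas dispatching on a type owned by each package keeps them apart.

```julia
# Two "packages" define the same helper with different keyword names.
_fit_sketch(; fe = nothing) = "package A"
_fit_sketch(; weights = nothing) = "package B"   # same positional signature: replaces the first

# Dispatching on a type owned by each package avoids the clash.
struct PackageAModel end
struct PackageBModel end
fit_by_type(::Type{PackageAModel}; fe = nothing) = "package A"
fit_by_type(::Type{PackageBModel}; weights = nothing) = "package B"

n_methods = length(methods(_fit_sketch))
```

After these definitions, `_fit_sketch` has a single method and no longer accepts `fe`, while both `fit_by_type` methods coexist.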

matthieugomez commented 7 years ago

Thanks for the clarification. I guess I have to think a little bit more about it.

Nosferican commented 7 years ago

A big advantage of the macro approach over the symbols-only approach is that it allows for operations such as contrasts (&, &&) and eventually feature engineering directly in the formula. I would be happy to have a @model macro which would parse a @formula and manage keyword arguments with a => assign/capture operator. This approach might even help packages that rely heavily on generating a model which is then passed to a fit function.

nalimilan commented 7 years ago

Each package can of course provide its own macro, but it would be nice to be able to harmonise the model fitting interfaces. Regarding propositions 2 and 3, I really don't see the point of reinventing a syntax for formulas.

ararslan commented 7 years ago

Perhaps @dmbates could help unify the tribes.

dmbates commented 7 years ago

Gee, thanks, @ararslan

I was trying to keep my head down.

matthieugomez commented 7 years ago

@nalimilan Out of curiosity, what kind of problem would it create to capture every argument (except for, say, the dataframe), and require users to directly insert values in the AST, i.e.

n = 10
@fit(df, y~x, maxiter = $n)

nalimilan commented 7 years ago

The main problem would be that it's not standard, so it would be surprising/confusing. Note that even df would have to be written as $df for that syntax to be fully consistent, and the same would apply e.g. to the model family ($(Binomial())). So you're trading ~ in formulas for $ in other arguments, which is probably not a win.

matthieugomez commented 7 years ago

Thanks. To avoid the consistency problem you mention, what about a fit(df, ModelType, @model(...)) syntax, where every argument in model is captured?

fit(GeneralizedLinearModel, df, @model(y~x, Poisson(), maxiter = 10))

Note that $ is required only when using @model within a function, i.e. during non-interactive use (vs using ~ every time):

function f(modelfamily, n)
    fit(GeneralizedLinearModel, df, @model(y~x, $(modelfamily), $(n)))
end
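A sketch of how such interpolation could be handled inside a capturing macro, assuming hypothetical names (`@model_sketch`, `capture_or_interpolate`); this is not the FixedEffectModels implementation. Arguments written as `$x` are evaluated in the caller's scope, and everything else is stored as an unevaluated expression.

```julia
# Hypothetical helper: `$x` is passed through to the caller's scope,
# anything else is kept as an unevaluated expression.
function capture_or_interpolate(ex)
    if ex isa Expr && ex.head == Symbol("\$")
        return esc(ex.args[1])
    else
        return QuoteNode(ex)
    end
end

macro model_sketch(args...)
    entries = Expr[]
    for a in args
        if a isa Expr && a.head in (:(=), :kw)
            push!(entries, :($(QuoteNode(a.args[1])) => $(capture_or_interpolate(a.args[2]))))
        end
    end
    return :(Dict{Symbol,Any}($(entries...)))
end

# Inside a function, the local `n` must be interpolated explicitly.
function run_sketch(n)
    return @model_sketch(fe = x3 + x4, maxiter = $n)
end

spec = run_sketch(25)
```

Here `spec[:maxiter]` is the value of the local `n`, while `spec[:fe]` remains the expression `x3 + x4`.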

nalimilan commented 7 years ago

I don't like the idea of a syntax that only works in the global scope (and I suspect I'm not the only one).

matthieugomez commented 7 years ago

The syntax does not only work in the global scope. Exactly like @formula, @model requires expression interpolation when modifying one of its arguments.

I have now defined the macro @model in FixedEffectModels. It accepts a formula and a set of keyword arguments, and simply returns a Model which contains the captured expressions:

using FixedEffectModels
@model y ~ x fe = v1 + v2 weights = w
# Formula: y ~ x
# fe: v1 + v2
# weights: w

I wish StatsModels had a similar macro.

lbittarello commented 6 years ago

Latecomer to the party. I've been having the same issues, but I wasn't aware of this discussion. My solution consisted of separating data processing from estimation and using keywords. For example:

data = Microdata(DF, response = "y", control = "1 + x1 + x2", treatment = "t", instrument = "z", weight = "w")
results = fit(IV, data, method = "TSLS")
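How Microdata processes these strings is not stated in the thread. As a sketch only, one way string arguments like these could be turned into expressions at run time, without any macro, is `Meta.parse`:

```julia
# Assumption: the strings are parsed into expressions when the data
# object is built. This is illustrative, not the Microdata implementation.
control = Meta.parse("1 + x1 + x2")   # an Expr, same as :(1 + x1 + x2)
weight = Meta.parse("w")              # a Symbol, same as :w
```

Because the strings are ordinary values, a standard function can accept them next to regular arguments such as the data frame.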

It seems similar to Syntax 2 above.

By the way, I don't like the y ~ x structure. I understand that it's standard, but it's awkward. Consider linear IV and IV weighting, for instance. In the first case, y is regressed on x and t, so the formula makes sense. In the second case, z is regressed on x to construct weights for a regression of y on t, so the formula doesn't make as much sense. They shouldn't need different inputs though...

nalimilan commented 6 years ago

> It seems similar to Syntax 2 above.

Well, not completely, since variables and formulas are passed as strings, so that Microdata is a standard function, not a macro. Since quotes make it possible to distinguish between formulas and standard arguments (like DF), this solution is more similar to my proposal, which is to use a macro and indicate that an argument is a formula via ~ (or possibly : for single variables).

> By the way, I don't like the y ~ x structure. I understand that it's standard, but it's awkward. Consider linear IV and IV weighting, for instance. In the first case, y is regressed on x and t, so the formula makes sense. In the second case, z is regressed on x to construct weights for a regression of y on t, so the formula doesn't make as much sense. They shouldn't need different inputs though...

Sorry, I'm not familiar with the example you are using, so I don't understand the problem. Could you elaborate? Note that you don't need to use y ~ x if that doesn't make sense for your use case: formulas are relatively flexible.

Nosferican commented 6 years ago

In my latest iteration I have decided to follow @matthieugomez's scheme for the core formula:

response ~ exogenous + (endogenous ~ instruments)

for fixed effects a one-sided formula

absorb = fe1 + fe2 * fe3

which is similar to Stata's reghdfe. For the "model" I am using a @model which also includes the data,

model = @model(data = df, formula = y ~ x1 + (z1 ~ Z1), absorb = x + t, weights = w)

@model captures the values and puts them in a dictionary which is passed to my struct generator which parses it and populates the package struct <: StatsBase.RegressionModel. StatsBase.fit! is expanded to have a method for the package struct.

nalimilan commented 6 years ago

I'm still not convinced that a macro capturing all arguments is a good idea. That means @model will only work for your package, and if any other package wants a similar mechanism it will have to use another name or experience conflicts. Also, as I noted, it doesn't allow passing references to local variables (except via ugly hacks for the global scope only). Really, requiring people to write ~ x + t and ~w or :w doesn't sound like a big deal and it would make the macro much more standard.

Nosferican commented 6 years ago

A possible fix for not having to have a qualified @model is to have it part of StatsModels, or DataFrames for the moment, as:

macro model(args...)
    args = Dict(StatsModels.parsemodelarg.(args))
    model = StatsModels.modelbuilder(args)
    return model
end
function StatsModels.modelbuilder(args::Dict{Symbol,Any})
    # Some code that dispatches the module's modelbuilder based on
    # get(args, :module, :Default)
end
StatsBase.fit!(model::ModuleStruct)

and dispatch on the module the user wants to use. The only argument I can think of that could or should be local I think is the data. I use,

function parsemodelarg(obj::Expr)
    args = getfield(obj, :args)
    field = first(args)
    value = last(args)
    if field == :data
        tmp = Expr(:call, :(x -> return x), :data)
        tmp2 = getfield(tmp, :args)
        tmp2[2] = value
        setfield!(tmp, :args, tmp2)
        value = eval(Main, tmp)
    end
    ...
    return (field, value)
end

which could potentially allow for accessing that local reference in other spaces besides the Main module space (global). Again, I think that would only be used in very limited instances, but I am unsure whether it can be used in all the applications you can think of. Otherwise, one can keep data apart and do something like FixedEffectModels.@model:

@reg(df, @model(...))

A limitation of using ~ x + t is that @formula and ModelFrame expect a :lhs that is not nothing in order to work, which limits its use.

nalimilan commented 6 years ago

Anything that depends on eval is a no-go for any standard package like StatsModels (and it's not a great idea in general). And you can't know in advance which argument is going to be a formula, and which one should be a local variable -- except by hardcoding their list in StatsModels, but we don't want to restrict what packages can do.

> A limitation of using ~ x + t is that @formula and ModelFrame expect a :lhs that is not nothing in order to work, which limits its use.

Maybe we can work on finding a good solution with formulas? We could allow an empty LHS, or use :(...), or find another symbol...

kleinschmidt commented 6 years ago

How about a @model macro that takes a data frame and a model constructor and replaces all the ~-quoted arguments with model matrices constructed from the data frame?

kleinschmidt commented 6 years ago

Something like @model df GLM(z ~ 1+x*y, family=:binomial)