default intercept behavior

kleinschmidt commented 7 years ago

I'd like to change the default intercept behavior in formulas. Currently, we have this:

~ 1 + x intercept column and x
~ x intercept column and x
~ 0 + x or ~ -1 + x or ~ x-1 no intercept column, only x.

Including the intercept by default might be sensible when all you're doing is regression. But it's not always the most appropriate thing (e.g. ararslan/Survival.jl#2) and making people use ~ 0+x when they don't want an intercept feels like an unnecessary gotcha, Especially if we want formulas to be useful as a general "glue" between tabular data and numerical arrays.

I think we've largely agreed that this is the desirable behavior:

~ 1+x intercept and x
~ x just x
~ 0+x or ~ -1+x or ~ x-1 an (informative) error.

If that's the case, let's implement that before we release this as a replacement for the statsmodels code in DataFrames.

piever commented 7 years ago

Just to repeat here some concern that I was expressing in ararslan/Survival.jl#2, for Cox regression we need a formula such that, if we only have a variable x:

if x is continuous, then the number of columns in regression matrix is equal to one
if x is categorical and can take n values, then the number of columns in regression matrix is equal to n-1

Would there be a syntax for that?

ararslan commented 7 years ago

Your first bullet point should be the default if there's no intercept term. That would just be ~ x. The second bullet point is interesting... I kind of think that should be the default for categorical vectors, but there has to be a way to specify the reference level.

piever commented 7 years ago

I'd also tend to think that ~ 1+x should have exactly one column more than ~ x or at the very least find it puzzling that the two things should be different for continuous variables but equal/equivalent for categorical. Once we take ~ x to be the same as ~ 1+x in the categorical case, it becomes a bit hard to express that we want one column less.

nalimilan commented 7 years ago

I think the Cox model issue is mostly orthogonal to this issue: whether ~ x includes an implicit intercept or not when x is continuous, we will need a way to indicate that the model matrix shouldn't be full rank when x is categorical. So we need another mechanism anyway.

Regarding the question of the implicit intercept, I agree that ~ 1 + x should be the recommended way to fit a model with an intercept, but I strongly disagree that ~ x should silently fit a model without an intercept for all models. Other standard software (R, Stata, SAS, Python's StatsModels) do not require 1 +, which means that people will forget it all the time and get nonsensical results notably for linear regression. That's going to give Julia a bad reputation of being tricky for newcomers and being reserved for experts.

Last time we discussed this (see also the older discussion on the Google Group), everybody agreed that one should be explicit by using 0 + or 1 +, else an error should be thrown. Models for which an intercept does not make sense (like the Cox model) could opt-out of this behavior using the same mechanism as the one used to specify that a categorical x should not give a full-rank matrix.

An alternative solution which would be equally safe would be to allow ~ x to mean "no intercept", but implement checks in packages before fitting the model. Fitting a model without an intercept would only be possible when passing intercept=false to fit; else an error would be thrown (of course, not for the Cox model). That would have the advantage of not requiring 0 + in StatsModels, with the drawback that each package would have to implement the checks (but that's not too hard to do).

piever commented 7 years ago

This makes a lot of sense. Maybe the suggestion from ararslan/Survival.jl#2 is still valid in this scenario: adding a

default_intercept(T) = error("Be explicit about the intercept")

in StatsBase that is run every time the user doesn't specify the intercept explicitly (i.e. by typing ~ x). Then each model in a separate package could overload this function if it is obvious from the choice of the model whether the intercept should be present or no:

StatsBase.default_intercept(T::Type{CoxModel}) = false

I guess Linear Models could instead default to true and models where both make sense should leave the error. This seems both safe and user friendly and doesn't require any dependency on StatsModels (only StatsBase).

The same mechanism could be used for the other question with a is_categorical_full_rank(T) function which would default to true and would be overloaded for Cox models. The only thing to be decided would then be whether there is ever need for the user to specify this, or the model can always take care of deciding instead of the user. If the former is true, then we should also come up with some syntax for it.

kleinschmidt commented 5 years ago

With #71 we can have our cake and eat it too: by dispatching on the model type during apply_schema, the default behavior for StatisticalModels is the old (default intercept), while the fallback (other or Nothing context) is to not include an intercept by default.

Also, #71 introduces a trait (implicit_intercept) which provides more fine-grained control

JuliaStats / StatsModels.jl

default intercept behavior #31