JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia

Use some character other than * for "main effects and interactions" #99

Open oxinabox opened 5 years ago

oxinabox commented 5 years ago

* is quite possibly the most confusing character for "main effects and interactions".

Because a*b becomes a + b + a&b.

Then, for continuous a and b, the & for the interaction term gets translated into scalar multiplication, which is normally notated using *. This makes it really hard to explain to people (I know, I just tried), and it seems like a common mistake when wanting to do a&b would be to write a*b.
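To make the distinction concrete, here is a minimal sketch of the three spellings side by side. It assumes StatsModels.jl (0.6 or later, i.e. after the terms rewrite) is installed; the column names y, a and b are placeholders, not anything from a real dataset.

```julia
using StatsModels  # assumption: StatsModels.jl 0.6+ is available

# Three ways of writing the right-hand side in the formula DSL.
f_star = @formula(y ~ a * b)           # main effects and interaction
f_long = @formula(y ~ a + b + a & b)   # the same thing spelled out
f_amp  = @formula(y ~ a & b)           # interaction only, no main effects

# Print the parsed right-hand sides to compare them by eye.
println(f_star.rhs)
println(f_long.rhs)
println(f_amp.rhs)
```

The first two are expected to list the same terms, while the third contains only the interaction term.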

As an alternative character, I suggest a && b: it is like & but do more of it.

cc @nickrobinson251

kleinschmidt commented 5 years ago

I don't think && is available right? It's not parsed like a normal operator.


oxinabox commented 5 years ago

It would need some special casing, since it parses to Expr(:&&, :a, :b) rather than Expr(:call, :&&, :a, :b).

But I don't see that as a huge problem?
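The parsing difference can be checked directly with Meta.parse from Base. This is only a sketch of what a macro would see; it does not show how StatsModels itself would have to handle the expression.

```julia
# How Julia parses the candidate operators inside a macro argument.
ex_and    = Meta.parse("a & b")    # ordinary operator: a function call
ex_andand = Meta.parse("a && b")   # control flow: its own expression head

println(ex_and.head)      # call
println(ex_and.args)      # Any[:&, :a, :b]
println(ex_andand.head)   # &&
println(ex_andand.args)   # Any[:a, :b]
```

So a formula macro that dispatches on :call expressions would need an extra branch for the :&& head.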

ararslan commented 5 years ago

it is like & but do more of it

:joy:

But I don't see that as a huge problem?

Using && is not necessarily a problem for the one(s) implementing the change, but I would find that choice incredibly odd as a user.

oxinabox commented 5 years ago

Do you have a better character?

* is maximally confusing.

^ might be ok?

Or plenty of Unicode options

ararslan commented 5 years ago

Do you have a better character?

I don't, really. You mentioned that

it seems like a common mistake when wanting to do a&b would be to write a*b.

FWIW that is how interactions are expressed in SAS, which uses | for main effects and interactions. I imagine that the choice here for * is intended to mimic R, which uses : for interactions and * for main effects and interactions. From that perspective, I think * is a reasonable choice for this, even if it isn't ideal. Python's statsmodels and patsy packages also base their formula syntax off of R.

^ might be ok?

Maybe, but it'd be nice to keep ^ free for use in specifying powers in variable transformations.

oxinabox commented 5 years ago

So SAS has | for main effects and interactions and & for just interactions?

Seems legit.

ararslan commented 5 years ago

No, sorry I wasn't clear. In SAS, a model statement lists covariates like x1 x2 x1*x2 for main effects and interaction for x1 and x2, or equivalently x1|x2. So it uses * for interactions.

oxinabox commented 5 years ago

I think | is ideal for replacing what is currently called *

Consider the truth conditions of a|b: it is true for a, or b, or both (a&b). We can think of that as which predictor variables to use: the main effects a or b, or their interaction a&b.

nalimilan commented 5 years ago

I'd rather keep the current convention, which as @ararslan noted is consistent with R and StatsModels.org, and partly with SAS. Technically * is the right operation for an interaction term. We slightly abuse it since it also generates main effects, but that's reasonable since it's the most common situation. I would be OK with using something other than & for interactions only, e.g. **.

| is already used in MixedModels.jl to specify random effects (again, consistent with R).

oxinabox commented 5 years ago

I would recommend spending 15 minutes trying to explain this package to someone who has neither read the docs, nor used R. When I tried, the response I got when I said:

So a & b does interaction terms, which for continuous variables is the same as doing a*b on the elements of the DataFrame columns, so with, say, a linear model you end up solving for some scalar x1 in x1*a.*b. And then a * b does main terms and interactions, which is the same as a + b + a&b, which is the same as having 3 columns: a, b, and one defined by multiplying the two, a.*b. So in a linear model you end up solving for constants x1, x2, x3 in x1*a + x2*b + x3*a.*b

was very WAT?! and a minute later of me reexplaining. :-S So & in a formula is *? and * in a formula is not *. WAT?

And that was before I got up to saying that (a*b)^1 was multiplication again.
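For readers who find the verbal explanation hard to follow, here is a small plain-Julia sketch of the columns involved for two continuous predictors. It does not use StatsModels at all; the data values and coefficients are made up for illustration, and the intercept that a model would normally add is left out.

```julia
using LinearAlgebra  # for least squares via the backslash operator

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 1.5, 1.0, 2.0]

# Formula "a & b": one column, the elementwise product.
X_amp = reshape(a .* b, :, 1)

# Formula "a * b" (that is, a + b + a&b): three columns.
X_star = hcat(a, b, a .* b)

# Made-up response built from known coefficients x1 = 2, x2 = 3, x3 = 0.5.
y = 2.0 .* a .+ 3.0 .* b .+ 0.5 .* (a .* b)

# Solving the linear model recovers one coefficient per column.
coeffs = X_star \ y
println(size(X_amp))    # (4, 1)
println(size(X_star))   # (4, 3)
println(coeffs)         # approximately [2.0, 3.0, 0.5]
```

The point of the sketch is that the formula operator * selects which columns exist, while the arithmetic * only appears inside the interaction column.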


and partly with SAS

No: what I am proposing is partially compatible with SAS. What we have right now is not at all compatible with SAS. I checked the SAS docs.

Summary:

But I would also caution against overvaluing mimicking other things too closely, since we have the chance to do better.

nalimilan commented 5 years ago

There will always be a discrepancy between what special operators like + and * mean in formulas and what they mean mathematically. As you noted, a+b means x1*a + x2*b, not x1 * (a+b). We would have to stop using + if we wanted to be completely consistent.

I agree the current system isn't perfect, but I don't find using & and | very obvious either. FWIW, the origin of the formula syntax in R is http://www.jstor.org/stable/2346786.

oxinabox commented 5 years ago

I agree the current system isn't perfect, but I don't find using & and | very obvious either.

Can we brainstorm a bit more, before the release is tagged? Next release will be breaking anyway.

ararslan commented 5 years ago

I don't mind the current system personally. I actually kind of like & for interaction because, read aloud, a&b is "a and b," which sounds like an interaction term.

nickrobinson251 commented 5 years ago

I actually kind of like & for interaction because, read aloud, a&b is "a and b," which sounds like an interaction term.

yeah, this seems fine to me too :)

But * for a + b + a&b seems unnecessarily confusing (to me), if we can do better. FWIW I like | but I think && is also fine, even ** (it's weird but not confusing) ...or some unicode character if there's an appropriate one.

ararslan commented 5 years ago

I am very much against &&, and | is taken by MixedModels, as Milan noted.

If we were to change something here, I think the thing that would make the most sense to me would be to switch the meanings of * and &, so that * is interaction and & is main effects with interaction. However, our formula notation has been in use since the very early days (it well predates my use of Julia, which began with 0.2), so that would be hugely breaking for a lot of existing code and learning materials...

nickrobinson251 commented 5 years ago

| is taken by MixedModels

whoops, sorry, missed that

that would be hugely breaking

Yeah... fair enough

For me * is an unfortunately unintuitive bit of notation, but I understand that formula is a DSL and that I'm only coming to this package post Terms 2.0 so 🤷‍♂️

matthieugomez commented 5 years ago

I also think the current situation is confusing, and I would rather have any of the solutions mentioned in the thread.

nickrobinson251 commented 5 years ago

I think && would be acceptable :)

Alex doesn't like it but has not yet said why -- seems in keeping with & and much less confusing than * not being multiply in this DSL

(* and ^ have mathematical meanings outside the DSL and may be expected to act the same in the DSL, e.g. ^ for specifying powers in variable transformations.)

ararslan commented 5 years ago

Alex doesn't like it but has not yet said why

It is a heinous pun on a control flow operation. I would find && not meaning "short-circuiting logical and" far more confusing than * not meaning multiplication.

nickrobinson251 commented 5 years ago

we do already use & and it doesn't mean "bitwise and" (and that seems less bad than * to me since it is not a common equation term)

other suggestions welcome :)

oxinabox commented 5 years ago

(and that seems less bad than * to me since it is not a common equation term)

This bit, I think, is important: the fact that no one would want to write && to mean "short-circuiting logical and" in an equation.

matthieugomez commented 4 years ago

Could we switch to & and && in the next version (deprecating *), and then switch to * and **?

matthieugomez commented 4 years ago

Bump. I would really support using one character (any character!) to denote interaction and two characters to denote main effects and interactions. Also, I think && is really fine. The current situation leads users to make mistakes, which is more problematic IMO than “heinous puns”.

palday commented 4 years ago

I think it's time to emphasize that the formula syntax is in a very real sense not Julia code: it's a DSL and that's why it's wrapped in a macro call. A lot of things in the formula syntax don't act as they do in normal Julia, e.g. function calls (nominally you're writing columns, but functions that act on vectors may or may not work, while functions that act on scalars do work because of the implicit broadcasting).

The DSL in question has a very long tradition and is more formally called Wilkinson-Rogers notation. It's not just Julia copying R here; this DSL is used in Python's Patsy, StatsModels.org, SAS, and is well-known and used in various parts of the statistical literature, including various textbooks. Breaking compatibility with the other implementations of this DSL is for me much worse than breaking compatibility with Julia, which is by definition a different language. It's unfortunate that nobody agrees on what the stand-alone interaction term looks like, but it is comparatively rare to have interactions without main effects, and essentially all the implementations agree that main effects are + and main effects with interactions are *.

In other words, we may make a few pure Julia users happy with something that matches Julia, but all other users of the DSL in question are going to be very unhappy and we're going to create problems for all the package maintainers who depend on StatsModels when their users' formulae stop working.

kleinschmidt commented 4 years ago

I think there's room for a middle ground here: splitting off most of the basics into a StatsModelsBase package and tweaking how function calls are handled as @oxinabox has proposed elsewhere (#117 I think) might mean that most if not all of the Wilkinson-Rogers notation can be split out into another package, since the operators just overload normal Julia calls (I've started to play around a bit with some other syntax people have requested in https://github.com/kleinschmidt/RegressionFormulae.jl). That would mean that other modeling packages are free to provide their own DSL syntax via their own package or other stand-alone syntax packages.

While I think that should still be discouraged, it's important to recognize that no DSL syntax is going to be perfectly intuitive or match the needs of different modeling paradigms, and my hope is that StatsModels can provide a foundation to build useful DSLs across a wide variety of modeling paradigms in the Julian spirit of modularity and composability.

oxinabox commented 4 years ago

I think it's time to emphasize that the formula syntax is in a very real sense not Julia code: it's a DSL and that's why it's wrapped in a macro call. A lot of things in the formula syntax don't act as they do in normal Julia, e.g. function calls (nominally you're writing columns, but functions that act on vectors may or may not work, while functions that act on scalars do work because of the implicit broadcasting).

To be clear, the problem is not that it is not like normal Julia code. The problem is that * is the single most confusing choice of character to do this, because multiplication is the standard action for interaction terms.

Like having * be literally any other symbol is fine.

--

There is no problem with a DSL being used. The problem is solely with the choice of characters used in the DSL. We are creating a new instance of the DSL and need to consider the options to make it the best instance possible.

Breaking users' code is sad, sure. But they have a Project.toml with compat set, so their code is not actually going to break. And the whole reason we have not tagged 1.0 is that we are still considering options which may be breaking. It does not do to discard options out of hand just because they are breaking. So while it's an argument against, and a decent one, it doesn't carry the same weight it would have had the package tagged 1.0. People chose to use the software when it was admitting that the API can change.

Tokazama commented 4 years ago

Perhaps putting aside the pros and cons of changing the current syntax compared to the standard, and focusing on the basic functionality we need, would be a good first step. In the end * is just a convenient way of combining two things, and if it did go away we could ultimately get by without it. However, we should be careful not to use syntax that could be valuable in the future somehow. I think establishing the basic needs in something like RegressionFormulae would be a good first step for objectively solving this. I'd be happy to discuss specific terms there and build up to a solution.

palday commented 4 years ago

To be clear, the problem is not that it is not like normal Julia code. The problem is that * is the single most confusing choice of character to do this, because multiplication is the standard action for interaction terms.

But even that is an assumption from Julia! Although many languages use * for multiplication, that is not universally the case nor is * used only for multiplication in other languages. Moreover, * is not generally used in mathematical (pedagogical) texts to indicate multiplication, while * is used in statistical (pedagogical) texts to indicate interactions with corresponding main effects.

While you may actually want the product of two terms, that is in itself a hassle to define clearly within the formula syntax because the type of product you want differs depending on what those terms are. For two continuous terms, it's the simple element-wise multiplication of the field elements. But for two categorical terms, it's the Kronecker product. For a continuous term and a categorical term, it's the element-wise multiplication of the continuous term with each of the expanded factors. And what happens if you multiply a scalar by a categorical variable?
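The three cases described above can be sketched in plain Julia. This is only an illustration under simplifying assumptions: the helper onehot is hypothetical and uses full indicator coding, whereas real packages typically apply contrast coding that drops a reference level, and their column ordering may differ from what kron produces here.

```julia
using LinearAlgebra  # stdlib; provides kron

x = [1.0, 2.0, 3.0]    # continuous
z = [0.5, 0.5, 2.0]    # continuous
g = ["u", "v", "u"]    # categorical with levels u, v
h = ["p", "p", "q"]    # categorical with levels p, q

# Hypothetical helper: one indicator column per level (no reference level dropped).
onehot(v) = [Float64(vi == l) for vi in v, l in sort(unique(v))]

G = onehot(g)   # 3×2 indicator matrix
H = onehot(h)   # 3×2 indicator matrix

# continuous × continuous: element-wise product, one column
cont_cont = x .* z

# continuous × categorical: the continuous column times each indicator column
cont_cat = x .* G

# categorical × categorical: Kronecker product of the two indicator rows, per observation
cat_cat = vcat([kron(G[i, :], H[i, :])' for i in 1:size(G, 1)]...)

println(cont_cont)        # [0.5, 1.0, 6.0]
println(size(cont_cat))   # (3, 2)
println(size(cat_cat))    # (3, 4)
```

Each row of cat_cat has exactly one nonzero entry, marking which combination of levels that observation falls into.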

As a natural language example: most European languages express "no" with a word starting with "n" followed by "e" or "o", but Greek uses "ne" to indicate yes. Likewise, most Europeans nod to indicate agreement, but nodding means disagreement and shaking your head indicates agreement in Bulgarian, but we don't ask speakers of those languages to change their symbols because it's confusing to people coming from other languages. While you may find saying "ne" to mean "yes" is confusing, a Greek speaker saying "ne" to mean "no" in Dutch finds it equally confusing. But the native speakers of each language determine what the correct form is for that language. Here, that means that the "native speakers" of Wilkinson-Rogers determine what notation works -- and there is widespread consensus in the broader community of Wilkinson-Rogers users that * means "interactions with main effects".

behinger commented 4 years ago

If I read the papers correctly, A x B => A + B + A·B would be the historically correct way (Wilkinson 1973, Nelder 1965) (the x comes from crossed, as in a crossed Latin-square design).

But for obvious reasons x is not a good operator on a computer (prior to Julia's Unicode support?) and A x B seems to be abbreviated as A*B (Wilkinson 1973, implemented in R, Python, Matlab). Thus the asterisk should be thought of as the crossing operator, not multiplication, in this context (whether this is still intuitive, I don't know). This syntax is also broadly used to describe models in many research papers etc. I think that changing the use of the asterisk would likely be a massive source of misunderstanding in the communication of model specifications if the same symbols are used.

I personally think explaining it as crossing instead of multiplication would communicate the idea clearly and be backwards-compatible.
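The crossing reading, A x B => A + B + A·B, can be written down as a small term-expansion sketch in plain Julia. The functions plus, interact and cross are hypothetical names chosen for this illustration, not part of StatsModels; a model is represented as a list of terms, and each term as a sorted list of factor names.

```julia
# "+" : the union of two term lists
plus(A, B) = unique(vcat(A, B))

# "·" : combine every term of A with every term of B
interact(A, B) = unique([sort(unique(vcat(t, u))) for t in A for u in B])

# "×" (crossing): A + B + A·B
cross(A, B) = plus(plus(A, B), interact(A, B))

A = [[:a]]
B = [[:b]]
C = [[:c]]

println(cross(A, B))                    # [[:a], [:b], [:a, :b]]
println(length(cross(cross(A, B), C)))  # 7: three main effects, three two-way and one three-way term
```

Read this way, the operator never multiplies anything; it only decides which terms enter the model.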