Open matthieugomez opened 5 years ago
This came up in slack the other day and some of the discussion there helped me clarify my thinking a bit so I wanted to preserve it for posterity. The underlying issue is that 1) we want to support functions like `categorical`, `scale`, `lag`/`lead`/`diff` which operate at the level of an entire column, but 2) there are a number of contexts where we can't just apply a standard julia function to the entire column: row-wise (or even batched streaming) data are one obvious case (since you only have a subset of the observations at once), but another one is `predict`, since you're getting new data. One of the primary design motivations for #71 (Terms 2.0: Son of Terms) was to abstract information about the column-level transformations into terms (`<:AbstractTerm`), which could then work for any amount of data (hence the separation of syntax/schema/data time in the API). This allows us to, e.g., avoid special-casing the handling of categorical values in `predict`, instead providing an API that anyone can plug into to support arbitrary transformations.
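To make the `predict` problem concrete, here is a minimal sketch in plain Julia. The `CenterSketch` type and the `fit_center`/`apply_center` functions are hypothetical names made up for illustration; they are not the StatsModels API. The idea is only that the invariant (here the mean) is computed once from the training column and then reused for any amount of new data.

```julia
using Statistics

# Hypothetical "term" (not part of StatsModels): it stores the invariant
# computed once, at what the API calls schema time.
struct CenterSketch
    center::Float64
end

# Schema time: look at the whole training column and keep its mean.
fit_center(x::AbstractVector{<:Real}) = CenterSketch(mean(x))

# Data time: works for a column, a batch, or a single value.
apply_center(t::CenterSketch, x) = x .- t.center

train   = [1.0, 2.0, 3.0]
newdata = [10.0]

t = fit_center(train)
centered_train = apply_center(t, train)     # [-1.0, 0.0, 1.0]
centered_new   = apply_center(t, newdata)   # [8.0], uses the training mean
naive_new      = newdata .- mean(newdata)   # [0.0], recomputes the mean on new data
```

Re-running a plain column function on the new data (the last line) silently uses the wrong mean, which is the kind of result a stored invariant avoids.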
So, if the design of the API precludes treating column-level transformations as "standard" julia functions, then we're left with a design choice:

1. special-case functions like `scale`, `categorical`, etc., that depend on invariants of the data available at schema time, and change the "function term" syntax to require broadcasting for elementwise application, making column-wise application of normal julia functions the standard, or
2. keep applying function terms elementwise, and handle column-level transformations through dedicated terms.

The status quo is 2., which I still prefer. I think users would be very confused if some functions operated "normally" but others were "special", and would get frustrated when they write a function that takes a whole column, use it in a formula, and get weird (or even worse, invisibly wrong) results when they, e.g., try to generate model predictions on held-out data. And even though the conventions I've adopted in #71 differ from idiomatic julia, the formula DSL is a DSL, and it differs from idiomatic julia in many other ways (e.g., #99).
I think the ergonomics of the current situation could be much improved, e.g. by adding a "column-level" wrapper term type that would encapsulate a function call that can safely be applied to a whole column, and maybe even by defining an API for how column-level terms that need data invariants (like `categorical`, which needs to know the unique levels, or `scale`, which needs to know summary statistics to use in scaling) can store, extract, and access them. But I don't have time at the moment to really push on that (maybe this summer?).
Note that we could also go with 1., but throw an error when a non-elementwise function is called, unless it's special-cased. That would probably be the clearest and the safest solution: otherwise some functions may appear to work, but be applied to each element instead of the whole vector (e.g. `scale` would do that without special casing, and that can be the case for any custom function).

But of course the drawback of that approach is that it would be inconvenient: `y ~ x + x^2 + log(z)` would become `y ~ x + x.^2 + log.(z)`. Also the mix between `+` and `.^` would be weird.
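The "appears to work" failure mode is easy to reproduce in plain Julia, without any formula machinery. In the sketch below, `center` is a hypothetical user-written column function (not something from StatsModels); applying it to each element does not error, it just returns the wrong numbers.

```julia
# Hypothetical custom function that is meant to see the whole column.
center(x) = x .- sum(x) / length(x)

x = [1.0, 2.0, 3.0]

columnwise  = center(x)        # intended use: [-1.0, 0.0, 1.0]
elementwise = map(center, x)   # applied per element: [0.0, 0.0, 0.0], no error
```

Each scalar is its own mean, so the elementwise result is all zeros. Nothing in the output signals that the function was never given the whole vector.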
Patsy (a python "formula" package) makes a distinction between functions (which can be applied elementwise) and "stateful transforms", which need to know some invariants of the data (e.g., center or standardize/scale). There's a good discussion of why this distinction is necessary in the docs, which is much clearer than my arguments above: https://patsy.readthedocs.io/en/latest/stateful-transforms.html

That might be a useful abstraction, especially for developing first-class support for streaming data while still being extensible. Note that `ContinuousTerm` is basically a function, while `CategoricalTerm` is stateful (since it needs to know the number and values of the unique levels of the data).
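For row-wise data, the function/stateful split could look roughly like the sketch below. All names (`log_row`, `OneHotSketch`, `learn_levels`, `encode`) are hypothetical and exist only for this example; this is neither Patsy's nor StatsModels' implementation, and it uses a full one-hot encoding purely to keep things short.

```julia
# A plain function needs no state, so it can run on one row at a time.
log_row(row) = log(row.z)

# A stateful transform first has to learn its invariants from the data.
struct OneHotSketch
    levels::Vector{String}   # unique levels seen in the first pass
end

# Pass 1: stream over the rows once and collect the unique levels.
function learn_levels(rows)
    seen = String[]
    for row in rows
        row.g in seen || push!(seen, row.g)
    end
    return OneHotSketch(sort!(seen))
end

# Pass 2: encode a single row using only the stored levels.
encode(t::OneHotSketch, row) = [row.g == l ? 1.0 : 0.0 for l in t.levels]

rows = [(g = "b", z = 1.0), (g = "a", z = 2.0), (g = "b", z = 3.0)]

t = learn_levels(rows)
encoded = [encode(t, r) for r in rows]
new_row = encode(t, (g = "a", z = 5.0))   # a row that arrives later
```

Once the levels are stored, each row can be encoded on its own, so the same object works for streaming input and for rows that arrive after fitting.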
In the current implementation, transformations (such as `log`) are applied elementwise. AFAIU, this allows `StatsModels` to work with any streaming interface, not just `DataFrame`. However, this has two drawbacks: `log.(x)`
This may be fine, but I just wanted to have a discussion on whether it was the right path going forward.

See also:
https://github.com/JuliaStats/StatsModels.jl/issues/75#issuecomment-470210604
https://github.com/JuliaStats/StatsModels.jl/pull/71#issuecomment-473313854