Changes to the Formula interface

johnmyleswhite commented 10 years ago

I'd like to propose a few changes to our Formula/ModelMatrix interface:

1. Make it possible to generate a model matrix in which observations are columns in addition to the current implementation in which observations are rows.
2. Make it possible to reuse a fixed storage array while considering different DataFrames, each of which is guaranteed to produce the same size model matrix. This is really important for doing streaming linear regression.
3. Automatically treat string columns as factors.
4. Automatically treat integer columns as floats, except that specifying factor(IntColumn) in a formula will treat this value as a categorical factor instead.
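As a rough illustration of the four proposals above, here is a minimal sketch in Base Julia only. It is not the DataFrames implementation: `fill_modelmatrix!`, its `levels` argument and the `obs_as_columns` keyword are hypothetical names, a NamedTuple of vectors stands in for a DataFrame, and no intercept column is added.

```julia
# Hypothetical helper (not a DataFrames function): fill a preallocated matrix
# from a table given as a NamedTuple of equal-length column vectors.
# `levels` fixes the ordered levels of each string column, so every batch
# produces a matrix of the same size and the buffer can be reused.
function fill_modelmatrix!(out::Matrix{Float64}, cols::NamedTuple,
                           levels::Dict{Symbol,Vector{String}};
                           obs_as_columns::Bool = false)
    n = length(first(cols))
    # Write value v for predictor j and observation i in the requested layout.
    setcell!(j, i, v) = obs_as_columns ? (out[j, i] = v) : (out[i, j] = v)
    j = 0  # index of the predictor currently being written
    for (name, col) in pairs(cols)
        if eltype(col) <: AbstractString
            # String column treated as a factor: one indicator per
            # non-reference level (the first level is the reference).
            for lev in levels[name][2:end]
                j += 1
                for i in 1:n
                    setcell!(j, i, col[i] == lev ? 1.0 : 0.0)
                end
            end
        else
            # Integer, Bool and floating-point columns are promoted to Float64.
            j += 1
            for i in 1:n
                setcell!(j, i, Float64(col[i]))
            end
        end
    end
    return out
end

lv = Dict(:g => ["a", "b", "c"])
buf = Matrix{Float64}(undef, 3, 3)   # 3 observations x 3 predictors

batch1 = (x = [1, 2, 3], g = ["a", "b", "c"])
fill_modelmatrix!(buf, batch1, lv)
first_fill = copy(buf)

# A second batch of the same shape overwrites the same storage.
batch2 = (x = [4, 5, 6], g = ["c", "c", "a"])
fill_modelmatrix!(buf, batch2, lv)

# Same data with observations stored as columns.
buf_t = Matrix{Float64}(undef, 3, 3)
fill_modelmatrix!(buf_t, batch2, lv; obs_as_columns = true)
```

Fixing the factor levels up front is what guarantees the constant matrix size that the streaming use case relies on; a batch that happens to lack a level still fills the same columns.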

HarlanH commented 10 years ago

I can't think of any reason not to do any of these. Presumably the first one lets you use existing libraries that assume column observations without needing to write wrapper code or do a transpose. The second one sounds really useful. The alternative to the third one is to throw an error? And for the fourth one, that makes sense -- it's just promoting all the different numerical types (boolean too, I imagine) to Float64, right?

johnmyleswhite commented 10 years ago

Right now we throw an error for string columns.

Yes, we should promote all numerical types to Float64 for model matrices. Right now we throw errors for anything other than floating point types.

nalimilan commented 10 years ago

Sounds good. About point 4, I was just wondering whether some models might want to handle integers as such rather than as floats, e.g. for efficiency when treating count data, but that's probably crazy.

johnmyleswhite commented 10 years ago

I'm going to go ahead and make some of these changes, but it's become clear to me that we have much bigger problems to contend with if we want to make formulas work well.

The core problem is that R doesn't use consistent semantics for ~. In R, ~ is basically a warning that the formula argument is an instance of a custom DSL, whose semantics vary from package to package. For example, the interpretation of a ~ b in most packages by Hadley Wickham is that the left-hand side of the tilde operator is a list of variables that will correspond to rows of a function's output and that the right-hand side of the tilde operator is a list of variables that will correspond to columns of a function's output. This is totally unlike the semantics used in linear modeling, where the left-hand side of a formula describes the predicted variables of a linear predictor model and the right-hand side of a formula describes the columns of the model matrix of predictors.

Right now, we could support this with DataFrames because we just parse ~ operators as calls to the @~ macro, which then transforms things into a Formula object. Functions can treat that Formula object however they'd like.

But, at present, we do a good deal more than that in DataFrames: we also define operations that translate Formula objects into model matrices. I think this is a good thing: it means that we can standardize the semantics of ~ and use it to introduce a coherent DSL for linear modeling that operates across packages.

To get this right, we need to make sure that we agree on the semantics of the Formula -> ModelMatrix conversion process.

Here's my current take on the topic:

The left-hand side always refers to a vector-valued output. This can be either a column of a DataFrame or a series of simple functions applied to a column of a DataFrame.
The right-hand side always refers to a matrix-valued design matrix of predictors. This can be constructed from any combination of columns of a DataFrame, simple functions applied to columns of a DataFrame and interactions of columns, which are just elementwise multiplications of columns.
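To make the two rules above concrete, here is a small sketch in Base Julia. The column names and values are made up for illustration, and plain vectors stand in for DataFrame columns.

```julia
# Made-up columns standing in for a DataFrame.
x1 = [1.0, 2.0, 3.0]
x2 = [0.5, 0.0, 2.0]
raw_y = [1.0, 2.0, 4.0]

# Left-hand side: a vector, here a simple function applied to one column.
y = log.(raw_y)

# Right-hand side: a matrix built from columns and their interaction,
# where the interaction is the elementwise product of the two columns.
X = hcat(x1, x2, x1 .* x2)
```

Under this reading, every formula term maps to one or more columns of X, and the response is always a single vector of the same length as the number of rows.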

This should handle the basic cases we need for simple GLM use cases.

There are some problems with this approach:

How do we handle outputs that are matrix-valued, as occurs in multinomial regression?
How do we handle non-linear regression? Should NLS actually involve the construction of a design matrix?

nalimilan commented 10 years ago

Makes sense. Another use case I can see:

How do we handle mixed-effects models? Cf. R's lme (fixed | random) syntax.

There's also the more general issue of non-model formulas. I think it's fine to leave them aside for now and concentrate, as you do, on model matrices. (The cases I'm thinking about are frequency tables, where there is no left-hand side; other pivot tables, with one or several results like mean, median, standard deviation, etc. on the left-hand side, across one or several factors on the right-hand side; and various plots, which are basically graphical versions of pivot tables.)

nalimilan commented 10 years ago

Oh, and there are also cases where, like with random effects, you need to distinguish between "standard" variables and "special" ones. In survival/event-history models (AFT, PH, Cox...), you sometimes want to specify that some factors correspond to different strata (i.e. different baseline distributions for each level). With some models, you also need to be able to say that one variable affects a given parameter of the distribution, and another variable a different parameter (and sometimes both). In R's survival or flexsurv packages, this is handled using pseudo-functions in the formula, like strata() or shape(), scale()...

Similarly, the mlogit R package allows three different kinds of variables [1]. It uses | to separate them. I find this less obvious than specifying explicitly the type of effect you want using pseudo-variables (or another syntax).

1: http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

johnmyleswhite commented 10 years ago

Regarding mixed models, I'm happy to have the DataFrames definition of ModelMatrix support the most general, but also fully unambiguous, parsing of Formulas possible. My main interest is having a function that transforms DataFrames into matrices in a consistent way that many packages (like glmnet) can use to expose a nice DataFrame interface.

In general, I'm pretty opposed to the usage of non-model formulas. For plots and tables, I think it's much, much, much clearer to have arguments called rows = [:a, :b] and cols = [:c, :d].

Another way to make my point: we shouldn't encourage people to pun on ~ unless it's absolutely necessary to their problem. I think the multiplicity of semantics used for ~ isn't one of R's strengths. As it stands, ~ is basically an alias for quoting a two-part expression.

nalimilan commented 10 years ago

Indeed, it would be a good alternative. Gadfly works more or less that way currently, with a syntax for DataFrames of the form plot(data("datasets", "iris"), x="Sepal.Length", y="Sepal.Width", Geom.point). With the move to symbols for column identifiers, I guess it should look like this: plot(data("datasets", "iris"), x=:SepalLength, y=:SepalWidth, Geom.point).

There's just one question: suppose you want a pivot table with mean and standard deviation for two groups as columns, crossed with a factor as rows. Something like:

      Group1     Group2
      M   SD     M    SD
L1
L2

A formula for this could be mean(X) + std(X) + factor1 ~ factor2. Do you think this would also work as row=:(mean(X) + std(X) + factor1), col=:factor2?

johnmyleswhite commented 10 years ago

I think that's a little too featurey to worry about just yet, especially since it seems to implicitly use column hierarchies, which we don't yet support. For now, I'd prefer that we just let people do reshaping and split-apply-combine operations.

nalimilan commented 10 years ago

Sure, let's concentrate on models for now.

nsgrantham commented 8 years ago

> How do we handle outputs that are matrix-valued, as occurs in multinomial regression?

Is this addressed in the current Formula/ModelMatrix interface, or elsewhere? I'm willing to take a closer look at this if not.

nalimilan commented 8 years ago

@nsgrantham The current implementation (it lives in DataFrames.jl right now) only supports a single symbol on the LHS. It shouldn't be too hard to allow any expression (i.e. several terms), like for the RHS. I think that would work, i.e. one would write x + y ~ z for multinomial regression or similar models (instead of cbind(x, y) ~ z in R). Is that what you'd need?

nsgrantham commented 8 years ago

@nalimilan Yes, exactly.

However, I don't believe this notation is quite appropriate for multinomial regression. Each "variable" on the LHS is really a category, so they are quite different from the variables on the RHS. For example:

using DataFrames
using Distributions
# n observations, m trials per observation, p = category probabilities
n, m, p = 100, 10, ones(3) ./ 3
Y = rand(Multinomial(m, p), n)'  # n x 3 matrix of category counts
X, Z = rand(n), rand(n)          # two predictors
df = convert(DataFrame, [X Z Y])
names(df)  # :x1, :x2, :x3, :x4, :x5
ModelFrame(x3 + x4 + x5 ~ x1 + x2, df) # correct
ModelFrame(x3 + x5 ~ x1 + x2, df)      # incorrect, draws are made conditional on m & p
ModelFrame(x3 + x4*x5 ~ x1 + x2, df)   # ???

It is not clear to me what a suitable solution would be. Maybe allow for the LHS to refer to a stand-alone matrix rather than column(s) of df?

typeof(Y)  # Array{Int64, 2}
df2 = convert(DataFrame, [X Z])
names(df2)  # :x1, :x2
ModelFrame(Y ~ x1 + x2, df2)  # @assert size(Y, 1) == size(df2, 1)

nalimilan commented 8 years ago

So this would only be required to fit multinomial regression from count data: with individual observations, a single dependent variable with several levels would be enough. I don't know how common the count data scenario is, but it doesn't fit very well in the design. One solution, like for binomial logistic regression currently in GLM.jl, is to have one row/observation for each possible case, with (frequency) weights giving the counts. It should work.
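The row-per-case layout with frequency weights can be sketched as follows, in Base Julia only. `expand_counts` is a hypothetical helper, not a function from GLM.jl or DataFrames, and the data are made up.

```julia
# Hypothetical helper: turn an n x k matrix of category counts into a long
# table with one row per (observation, category) pair, where the count
# becomes a frequency weight.
function expand_counts(x::AbstractVector, counts::AbstractMatrix{<:Integer})
    n, k = size(counts)
    xs = similar(x, n * k)
    category = Vector{Int}(undef, n * k)
    weight = Vector{Int}(undef, n * k)
    r = 0  # current row of the long table
    for i in 1:n, c in 1:k
        r += 1
        xs[r] = x[i]            # predictor value repeated for each category
        category[r] = c         # single dependent variable with k levels
        weight[r] = counts[i, c]
    end
    return (x = xs, category = category, weight = weight)
end

x = [0.1, 0.9]
counts = [3 5 2;
          0 4 6]
long = expand_counts(x, counts)
```

The long table has a single categorical response, so it fits the one-symbol LHS design; rows with zero weight carry no information and could be dropped before fitting.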

If you really want to pass the counts in the LHS, we could use another symbol like x#y ~ z or x/y ~ z (or anything, really) to make it clear it's not the same as +.

Anyway, I don't think having a symbol in a formula refer to a matrix outside of the data frame would really work well in Julia. And I must say I don't like it in R either, as it makes it very complex to resolve a symbol (in particular for package authors implementing new models), prevents re-fitting a model on an updated dataset, and doesn't work with out-of-memory datasets.

ValdarT commented 6 years ago

> Make it possible to generate a model matrix in which observations are columns in addition to the current implementation in which observations are rows.

I think this, plus allowing Clustering.jl, MultivariateStats.jl, etc. to work on Formula+DataFrame directly like in GLM.jl, would be a considerable usability win for day-to-day data analysis workflows. Perhaps now is a good time, since the big changes around Missing data and DataFrames have settled and Julia 1.0 is close, so it's actually feasible to start using Julia instead of, say, R in more contexts?

nalimilan commented 6 years ago

Yes, somebody just needs to do the work. ;-)

JuliaStats / Roadmap.jl

Changes to the Formula interface #3