JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia

Support panel and time aware sources #90

Open Nosferican opened 5 years ago

Nosferican commented 5 years ago

Virtually every statistical environment supports datasets with repeated observations over time and potentially multiple units of observation (i.e., longitudinal or panel data). Terms 2.0 introduced terms that operate between the data and the model representation. Terms such as lag/lead, diff, and others require a notion of time and potentially a group-by approach. Other software usually handles this with an up-front declaration: xtset panelvar timevar in Stata, pdata.frame(index = c(id, time)) in R's plm, ID cross-section-id <time-series-id> in the SAS CPANEL procedure, etc. Any thoughts on how best to integrate that within StatsModels?

nalimilan commented 5 years ago

The implementation depends on whether the time and grouping information is stored in the table or passed to the model specification. The latter is probably simpler to implement since we could just define a special function term and do whatever is needed to compute the column.
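A minimal sketch of the second option, where the time and grouping columns are named in the call rather than stored in the table. It uses only Base Julia and a NamedTuple of vectors as the table; `panel_lag` and its `time`, `group`, and `n` keywords are hypothetical names for illustration, not StatsModels API. It assumes `n >= 1` and that consecutive time values within a unit are adjacent periods (gaps are not checked).

```julia
# Hypothetical helper, not part of StatsModels: lag a column within each unit.
function panel_lag(table::NamedTuple, col::Symbol; time::Symbol, group::Symbol, n::Int = 1)
    x, t, g = table[col], table[time], table[group]
    # Start with all-missing output, aligned with the original row order.
    out = Vector{Union{Missing,eltype(x)}}(missing, length(x))
    for unit in unique(g)
        rows = findall(==(unit), g)        # rows belonging to this unit
        rows = rows[sortperm(t[rows])]     # order those rows by time
        for k in (n + 1):length(rows)
            out[rows[k]] = x[rows[k - n]]  # value from n periods earlier
        end
    end
    return out
end

# Rows are deliberately unsorted to show that the result keeps the input row order.
data = (county = ["A", "B", "A", "B", "A"],
        year   = [2001, 2001, 2000, 2000, 2002],
        x      = [2.0, 20.0, 1.0, 10.0, 3.0])

lagged = panel_lag(data, :x; time = :year, group = :county)
# lagged is [1.0, 10.0, missing, missing, 2.0]
```

Because the output is written back by row index, the computed column lines up with the other columns of the table without requiring the data to be sorted first.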

If we also want to support custom table types which would carry that information, we need an API (in Tables.jl?) for StatsModels to retrieve it and use it in the method for the function term.

Another issue is whether specific models would want to do something with that information, e.g. computing clustered standard errors. Then we would also need to store that somewhere.

Nosferican commented 5 years ago

Weights and custom representations, such as high-dimensional features to absorb or clusters, can be parsed from the FormulaTerm / data. However, terms such as lag, lead, and diff are different because they require a context: a time dimension and an optional panel identifier. That is why I think they might warrant special handling. For example, a context could be passed to modelcols to make it aware of time/panel. It would then default to something like mapreduce(table -> modelcols(t, table), vcat, groupby(data, pid)). In other words, modelcols may need a context argument that holds the time/panel information so that these terms can be implemented. Another option is to have it as part of the terms, e.g. @formula(log(y) ~ diff(x, pid = County)), but make it easier to use by standardizing pid/tid or something similar.
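A rough sketch of the context idea, using only Base Julia so it does not depend on any particular table or groupby implementation. `PanelContext`, `unit_diff`, and `panel_diff` are hypothetical names, not existing StatsModels functions; the grouping is done by hand instead of with groupby, and the per-unit results are combined with mapreduce and vcat as in the comment above.

```julia
# Hypothetical context holding the panel and time column names.
struct PanelContext
    pid::Symbol   # panel (unit) identifier column
    tid::Symbol   # time identifier column
end

# First difference of one unit's series, assumed already sorted by time.
# The first observation has no predecessor, so it is missing.
unit_diff(x::AbstractVector) = [missing; diff(x)]

# Apply unit_diff to each unit and stack the pieces with vcat.
function panel_diff(table::NamedTuple, col::Symbol, ctx::PanelContext)
    x, g, t = table[col], table[ctx.pid], table[ctx.tid]
    mapreduce(vcat, unique(g)) do unit
        rows = findall(==(unit), g)       # rows belonging to this unit
        rows = rows[sortperm(t[rows])]    # order those rows by time
        unit_diff(x[rows])
    end
end

data = (County = ["A", "A", "B", "B", "B"],
        Year   = [1, 2, 1, 2, 3],
        x      = [1.0, 3.0, 10.0, 14.0, 15.0])

d = panel_diff(data, :x, PanelContext(:County, :Year))
# d is [missing, 2.0, missing, 4.0, 1.0]
```

One caveat with the mapreduce/vcat form: the result comes back in unit order and then time order, so it only lines up with the other model columns if the data are already sorted by (pid, tid). An implementation would either need to require that ordering or scatter the values back to the original row positions.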

nalimilan commented 5 years ago

Maybe we should pass time and groups via keyword arguments to @formula, just like we discussed for weights. We would then have a generic mechanism to retrieve all these "non-matrix terms" by name for models and methods that need them.