Open Nosferican opened 5 years ago
The implementation depends on whether the time and grouping information is stored in the table or passed to the model specification. The latter is probably simpler to implement since we could just define a special function term and do whatever is needed to compute the column.
If we also want to support custom table types which would carry that information, we need an API (in Tables.jl?) for StatsModels to retrieve it and use it in the method for the function term.
Another issue is whether specific models would want to do something with that information, e.g. computing clustered standard errors. Then we would also need to store that somewhere.
Obtaining weights and custom representations, such as high-dimensional features to absorb or clusters, can be parsed from the `FormulaTerm` and the data. Terms such as `lag`, `lead`, and `diff` are different: they require a context, for example a time dimension and an optional panel identifier. That is why I think they might be worth special handling. For example, we could pass a context to `modelcols` that makes it aware of time/panel. It would then default to something like `mapreduce(table -> modelcols(t, table), vcat, groupby(data, pid))`. Maybe `modelcols` itself needs a `context` variable that can hold the time/panel information so that these terms can be implemented. Another option is to have it as part of the terms, `@formula(log(y) ~ diff(x, pid = County))`, but make it easier to use by standardizing `pid`/`tid` or something.
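To make the group-wise idea concrete, here is a minimal sketch in plain Julia (Base only, no StatsModels or DataFrames). The helpers `lag1` and `bygroup` are hypothetical names made up for illustration, not part of any package. The sketch assumes rows are already sorted by time within each panel. A plain `vcat` over groups would return values in group order, so this version writes each group's result back to its original row positions instead.

```julia
# Hypothetical helper: lag a vector by one position, filling with `missing`.
lag1(x::AbstractVector) = [missing; x[1:end-1]]

# Apply `f` within each group of `pid` and return the results in the
# original row order. Assumes rows are sorted by time within each group.
function bygroup(f, x::AbstractVector, pid::AbstractVector)
    length(x) == length(pid) ||
        throw(ArgumentError("x and pid must have the same length"))
    out = Vector{Union{Missing, eltype(x)}}(missing, length(x))
    for g in unique(pid)
        rows = findall(==(g), pid)   # row indices belonging to group g
        out[rows] = f(x[rows])       # scatter back to the original positions
    end
    return out
end

x   = [1.0, 2.0, 3.0, 10.0, 20.0]
pid = ["a", "a", "a", "b", "b"]
lagged = bygroup(lag1, x, pid)
```

With this input the first row of each panel is `missing`, so no value leaks from panel `"a"` into panel `"b"`. A context-aware `modelcols` would presumably do the equivalent internally, but how it receives `pid` is exactly the open design question here.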
Maybe we should pass time and groups via keyword arguments to `@formula`, just like we discussed for `weights`. We would then have a generic mechanism to retrieve all these "non-matrix terms" by name for models and methods that need them.
Virtually every statistical environment supports the notion of a dataset having repeated observations and potentially multiple units of observation (i.e., longitudinal). Terms 2.0 introduced terms that operate between the data and the representation. Terms such as `lag`/`lead`, `diff`, and others require the notion of time and potentially a group-by approach. Software usually handles this with `xtset panelvar timevar` in Stata, `pdata.frame(index = c(id, time))` in R's plm, `ID cross-section-id <time-series-id>` in the SAS CPANEL procedure, etc. Any thoughts on how to best integrate that within StatsModels?
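As a rough illustration of what an `xtset`-style declaration buys, the sketch below stores the panel and time columns in a small context object and uses them to compute a lag by time value rather than by row position. `PanelContext`, `panellag`, and `paneldiff` are hypothetical names invented for this example, not an existing or proposed StatsModels API. The sketch assumes each (panel, time) pair appears at most once.

```julia
# Hypothetical context type, loosely analogous to `xtset panelvar timevar`.
struct PanelContext{P<:AbstractVector, T<:AbstractVector}
    pid::P   # panel identifier for each row
    tid::T   # time index for each row
end

# Lag `x` by `k` periods: for each row, look up the value observed for the
# same panel at time `t - k`; yield `missing` if no such observation exists.
function panellag(x::AbstractVector, ctx::PanelContext, k::Integer = 1)
    length(x) == length(ctx.pid) == length(ctx.tid) ||
        throw(ArgumentError("x, pid and tid must have the same length"))
    lookup = Dict(zip(zip(ctx.pid, ctx.tid), x))   # (pid, t) => value
    return [get(lookup, (p, t - k), missing) for (p, t) in zip(ctx.pid, ctx.tid)]
end

# First difference built on the time-aware lag.
paneldiff(x, ctx, k = 1) = x .- panellag(x, ctx, k)

ctx = PanelContext(["a", "a", "a", "b", "b"], [2001, 2002, 2004, 2001, 2002])
x   = [1.0, 3.0, 4.0, 10.0, 25.0]
lagged = panellag(x, ctx)
diffed = paneldiff(x, ctx)
```

Because the lookup is keyed on (panel, time), the 2004 row of panel `"a"` gets `missing` rather than the 2002 value, and the result does not depend on row order. A purely positional lag cannot make that distinction, which is one argument for carrying the time index in the context rather than relying on sorted rows.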