Open oxinabox opened 5 years ago
I wonder whether it's worth re-implementing what's basically a join here for this special case...could we use DataFrame's join or some such?
I guess this is kind of like a join.
The main similarity is how non-matching entries are handled.
But it is on a column that we don't instantiate: `index_col .- lag`.
To do this with DataFrames.join we would need to instantiate that column, and put the main column into a DataFrame too. Since the main column may be a matrix, that would likely end up requiring a copy and then gluing it back together and ... We would also need to put it and the index column into another DataFrame that we join on.
I think this would be surprisingly frustrating to make work and slower than it should be. And would only work on DataFrames. And would not allow approx equality.
Idk if there is a generic Tables.jl join anywhere.
TableOperations.jl was supposed to host standard operations on generic tables. There's also QueryOperators.jl, which already provides `join` and is relatively lightweight. But it might not be worth worrying too much about duplication if the code for this particular join isn't too complex. That's one of the advantages of Julia.
Anyway, I feel like the more debatable part of this is the whole IndexTerm idea, and having the 2-argument `lag` work by deciding that a column is an IndexTerm, rather than the details of how to use that index, which we can always improve later without a breaking change, since it is an internal detail.
Also, how should we differentiate this from the basic Lag in #108?
I feel like having a simple row-wise lag is also something we want.
Maybe this should be called `indexedlag`? `ilag`?
Or the one in #108 could be `basiclag` or `lagobs` or `lagrows`? Or `structural_lag`?
I'm not a fan of automatic detection of "index" terms. I'd prefer them to be specified explicitly, either as an additional argument to `lag` (as you suggest; I don't think a separate function is needed), or as a kind of "meta" term (similar to what we discussed for weights or clusters, see https://github.com/JuliaStats/StatsModels.jl/issues/21). Maybe `lag` could default to using the index from the "meta" term if it's provided, but you would also be able to specify/override it via an argument.
I also think an implicit index is limited in its use.
The main use case is for things that have exactly 1 indexy-looking thing,
e.g. exactly one `DateTime` column.
But basically all the data I work with has at least two:
It has a release date, which is when the data became available,
and it has the actual datetime column which is when the data is about.
So for forecasts, release date will be before the actual datetime; and for historical data, there is often an embargo between them.
So maybe we don't go anywhere with index terms.
`lag`/`lead`/`diff` and other time operators have both a temporal/order index and a group index. For example, if a dataset has repeated observations per individual, the `lag` should be performed by unit of observation and not leak observations from another individual.
In Stata, `xtset` declares both (a unit-of-observation identifier and a temporal identifier).
One alternative is to simply define `lag` on AbstractVectors:
function lag(x::AbstractVector, t::AbstractVector, n = 1)
    # indexin returns `nothing` where t - n has no match in t
    [i === nothing ? missing : x[i] for i in indexin(t .- n, t)]
end
Then this function can be applied to any pair of variables in a Table, and can also be used within groups (@Nosferican).
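To make the "within groups" part concrete, here is a rough sketch of how that vector-level `lag` could be applied per unit of observation. `grouped_lag` and the toy data are made up for illustration (they are not from StatsModels or any package), and only Base functions are used.

```julia
# Vector-level lag as proposed in the thread (with the closing `end` added).
function lag(x::AbstractVector, t::AbstractVector, n = 1)
    [i === nothing ? missing : x[i] for i in indexin(t .- n, t)]
end

# Hypothetical helper: apply `lag` separately within each group,
# so values never leak from one unit of observation to another.
function grouped_lag(x::AbstractVector, t::AbstractVector, g::AbstractVector, n = 1)
    out = Vector{Union{Missing, eltype(x)}}(missing, length(x))
    for key in unique(g)
        rows = findall(isequal(key), g)   # row numbers belonging to this group
        out[rows] = lag(x[rows], t[rows], n)
    end
    return out
end

x = [10, 11, 12, 20, 21]
t = [1, 2, 3, 1, 3]            # unit "b" has no observation at t = 2
g = ["a", "a", "a", "b", "b"]
result = grouped_lag(x, t, g)
```

Unit "b" gets `missing` at `t = 3` because it has no row at `t = 2`, rather than picking up the value from unit "a".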
The issue is that it would not work inside a formula.
Aye. From Slack #statistics, we are considering passing the indices (unit ID and temporal ID) as part of the schema so the information is available for `modelcols`. Those arguments could be handled at `fit` rather than `@formula`. That approach allows for generic data types.
Kinda related to #90, but not about supporting special table types, but coming at it from the other direction; supporting/defining special columns.
https://github.com/JuliaStats/StatsModels.jl/pull/108 introduced a very basic Lag, which is just based on getting the previous row. Doesn't know about time.
I propose a more general solution:
Index Terms
An `IndexTerm` represents some variable that we are going to base our Lag off. There would not be a direct element for an `IndexTerm` in a formula. But it can be inferred during `apply_schema`, either via hints, or as the default for `DateTime`, `Period` (or from packages: TimeZones.jl's `ZonedDateTime`, and Intervals.jl's `AnchoredInterval`. Need to work out how default hints for package types work.). It can also be created via the 3-argument form of Lag. See below.
Lag: FunctionTerm
This comes in a 3-argument form, `lag(main_term, step, index_term)`, e.g. `lag(x, Day(3), release_date)`, and a 2-argument form, `lag(main_term, step)`. The 2-argument form works out the `index_term` based on whether there is only one column that the schema + hints agree should be the index. If there are multiple, it throws an ambiguity error.
apply_schema
During `apply_schema`, `LagTerm` works out what its `index_term` is, and what its `main_term` is (which thus lets it work out its `width` etc).
modelcols
During `modelcols`, the LagTerm calls `modelcols` on its `index_term::IndexTerm` and its `main_term`. It then runs `sortperm` on the resolved index column. Then it generates the new lagged main column. The code for this isn't exactly the fastest, but it is robust against irregular spacing: basically walk back through the rows as sorted by `index_col`, looking for one that is `lag.step` ago. Alternatives include detecting if the data is regularly spaced (by something that has a suitable lowest common multiple with our `lag.step`) and just jumping directly to the answer. Or making an index `Dict` (downside: doesn't work with approx equality, so trouble with floats, though small integer-valued floats tend to be fine).
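For reference, the walk-back idea can be sketched roughly as below. This is not the code from the PR: `indexed_lag` and the `atol` keyword are hypothetical names, it assumes a plain vector for the main column (not a matrix), and it uses a `Float64` index to show the approximate-equality case.

```julia
# Hypothetical sketch of a lag keyed on an index column, tolerant of
# irregular spacing and of approximate matches (within `atol`).
function indexed_lag(main_col::AbstractVector, index_col::AbstractVector, step;
                     atol = zero(step))
    perm = sortperm(index_col)   # row numbers in increasing index order
    out = Vector{Union{Missing, eltype(main_col)}}(missing, length(main_col))
    for (pos, row) in enumerate(perm)
        target = index_col[row] - step
        # Walk back through earlier rows (in index order) looking for the target.
        for prev_pos in (pos - 1):-1:1
            prev_row = perm[prev_pos]
            gap = index_col[prev_row] - target
            if -atol <= gap <= atol
                out[row] = main_col[prev_row]
                break
            elseif gap < -atol
                break   # rows further back are earlier still, so no match exists
            end
        end
    end
    return out
end

main_col  = [30, 10, 20, 60]
index_col = [0.3, 0.1, 0.2, 0.6]   # unsorted and irregularly spaced
result = indexed_lag(main_col, index_col, 0.1; atol = 1e-9)
```

With this data the row at index `0.3` picks up the value from `0.2`, even though `0.3 - 0.1` is not exactly `0.2` in `Float64`; an exact lookup (a `Dict` or `indexin`) would miss it. The row at `0.6` stays `missing` because nothing sits at `0.5`.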