JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia

Advanced Lead/Lag, via Index terms #109

Open oxinabox opened 5 years ago

oxinabox commented 5 years ago

Kinda related to #90, though not about supporting special table types; this comes at it from the other direction: supporting/defining special columns.

https://github.com/JuliaStats/StatsModels.jl/pull/108 introduced a very basic Lag, which is just based on getting the previous row. Doesn't know about time.

I propose a more general solution:

Index Terms

An IndexTerm represents some variable that we are going to base our Lag on.

There would not be a direct element for an IndexTerm in a formula. But it can be inferred during apply_schema, either via hints, or as the default for DateTime and Period columns (or for types from packages: TimeZones.jl's ZonedDateTime and Intervals.jl's AnchoredInterval. Need to work out how default hints for package types would work). It can also be created via the 3 argument form of Lag. See below.

Lag: FunctionTerm

This comes in a 3 argument form lag(main_term, step, index_term), e.g. lag(x, Day(3), release_date), and a 2 argument form lag(main_term, step). The 2 argument form works out the index_term by checking whether there is exactly one column that the schema + hints agree should be the index. If there are multiple, it throws an ambiguity error.
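As a rough sketch of how the 2 argument form could pick its index column: the helper below is hypothetical (resolve_index, coltypes and hints are made-up names, not StatsModels API). It assumes that explicit hints take priority over type-based detection, which the proposal above does not settle.

```julia
using Dates

# Hypothetical helper, not part of StatsModels: choose the index column
# for the 2 argument lag. `coltypes` maps column names to element types;
# `hints` lists columns the user explicitly marked as an index.
function resolve_index(coltypes::Dict{Symbol, DataType}, hints::Vector{Symbol} = Symbol[])
    # Assumption: date/time and period columns are index-like by default.
    isindexy(T) = T <: Union{Dates.TimeType, Dates.Period}
    candidates = isempty(hints) ? [name for (name, T) in coltypes if isindexy(T)] : hints
    length(candidates) == 1 && return first(candidates)
    isempty(candidates) && throw(ArgumentError("no index column found; pass one explicitly"))
    throw(ArgumentError("ambiguous index columns $(sort(candidates)); pass one explicitly"))
end

cols = Dict{Symbol, DataType}(:x => Float64, :release_date => DateTime)
index_name = resolve_index(cols)
```

With one DateTime column this resolves to that column. With two index-like columns and no hint it throws, which is the ambiguity error described above.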

apply_schema

During apply_schema, LagTerm works out what its index_term is, and what its main_term is (which thus lets it work out its width etc.).

modelcols

During modelcols, the LagTerm calls modelcols on its index_term::IndexTerm and its main_term. It then runs sortperm on the resolved index column. Then it generates the new lagged main column by:

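perm = sortperm(index_col)
# One result per row of main_cols, visited in index-sorted order.
map(1:size(main_cols, 1)) do ii
    cur_index = index_col[perm[ii]]
    jj = ii
    # Walk back until we reach a row that is at least lag.step before the current one.
    while jj > 1 && cur_index - index_col[perm[jj]] < lag.step
        jj -= 1
    end
    # Only accept the row if it is (approximately) exactly lag.step ago.
    if cur_index - index_col[perm[jj]] ≈ lag.step
        return main_cols[perm[jj], :]
    else
        return missing
    end
end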

It isn't exactly the fastest code, but it is robust against irregular spacing. Basically, walk back through the rows as sorted by index_col and look for one that is lag.step ago. Alternatives include detecting whether the data is regularly spaced (by something that has a sufficiently small lowest common multiple with our lag.step) and just jumping directly to the answer. Or making an index Dict (downside: doesn't work with approx, so trouble with Floats, though small float integers tend to be fine).
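For illustration, here is a self-contained version of the walk-back for a single vector column. indexed_lag is a hypothetical name, not something in StatsModels. This sketch differs from the snippet above in two ways: it uses exact equality rather than ≈, and it writes results back in the original row order.

```julia
using Dates

# Hypothetical helper: lag `main_col` by `step` along `index_col`,
# returning values in the original (unsorted) row order.
function indexed_lag(main_col::AbstractVector, index_col::AbstractVector, step)
    perm = sortperm(index_col)
    out = Vector{Union{Missing, eltype(main_col)}}(missing, length(main_col))
    for ii in eachindex(perm)
        cur_index = index_col[perm[ii]]
        jj = ii
        # Walk back until the gap to the current row is at least `step`.
        while jj > 1 && cur_index - index_col[perm[jj]] < step
            jj -= 1
        end
        # Exact match only; a Float index would want isapprox here.
        if cur_index - index_col[perm[jj]] == step
            out[perm[ii]] = main_col[perm[jj]]
        end
    end
    return out
end

# Irregularly spaced, unsorted dates.
dates = [Date(2020, 1, 4), Date(2020, 1, 1), Date(2020, 1, 2), Date(2020, 1, 7)]
x = [40.0, 10.0, 20.0, 70.0]
lagged = indexed_lag(x, dates, Day(3))
# lagged is [10.0, missing, missing, 40.0]
```

Only the rows for Jan 4 and Jan 7 have an observation exactly three days earlier, so the other two come back as missing.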

kleinschmidt commented 5 years ago

I wonder whether it's worth re-implementing what's basically a join here for this special case... could we use DataFrames' join or some such?

oxinabox commented 5 years ago

I guess this is kind of like a join. The main similarity is how non-matching things are handled. But it joins on a column that we don't instantiate: index_col .- lag.

To do this with DataFrames.join we would need to instantiate that column, and put the main column into a DataFrame too. Which since it may be a matrix would likely end up requiring a copy and then gluing it back together and ... And also we would need to put it and the index column into another DataFrame that we join on.

I think this would be surprisingly frustrating to make work and slower than it should be. And would only work on DataFrames. And would not allow approx equality.

Idk if there is a generic Tables.jl join anywhere.

nalimilan commented 5 years ago

TableOperations.jl was supposed to host standard operations on generic tables. There's also QueryOperators.jl, which already provides join and is relatively lightweight. But it might not be worth worrying too much about duplication if the code for this particular join isn't too complex. That's one of the advantages of Julia.

See also https://github.com/Nosferican/EconUtils.jl/blob/751616c4fd630f97f14e6497517be05eb20575a1/src/firstdifference.jl

oxinabox commented 5 years ago

Anyway, I feel like the more debatable part of this is the whole IndexTerm idea, and having 2 argument lag work by deciding a column is an IndexTerm.

Rather than the details of how to use that index, which we can always improve later without a breaking change, since it is an internal detail.

oxinabox commented 5 years ago

Also, how should we differentiate this from the basic Lag in #108? I feel like having a simple row-wise lag is also something we want. Maybe this should be called indexedlag? ilag? Or the one in #108 could be basiclag, lagobs, lagrows, or structural_lag?

nalimilan commented 5 years ago

I'm not a fan of automatic detection of "index" terms. I'd prefer them to be specified explicitly, either as an additional argument to lag (as you suggest; I don't think a separate function is needed), or as a kind of "meta" term (similar to what we discussed for weights or clusters, see https://github.com/JuliaStats/StatsModels.jl/issues/21). Maybe lag could default to using the index from the "meta" term if it's provided, but you would also be able to specify/override it via an argument.

oxinabox commented 5 years ago

I also think an implicit index is limited in its use. The main use case is for things that have exactly 1 indexy looking thing, e.g. exactly one DateTime column. But basically all the data I work with has at least two: it has a release date, which is when the data became available, and it has the actual datetime column, which is when the data is about. So for forecasts, the release date will be before the actual datetime; and for historical data, there is often an embargo between them.

So maybe we don't go anywhere with index terms.

Nosferican commented 5 years ago

lag/lead/diff and other time operators have both a temporal/order index and a group index. For example, if a dataset has repeated observations per individual, the lag should be performed by unit of observation and not leak observations from another individual.

In Stata, this is what xtset declares (a unit of observation identifier and a temporal identifier).
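A minimal sketch of a lag that respects both indices, assuming plain vectors for the unit ID and the temporal ID. grouped_lag is a hypothetical name, and it uses exact Dict lookup, so it shares the Float caveat mentioned earlier in the thread.

```julia
# Hypothetical helper: lag `x` by `n` time units within each unit of observation.
function grouped_lag(x::AbstractVector, id::AbstractVector, t::AbstractVector, n = 1)
    # Map each (unit, time) pair to its row, so a lookup can never cross units.
    rows = Dict((id[i], t[i]) => i for i in eachindex(x))
    return map(eachindex(x)) do i
        j = get(rows, (id[i], t[i] - n), nothing)
        j === nothing ? missing : x[j]
    end
end

id = [1, 1, 1, 2, 2]
t  = [2000, 2001, 2003, 2001, 2002]
x  = [1.0, 2.0, 3.0, 4.0, 5.0]
lagged = grouped_lag(x, id, t)
# lagged is [missing, 1.0, missing, missing, 4.0]
```

Row 4 (unit 2 in 2001) stays missing even though unit 1 has an observation for 2000, which is the "no leaking" behaviour described above.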

matthieugomez commented 5 years ago

One alternative is to simply define lag on AbstractVectors:

function lag(x::AbstractVector, t::AbstractVector, n = 1)
    # indexin returns nothing where t - n has no match in t
    [i === nothing ? missing : x[i] for i in indexin(t .- n, t)]
end

Then this function can be applied to any pair of variables in a Table, and can also be used within groups (@Nosferican).
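A quick usage sketch on an irregularly spaced integer time index. The function is repeated here so that the snippet runs on its own; the data are made up.

```julia
# Vector-based lag as proposed in this comment, with the closing `end` in place.
function lag(x::AbstractVector, t::AbstractVector, n = 1)
    return [i === nothing ? missing : x[i] for i in indexin(t .- n, t)]
end

t = [1, 2, 4, 5]        # note the gap at t = 3
x = [10, 20, 40, 50]
lag1 = lag(x, t)        # one time unit back
lag2 = lag(x, t, 2)     # two time units back
# lag1 is [missing, 10, missing, 40]
# lag2 is [missing, missing, 20, missing]
```

Since indexin matches exactly, this only finds observations that sit precisely n units back. Applying it within groups would mean calling it once per group on that group's x and t.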

The issue is that it would not work inside a formula.

Nosferican commented 5 years ago

Aye. From Slack #statistics, we are considering passing the indices (unit ID and temporal ID) as part of the schema so the information is available for modelcols. Those arguments could be handled at fit rather than @formula. That approach allows for generic data types.