JuliaArrays / ShiftedArrays.jl

Lazy shifted arrays for data analysis in Julia
Other
50 stars 10 forks source link

Lead/lag with respect to a time variable #37

Open matthieugomez opened 4 years ago

matthieugomez commented 4 years ago

Solve https://github.com/JuliaArrays/ShiftedArrays.jl/issues/13

matthieugomez commented 4 years ago

Any thought?

piever commented 4 years ago

Thanks for taking a stab at this!

My main concern is that ideally this would have a lazy implementation (just like lag, lead normally do), so that this line is executed on getindex.

I'm not sure whether it should be done by simply allowing ShiftedArray to have two arrays (keys and indices), or whether one could somehow get it to work using as parent array some custom array type which also stores its indices.

cc: @nalimilan (he is using this package in StatsKit, so he may have feedback on what is the best way to add time series support in a way that is helpful to the statistical ecosystem).

matthieugomez commented 4 years ago

I’m not sure it would be that ideal to have a lazy implementation of this lag function — no one wants a vector that is as slow as a dictionary. The trade-off does not seem to be worth it in this case.

The issue is that this package exports lag, so any method that uses base types needs to be in this package to avoid type piracy. On the other hand, I am not sure that ShiftedArray is the right type here.

On Mon, May 4, 2020 at 2:41 PM Pietro Vertechi notifications@github.com wrote:

Thanks for taking a stab at this!

My main concern is that ideally this would have a lazy implementation (just like lag, lead normally do), so that this line https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37/files#diff-278249179b68fe78888f2d5e18630430R142 is executed on getindex.

I'm not sure whether it should be done by simply allowing ShiftedArray to have two arrays (keys and indices), or whether one could somehow get it to work using as parent array some custom array type which also stores its indices.

cc: @nalimilan https://github.com/nalimilan (he is using this package in StatsKit, so he may have feedback on what is the best way to add time series support in a way that is helpful to the statistical ecosystem).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-623635901, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNBA4IGAN7OFJT73PDRP4D4LANCNFSM4MNL2TJA .

nalimilan commented 4 years ago

To me it sounds like this should be a specialization of lead and lag for TimeArrays instead. TimeArrays can be constructed by passing timestamps and values, so lag(TimeArray(times, values), Day(1)) could give you the same result as what this function does.

matthieugomez commented 4 years ago

How would that work with panel data though? ie using lag on a grouped DataFrame?

On Mon, May 4, 2020 at 3:05 PM Milan Bouchet-Valat notifications@github.com wrote:

To me it sounds like this should be a specialization of lead and lag for TimeArrays instead. TimeArrays can be constructed by passing timestamps and values, so lag(TimeArray(times, values), Day(1)) could give you the same result as what this function does.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-623647440, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNVNQOHXGYHMBMRW23RP4GV3ANCNFSM4MNL2TJA .

nalimilan commented 4 years ago

Well GroupedDataFrame gives SubArray views of the columns for rows in the group, so you'd get a SubArray of a TimeArray and lag would be implemented by that type I guess.

matthieugomez commented 4 years ago

I don’t think that would work: time stamps of a TimeArray need to be unique.

On Mon, May 4, 2020 at 3:31 PM Milan Bouchet-Valat notifications@github.com wrote:

Well GroupedDataFrame gives SubArray views of the columns for rows in the group, so you'd get a SubArray of a TimeArray and lag would be implemented by that type I guess.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-623660059, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNASFWD7DCGTS3RW7DRP4JX7ANCNFSM4MNL2TJA .

nalimilan commented 4 years ago

Ah you mean that you want duplicate times in the data frame that would be unique only within each group?

matthieugomez commented 4 years ago

Yes, exactly.

nalimilan commented 4 years ago

OK. Then the only solution would be to call lag(TimeArray(times, values), Day(1)) for each group inside combine/select. But that wouldn't have any advantage over adding methods like in this PR, so...

It's still annoying that lead and lag are currently lazy in ShiftedArrays. This makes sense for the current methods since it's very efficient, but for methods added by this PR the overhead is much higher. Being lazy can be useful even there if you go over the data only once, but otherwise collecting it will be faster.

This choice between lazy and eager is a broader issue that doesn't have a very good solution currently. Julia has filter and Iterators.filter, DataFrames has stack and stackview (unexported)... all of that is quite inconsistent.

By the way, if methods from this PR requires times to be sorted (at least by default), wouldn't they be able to work much more efficiently, just by checking whether the previous timestamp is equal to the current one minus period? Then it could make sense to be lazy. It may even be faster to sort the data (notably using radix sort) than to use a Dict.

piever commented 4 years ago

Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the sortperm in the lazy object). I'm actually not 100% sure that the lazy object should be an AbstractArray, or just an iterator. It feels like it is simpler to define iteration efficiently rather than getindex.

As a side comment, if this becomes a special method for TimeArray (which IMO makes a lot of sense), I think it should live in TimeSeries (ShiftedArrays is much more lightweight, so it would make more sense for TimeSeries to depend on it than viceversa).

matthieugomez commented 4 years ago

I don’t think it would work since you may want to lag by more than one period.

Actually, it looks like TimeArray actually has a lag function, which does not seem to do what I propose, and also does not inherit from shiftedarray....

On Mon, May 4, 2020 at 6:18 PM Pietro Vertechi notifications@github.com wrote:

Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the sortperm in the lazy object). I'm actually not 100% sure that the lazy object should be an AbstractArray, or just an iterator. It feels like it is simpler to define iteration efficiently rather than getindex`.

As a side comment, if this becomes a special method for TimeArray (which IMO makes a lot of sense), I think it should live in TimeSeries (ShiftedArrays is much more lightweight, so it would make more sense for TimeSeries to depend on it than viceversa).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-623737869, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNQOGADIJCYUSTJCVLRP45MXANCNFSM4MNL2TJA .

matthieugomez commented 4 years ago

The other thing is that TimedArray does not work for dates in Int, which is common to hold years. This + the inability to have duplicate times makes it sound overly restrictive to be the only one type that can be lagged. @iblis17 On Mon, May 4, 2020 at 8:22 PM Matthieu Gomez gomez.matthieu@gmail.com wrote:

I don’t think it would work since you may want to lag by more than one period.

Actually, it looks like TimeArray actually has a lag function, which does not seem to do what I propose, and also does not inherit fro shiftedarray....

On Mon, May 4, 2020 at 6:18 PM Pietro Vertechi notifications@github.com wrote:

Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the sortperm in the lazy object). I'm actually not 100% sure that the lazy object should be an AbstractArray, or just an iterator. It feels like it is simpler to define iteration efficiently rather than getindex`.

As a side comment, if this becomes a special method for TimeArray (which IMO makes a lot of sense), I think it should live in TimeSeries (ShiftedArrays is much more lightweight, so it would make more sense for TimeSeries to depend on it than viceversa).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-623737869, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNQOGADIJCYUSTJCVLRP45MXANCNFSM4MNL2TJA .

nalimilan commented 4 years ago

I don’t think it would work since you may want to lag by more than one period.

Is that really a problem? You can just check go back by one timestamp until you find one which is either equal or older than the current timestamp minus period. That should be efficient unless you want to lag by a large number of periods compared to the average resolution of the timestamps.

That said, I'm no longer sure that using TimeArray is the best approach. Indeed the proposed methods work for any type (notably integers), and not just dates/times.

Can you point at the corresponding APIs in other languages? That may give us some ideas.

iblislin commented 4 years ago

Hi @matthieugomez

I have a proposal for these functions:

Since the project goal of TimeSeries.jl is a versatile time series toolbox. I want to make some common functions available for AbstractArray, but I don't do it yet. :p

Oh, another issue: Is there a interface package for lag, lead, head, tail? I always got this:

WARNING: both TimeSeries and DataFrames export "tail"; uses of it in module Main must be qualified

And the same story will happen on TimeSeries and ShiftedArray.

matthieugomez commented 4 years ago

@nalimilan

Defining a new timed data frame type is certainly an option, although it would be much more complicated than what this pull request does. Another downside, AFIAK, is that it would not extend to other type of Tables.

As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc...

nalimilan commented 4 years ago

Thanks for the references. It would be useful if you could file an issue in DataFrames listing the operations that should be supported. That way we will be able to discuss what's the best solution to implement a time-aware type in/based on DataFrames.

As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc...

I guess you could assume that the vector of timestamps isn't mutated. That's a natural expectation for a lazy view.

Oh, another issue: Is there a interface package for lag, lead, head, tail? I always got this:

WARNING: both TimeSeries and DataFrames export "tail"; uses of it in module Main must be qualified

And the same story will happen on TimeSeries and ShiftedArray.

@iblis17 head and tail are deprecated in DataFrames, so they will be removed at some point. FWIW, we moved to using first and last so maybe that would be an option for TimesSeries too?

For lag and lead, we could define them in DataAPI, but it's difficult to coordinate multiple packagesif they want to defined methods that operate on types they don't own, which is quite typical for theese functions (like ShiftedArrays does). That would require defining very clearly which package is supposed to provide which method.

iblislin commented 4 years ago

@nalimilan

head and tail are deprecated in DataFrames, so they will be removed at some point. FWIW, we moved to using first and last so maybe that would be an option for TimesSeries too?

:+1: I will check them and try to move to first and last.

For lag and lead, we could define them in DataAPI, but it's difficult to coordinate multiple packagesif they want to defined methods that operate on types they don't own, which is quite typical for theese functions (like ShiftedArrays does). That would require defining very clearly which package is supposed to provide which method.

Oh, okay. I just read the doc of ShiftedArrays. Seem the "just-let-them-conflict" policy is suitable for the types not owned by the package.

matthieugomez commented 4 years ago

@nalimilan that seems way too error prone too me. I think the best option would be to rename the lag function in this package to shift, which is more consistent with the name of this package anyway. lag would be reserved for a time non lazy lag.

FuZhiyu commented 3 years ago

Thanks @matthieugomez for working on this! This functionality has been hugely (also surprisingly) missing outside of Stata-verse. Here are my two cents on it:

I don’t know Panda that well but it seems that the shift function corresponds to the lag function in dplyr

Pandas now support shift by frequency. Not sure about the underlying implementation but it's pretty slow at least on quarterly series. Also, it's not compatible with MultiIndex in pandas which is also unfriendly to panel data. This leaves me wondering after so many years how people deal with panel data outside of Stata.

I like the current implementation to take another argument of another time index. Since there potentially be multiple columns on the same timestamps, it'll be redundant to attach every column with time information as in TimeArray. Another approach would be to define a TimeDataFrame but I also agree with @matthieugomez on the downside of that path. The current implementation is more flexible and simple.

Concerning sorting and laziness: I find it totally reasonable to assume the time index has been sorted. In most of the applications we have sorted time indexes anyways. Operating on a sorted index is much more efficient as you only need to check n-steps back, so I tend to think it should be the default algorithm while the dictionary-based approach can be a fallback. In Stata/Pandas the lag/lead operators also require the data to be sorted by time indexes (the time variable in their applications are literally the indexes of the data so it's admittedly less error-prone).

Also consistent with the current method of this package, the lag&lead are relative to the index of the arrays. This should give the user a hint that any other indexes they are passing to the time-aware method better to be sorted.

eloualiche commented 2 years ago

Bump! It would be great to be able to use this package for (unbalanced) panel data.

I don't think everybody is only focused on time series ;)

matthieugomez commented 2 years ago

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

eloualiche commented 2 years ago

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

Agreed! I did not think of that. This is a great insight!

nalimilan commented 2 years ago

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

@matthieugomez I don't understand this point. I guess you would do something like combine(groupby(df, :key), [:x, :times] => lag => :x_lagged) right? Then lag just has to check that the SubArray it gets passed for times (corresponding to a particular group) is sorted (this is super cheap).

EDIT: Incidentally, this is how PanelShift.jl implements this, even if it's not lazy, so that's probably an efficient strategy in general. Cc: @FuZhiyu

FuZhiyu commented 2 years ago

Laziness vs being sorted seems two orthogonal issues. We need :times to be sorted, while we probably want :x_lagged to be lazy. As long as times is not lazy, checking whether it's sorted seems pretty cheap.

As for PanelShift.jl, I didn't implement the lazy approach because 1) I'm not familiar with the machinery of laziness 2) Usually we don't have a need for a lazy lagged array in panel data.

But if the maintainer of ShiftedArrays.jl thinks it's a good idea to incorporate PanelShifts into this package I'm also happy to dig into the lazy wood to make it lazy.

matthieugomez commented 2 years ago

Is it really cheap to check that times is sorted every time getindex(:x_lagged, i) is called?

FuZhiyu commented 2 years ago

ok I see the issue. If for laziness we are not allowed to store the new index for x_lagged, then indeed we need to check whether times is sorted every time we call getindex.

In the case where there are gaps in time arrays, we could either store the whole new index in the ShiftedArray, or store multiple shifts and use them depending on the queried index. But both solutions require different designs of ShiftedArray and seem a little bit awkward.

nalimilan commented 2 years ago

My point is that we could check whether times are sorted only when creating the object. Then getindex would simply assume that the underlying times are not mutated. Am I missing something?

matthieugomez commented 2 years ago

I guess, pragmatically, I am not sure what would be returned by this command:

df = DataFrame(times = [0, 1], x = ["V1", "V2"])
df.y = lag(df.x, df.times)
sort!(df, :times, rev = true)
df.y
matthieugomez commented 2 years ago

@nalimilan The other thing is that, assuming that the underlying times are not mutated seems very error prone to me. Safety seems more important than memory efficiency.

FuZhiyu commented 2 years ago

I now feel lazy vectors may not be a good idea for time vectors with gaps. Let me clarify.

The current type ShiftedArray stores shifts, so when getindex(:x_lagged, i) is called, it returns x_lagged[i-shift]. However, when there are gaps in time, shifts are no longer constant so the current design cannot be applied directly. Say we have x_lagged=lag(time, x, 1), where x is a length-N vector We can do:

  1. Store a length-N vector of shifts in x_lagged, so when we call getindex(:x_lagged, t), it returns x_lagged[t-shift[t]]. But since we already saved another shift vector, what's the benefit of being lazy?
  2. A true lazy approach, so every call of getindex(x_lagged, t), as @matthieugomez suggests, checks whether time is sorted and searches for x at t-1. Super safe, super inefficient.

Both approaches are safe for post-creation mutation to the time vector though. but neither seem to have an advantage of a non-lazy method.

Or is there any other implementation for lazy shiftedarrays that I haven't thought of?

piever commented 2 years ago

I think it may be important to give some perspective from the point of view of the ShiftedArrays library on this type of functionality.

In my understanding, the main goal of this package is to provide a lazy, shifted array abstraction (ShiftedArray and CircShiftedArray) and, whenever possible, add lazy, memory-efficient versions of established functions (Base.circshift, AbstractFFTs.fftshift).

lag and lead are exceptions to this, and at the moment they are mostly helpers for the ShiftedArray function, i.e., lag(v, n) = ShiftedArray(v, n), lead(v, n) = ShiftedArray(v, -n). They are not crucial, in that the constructor already does most of the job.

I don't think adding additional eager lag and lead methods (based on a Dict) here is consistent with the rest of the library. It would also increase the maintenance burden for this package, which at the moment is quite low, as it is a low-level package with no dependencies. I'm afraid that would change if ShiftedArrays became more of a "toolkit to work with sequence data".

IMO, possible ways forward would be the following.

  1. Unexport lag and lead from ShiftedArrays. A separate package (StatsBase.jl, or a hypothetical WindowsFunctions.jl) would define eager versions, with all the required features. ShiftedArrays.lag would be a lazy version of WindowsFunctions.lag whenever possible. Same could be done for the diff proposal in #51 (WindowsFunctions.diff is eager, ShiftedArrays.diff is lazy).
  2. Same as 1., but lag and lead are removed altogether from here, the ShiftedArray constructor is probably enough (see https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-624680509).
  3. Do not change anything in ShiftedArrays, but use instead a custom vector type and specialize lag(CustomVector(times, values), deltat). I think CustomVector should be some sort of AxisArray, with a list of sorted indices and a list of values (see https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37#issuecomment-623647440).
nalimilan commented 2 years ago

Thanks for commenting. I agree it's reasonable to keep ShiftedArrays oriented towards lazy operations only, and put eager methods in another package.

I'd tend to prefer your solutions 1 or 2. I actually think it would make sense to define empty lead and lag functions in StatsAPI so that other packages can provide appropriate methods for custom array types (as long as they are consistent with the general definition). Fallback definitions for any AbstractVectors, which would be eager, could live in StatsBase (if others agree).

matthieugomez commented 2 years ago

I would love Solution 1 or 2

mkitti commented 2 years ago

Option 1 is the best choice. You can always remove them later as in Option 2.