Rolling window / group_by_dynamic() functionality

ahacking commented 1 year ago

I have been experimenting with this awesome library in livebook, what a great combination!

One of the features I was not able to find despite reading all the documentation was the concept of rolling windows. I did see the window functions for canned aggregations but nothing that compares to Polars groupby_dynamic.

I basically hit the wall quite hard with Explorer on what appears to be a missing capability, as I do need to compute derived values on a rolling window basis similar to what we can achieve with regular group_by().

Is dynamic grouping/windowing on the roadmap for Explorer?

Is there another way to achieve a similar outcome? Nx?

Thanks for the great work on this so far, it absolutely murders python pandas for performance and usability with a much cleaner API surface that is incredibly easy to reason about.

josevalim commented 1 year ago

Hi @ahacking, can you expand a bit on the exact problem that you have?

A rolling query could be written as:

df
|> DF.mutate(year: year(date))
|> DF.group_by([:year])
|> DF.mutate(mean: window_mean(close, 2))

However, we are missing the year function. It could be added in a similar way that #480 was added. :)

On the other hand, if you want a window every 1y6mon then it is a bit more complicated. We would need a function that returns months from epoch and then group them based on that. So it looks doable with our verbs, but more work and base functions will be necessary. We will be happy to provide guidance. Our goal is to mirror what dplyr provides here: https://dplyr.tidyverse.org/articles/window-functions.html

ahacking commented 1 year ago

Thanks Jose, just to clarify my use case is not date based but simply a sliding window of n rows.

The actual problem I have is to calculate on a running row by row basis the number of distinct values within the previous n rows (the sliding window).

The lead() and lag() approach of dplyr would probably work as I could base the mutation on n_distinct(lag(x,period)).

The Polars dynamic group also seemed to fit nicely with the group_by concept with every, period and offset options, but only if it can be applied on top an existing static grouping.

On a related note I also tried to approach a similar problem from a number of angles, and thought perhaps I could make use of a monotonic index but I didn't see an obvious way to generate a series from a range without generating an intermediate list. My dataset has tens of millions of rows so creating an intermediate list seemed like an inefficient approach compared to say Series.from_range(). Close to this use case was creating a series of a defined length from a fixed value of a defined size/length, eg a series of '1's, perhaps something similar to Nx.tile()

josevalim commented 1 year ago

So it sounds like focusing on lead() and lag() would help us unblock the next batch of functions for tackling those problems. Do you mind if we rename the issue to say to say "Add lead() and lag()"? Once we add these two it may be easier to add the rest.

ahacking commented 1 year ago

Sounds great to me!

ahacking commented 1 year ago

I had a much closer look further at the semantics in dplyr and I don't believe lead and lag offer what is required. It is basically some macros around Series.shift() which we already have. I would note that an option to provide the padding value to shift would be a useful addition.

What I am seeking is the ability to have a series based on a rolling window where the number of rows in a window would be determined by a windowing definition, either number of rows or some range, eg a date/time range.

Polars _groupbydynamic provides a mechanism to achieve the missing rolling window functionality.

Another alternative would be functionality similar to _windowmax, e.g just Series.window(series, window_size, opts \ []) which returns a rolling window on a series which could then be passed/piped to any series function which would produce another series of the same length as the origin series, after applying the function. I would also suggest a window option for padding as suggested for Series.shift() and lead / lag.

I could look at working up a PR, but I'd like your thoughts on the feasibility and the preferred API first.

josevalim commented 1 year ago

Thanks @ahacking. In this case we need an opinion from dplyr experts such as @cigrainger and @kimjoaoun.

@cigrainger and @kimjoaoun, is this computation supported by dplyr? If not, then how it should look like here?

In case you don't want to read the whole issue, here is the crux of the problem:

What I am seeking is the ability to have a series based on a rolling window where the number of rows in a window would be determined by a windowing definition, either number of rows or some range, eg a date/time range.

What I could find online is this: https://stackoverflow.com/questions/42960646/calculate-stats-based-on-dynamic-window-using-dplyr

It seems to make sense, they filter the operation against the dataset itself. I took a quick look at groupby_dynamic and it seems to be doing something similar. But I am not sure if we support it just yet. @philss, what happens if we do a lazy query and on operations such as left < right, one side is lazy and the other is eager?

Another concern is that Polars is essential to us but also an implementation detail, so whatever API we provide, we don't want to tightly couple it to Polars. I understand why they call it groupby_dynamic, but the grouping functionality in dplyr/explorer is more expressive than the groups in Polars, so we need to be really careful with the semantics. On the other hand, our groups internally are precisely keeping several groups in memory, so we may be able to support this.

philss commented 1 year ago

what happens if we do a lazy query and on operations such as left < right, one side is lazy and the other is eager?

@josevalim in some operations we can convert the eager series to an expression using https://pola-rs.github.io/polars/polars/prelude/struct.Series.html#method.lit We do that with the function Explorer.PolarsBackend.Native.expr_series/1. So internally they work like they were lazy.

josevalim commented 1 year ago

@philss we can… but we do not right? Should we open up an issue for that?

philss commented 1 year ago

@josevalim I think it will work for most of the operations that accept series. Actually I don't know in which case it would not work. An example:

require Explorer.DataFrame, as: DF

df = DF.new(a: [1, 2, 3, 4, 5], b: ["a", "b", "c", "d", "e"])

s = Explorer.Series.from_list([1, 3, 2, 4, 6])
DF.filter(df, a < ^s)

#Explorer.DataFrame<
  Polars[2 x 2]
  a integer [2, 5]
  b string ["b", "e"]
>

And with the opposite:

DF.filter(df, ^s < a)

#Explorer.DataFrame<
  Polars[1 x 2]
  a integer [3]
  b string ["c"]
>

lessless commented 2 months ago

Hey folks! Any progress on this?

josevalim commented 2 months ago

@lessless what do you exactly you need? We already have lead and lag via shift: https://hexdocs.pm/explorer/Explorer.Series.html#shift/2

elixir-explorer / explorer

Rolling window / group_by_dynamic() functionality #493