elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

Missing Window Functions from Polars #589

Open guarilha opened 1 year ago

guarilha commented 1 year ago

Polars has the following "rolling" (Explorer calls them window) functions:

Although some of the rolling computations are already available in Explorer, the absence of a rolling_apply equivalent makes it less convenient to calculate certain statistics and models.

As a result, users are forced to resort to workarounds that are far from ideal. For example, one could calculate the 21-day rolling standard deviation by using additional columns and a combination of existing functions. However, for users familiar with Pandas, this approach can feel unusual.

Is there any plan to support rolling_apply in Explorer, or am I overlooking something?

josevalim commented 1 year ago

I don't think we can support rolling_apply because it is not possible to call Erlang from C/Rust without using message passing. As far as I can see, the Python version linked is fully implemented in C.

Can you provide a more concrete example that you are trying to address and how you are addressing it? Perhaps we can provide higher level conveniences without having it named rolling_apply itself?

guarilha commented 1 year ago

Sure!

I have a dataframe with daily returns from stocks and I need the 21-day rolling window of volatility (std dev) and correlation among these series.

My initial solution was similar to this:

require Explorer.Series, as: S

df = Explorer.Datasets.iris()
window_size = 3

S.to_enum(df[:sepal_length])
|> Enum.reduce({[], []}, fn e, {head, acc} ->
  head = head ++ [e]

  acc =
    if Enum.count(head) < window_size do
      acc ++ [nil]
    else
      acc ++
        [
          head
          |> Enum.reverse()
          |> Enum.take(window_size)
          |> S.from_list()
          |> S.standard_deviation()
        ]
    end

  {head, acc}
end)
|> elem(1)

I'm presenting this here so that the journey of how to implement this is documented as well; I hope it helps. This works for smaller dataframes, but performance takes a huge hit on larger ones, because every `++` copies the list on its left and `head` keeps growing with each element.
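For reference, the same sliding-window idea can be sketched in plain Elixir without the repeated list appends, using `Enum.chunk_every/4` with a step of 1. This is only a sketch: the `RollingSketch` module and its function names are made up for illustration and are not part of Explorer, and it assumes the sample standard deviation (n - 1 denominator) is the one wanted.

```elixir
defmodule RollingSketch do
  # Hypothetical helper module, not part of Explorer.

  # Sample standard deviation (n - 1 denominator) of a list of numbers.
  def sample_std(values) do
    n = length(values)
    mean = Enum.sum(values) / n
    sum_sq = values |> Enum.map(&:math.pow(&1 - mean, 2)) |> Enum.sum()
    :math.sqrt(sum_sq / (n - 1))
  end

  # Rolling sample standard deviation over a plain list.
  # Positions that do not yet have a full window are nil.
  def rolling_std(values, window_size) when window_size >= 2 do
    padding = List.duplicate(nil, min(window_size - 1, length(values)))

    stds =
      values
      # Overlapping windows: advance by 1, drop incomplete trailing chunks.
      |> Enum.chunk_every(window_size, 1, :discard)
      |> Enum.map(&sample_std/1)

    padding ++ stds
  end
end

RollingSketch.rolling_std([1.0, 2.0, 3.0, 4.0], 3)
# => [nil, nil, 1.0, 1.0]
```

The result can be turned back into a series with `S.from_list/1`. Each window is still materialized as its own list, so this only removes the quadratic appends; it does not make the computation native.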

So we got to a solution that looks like this:

df = Explorer.Datasets.iris()
window_size = 3
max_offset = S.size(df[:sepal_length]) - window_size

0..max_offset
|> Stream.map(&S.slice(df[:sepal_length], &1, window_size))
|> Stream.map(&S.standard_deviation/1)
|> Stream.chunk_every(1)
|> Stream.map(&S.from_list/1)
|> Enum.reduce(S.from_list([]), &S.concat(&2, &1))

If you have any pointers on this approach it would be of great help.

Some things I need to calculate over rolling windows:

Thanks!

josevalim commented 1 year ago

Maybe we could have a Series.window_map(series, callback) function? The callback receives sliced series and it must return numbers, or something that we can convert to a series again later?

Btw, I think your implementation could be:

0..max_offset
|> Stream.map(&S.slice(df[:sepal_length], &1, window_size))
|> Stream.map(&S.standard_deviation/1)
|> Enum.to_list()
|> S.from_list()

but I am not sure.
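To make the idea concrete, a window_map along those lines could be sketched outside Explorer roughly as below. This is an assumption-laden sketch, not an Explorer API: the `WindowMapSketch` module, the extra `window_size` argument, and the choice to pad the first positions with nil are all hypothetical. It only relies on `Series.size/1`, `Series.slice/3`, and `Series.from_list/1`, and the `Mix.install` line assumes it is run as a standalone script with a recent Explorer release.

```elixir
# Assumes a standalone script; the version requirement is an assumption.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.Series, as: S

defmodule WindowMapSketch do
  alias Explorer.Series, as: S

  # Hypothetical helper: calls `fun` on every full window of `series`
  # and builds a new series from the returned values.
  # Positions without a full window are filled with nil.
  def window_map(series, window_size, fun) when window_size >= 1 do
    size = S.size(series)
    max_offset = size - window_size
    padding = List.duplicate(nil, min(window_size - 1, size))

    # `//1` keeps the range empty when the series is shorter than the window.
    values =
      for offset <- 0..max_offset//1 do
        fun.(S.slice(series, offset, window_size))
      end

    S.from_list(padding ++ values)
  end
end

series = S.from_list([1.0, 2.0, 3.0, 4.0])
WindowMapSketch.window_map(series, 3, &S.standard_deviation/1)
```

The callback still runs in Elixir once per window, so this is a convenience rather than a performance win; each `slice` and the reduction inside the callback are the only parts that run natively.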

josevalim commented 1 year ago

Would you like to send a PR for Series.window_map btw?

guarilha commented 1 year ago

Created this PR to explore the codebase a bit and test the waters. Waiting for review on it to make sure everything is ok. After that, I plan on adding a bunch of functions that I need as well. Hope it helps.

mrcwinn commented 1 year ago

Hi, I also have a use case for this. Here's equivalent code in Python:

df['atl'] = df['tss'].rolling(window=7).apply(lambda x: calculate_atl_recursive(x))

I can't solve this with the current package API, unless I'm missing something.

Thank you!

EDIT: I just realized who I'm in a thread with (famous people). Extra thank you for all your work.