elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

exposing the `fold` expressions from Polars #911

Open mhanberg opened 2 months ago

mhanberg commented 2 months ago

Description

Is it possible to expose the folds API from Polars?

I have a problem that I think can be solved via that API (I'm not entirely sure, still a beginner with Explorer).

I can try to think of a version of my problem that I can share publicly if needed.

Also, if this API is already exposed and I just missed it... please let me know 😅.

Thanks!

billylanchantin commented 2 months ago

It is not exposed (unless I missed it too!). I think it'd be a great addition. Though it looks like it'd be a good deal of work to add it so it might take a while.

> I have a problem that I think can be solved via that API (I'm not entirely sure, still a beginner with Explorer). I can try to think of a version of my problem that I can share publicly if needed.

If you want to ask on elixirforum.com, feel free to @- me and I can try to answer. My handle is the same as on GitHub.

josevalim commented 2 months ago

Oh, I didn't know we had fold. It seems it works with expressions, which means we can use the structure in Explorer.Query to fold over anything and it will be performant. I don't think it would be that complicated then! My suggestion is to call it reduce_with, to mirror map_with and friends!

billylanchantin commented 2 months ago

So it seems there's fold_exprs and reduce_exprs. The difference seems to be reduction col-wise vs. row-wise. I think we'd want to include both?

They also have a few pairs of exprs like sum and sum_horizontal. Maybe we want to call them reduce_with and reduce_with_horizontal? reduce and fold are basically synonyms to me.

Also looking over the docs, I think there's a lot of potential in exposing many of their exprs:

josevalim commented 2 months ago

Sorry, I got fold and reduce mixed up. If it is operating on the columns themselves, then we can probably add it to Explorer.Query directly. We already support column traversal via across/query.

I am more interested in the reduce version that works within a single column.

billylanchantin commented 2 months ago

> I am more interested in the reduce version that works within a single column.

Yeah agreed! It'd be super useful in summarise.
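For reference, a custom reduction within a single column currently means pulling the values out of the series and folding over them in plain Elixir. The sketch below shows that workaround; the Explorer version requirement and the digit-building reducer are assumptions chosen for illustration, not anything proposed in this thread.

```elixir
# Assumption: any reasonably recent Explorer release; adjust as needed.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.Series

s = Series.from_list([1, 2, 3])

# The fold runs in the BEAM, not in Polars: the series is first
# materialised as an Elixir list, then reduced with Enum.reduce/3.
digits =
  s
  |> Series.to_list()
  |> Enum.reduce(0, fn x, acc -> acc * 10 + x end)
```

Because the data leaves the backend, this cannot be used inside a query such as summarise, which is the gap a column-wise reduce would fill.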

> We already support column traversal via across/query.

If I'm reading this correctly (I've not confirmed it yet), then the reduce_with_horizontal reduces across the columns:

df = DF.new(a: [1, 2, 3], b: [10, 20, 30], c: [100, 200, 300])

+--------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 3]  |
+--------------+--------------+--------------+
|      a       |      b       |      c       |
|    <s64>     |    <s64>     |    <s64>     |
+==============+==============+==============+
| 1            | 10           | 100          |
+--------------+--------------+--------------+
| 2            | 20           | 200          |
+--------------+--------------+--------------+
| 3            | 30           | 300          |
+--------------+--------------+--------------+

mutate(df, sum: reduce_horizontal(cols(), 0, fn col, acc ->
  col + acc
end))

+-------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 4] |
+----------+----------+----------+----------+
|    a     |    b     |    c     |   sum    |
|  <s64>   |  <s64>   |  <s64>   |  <s64>   |
+==========+==========+==========+==========+
| 1        | 10       | 100      | 111      |
+----------+----------+----------+----------+
| 2        | 20       | 200      | 222      |
+----------+----------+----------+----------+
| 3        | 30       | 300      | 333      |
+----------+----------+----------+----------+
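The reduce_horizontal call above is a proposed API, not one that exists. A minimal sketch of how the same result can be computed with functions Explorer already exposes, assuming a recent Explorer release (the version requirement below is an assumption):

```elixir
# Assumption: any reasonably recent Explorer release; adjust as needed.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.DataFrame, as: DF
alias Explorer.Series

df = DF.new(a: [1, 2, 3], b: [10, 20, 30], c: [100, 200, 300])

# Capture the column names up front so the callback only handles lazy series.
names = DF.names(df)

result =
  DF.mutate_with(df, fn ldf ->
    # ldf[name] is a lazy series, so Series.add/2 builds an expression
    # rather than computing eagerly. Folding over the columns therefore
    # produces a single `a + b + c`-style expression for the backend.
    total =
      names
      |> Enum.map(&ldf[&1])
      |> Enum.reduce(&Series.add/2)

    [sum: total]
  end)
```

Here Enum.reduce/2 uses the first column as the initial accumulator, so no explicit 0 is needed; an explicit starting value would correspond to the three-argument form.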

Our comprehensions only make the same call to mutate/filter/etc. with different columns more ergonomic. This would let you actually compute multi-column things.

In fact, I wonder if we could make the :reduce option to `for` syntactic sugar for this?... 🤔

josevalim commented 2 months ago

> In fact, I wonder if we could make the :reduce option to `for` syntactic sugar for this?... 🤔

We certainly could but perhaps @cigrainger has ideas on the API for this. @cigrainger, can we "fold" across columns in dplyr?

jsonbecker commented 1 month ago

The equivalent in dplyr would be accomplished with something like this:

df |>
  mutate(sum(c_across(starts_with("Bud"))))

It's kind of gross, but quite similar to mutate(df, sum: reduce_horizontal(...))

There used to be a rowwise() wrapper that also felt a bit off.