elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

Add `DF.transform` #912

Closed billylanchantin closed 4 months ago

billylanchantin commented 6 months ago

Description

Adds `DF.transform/3`, which is analogous to `S.transform/2`. I've needed a version of this function many times in my own work.

Example

alias Explorer.DataFrame, as: DF

df = DF.new(
  numbers: [1, 2],
  datetime_local: [~N[2024-01-01 00:00:00], ~N[2024-01-01 00:00:00]],
  timezone: ["Etc/UTC", "America/New_York"]
)

DF.transform(df, [names: ["datetime_local", "timezone"]], fn row ->
  datetime_utc =
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")

  %{datetime_utc: datetime_utc}
end)

# #Explorer.DataFrame<
#   Polars[2 x 4]
#   numbers s64 [1, 2]
#   datetime_local naive_datetime[μs] [2024-01-01 00:00:00.000000, 2024-01-01 00:00:00.000000]
#   timezone string ["Etc/UTC", "America/New_York"]
#   datetime_utc datetime[μs, Etc/UTC] [2024-01-01 00:00:00.000000Z, 2024-01-01 05:00:00.000000Z]
# >
josevalim commented 6 months ago

This operation is doing three things at the moment (a rough sketch follows the list):

  1. selecting
  2. converting to rows
  3. merging the columns (which we call concat_columns)
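
A hand-rolled sketch of that decomposition using existing Explorer functions might look like the following. This is illustrative only, not the proposed implementation; it reuses the column names from the example above and assumes `DF.new/1` accepts the time-zone-aware `DateTime` structs produced here.

# 1. selecting: keep only the columns the callback needs
selected = DF.select(df, ["datetime_local", "timezone"])

# 2. converting to rows: materialize them as Elixir maps
rows = DF.to_rows(selected)

# apply the per-row function to build the new column
datetime_utc =
  Enum.map(rows, fn row ->
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")
  end)

# 3. merging the columns: concat the derived column back onto the original df
DF.concat_columns([df, DF.new(datetime_utc: datetime_utc)])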

It also has the limitation that it computes only a single column. For example, instead we could have:

DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
  [
    datetime_utc:
      row["datetime_local"]
      |> DateTime.from_naive!(row["timezone"])
      |> DateTime.shift_zone!("Etc/UTC")
  ]
end)
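
Since the callback returns a keyword list there, it could in principle emit several derived columns at once. A hypothetical variant (the extra `utc_offset_seconds` column is purely illustrative and not from the PR):

DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
  # Attach the row's time zone, then derive several columns from it at once.
  local = DateTime.from_naive!(row["datetime_local"], row["timezone"])

  [
    datetime_utc: DateTime.shift_zone!(local, "Etc/UTC"),
    utc_offset_seconds: local.utc_offset
  ]
end)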

We could also emit a custom row struct that accepts both string and atom keys and converts fields as necessary. For example, imagine we had a %Explorer.DataFrame.Row{index: index, df: df}. When you called row["datetime_local"], it would fetch that particular column and access it at that index. Does Polars guarantee constant-time access to all of its rows? If it does, then we can provide both atom/string ergonomics and only convert the necessary keys lazily.
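
A minimal sketch of what such a lazy row struct could look like, assuming the `Access` behaviour is used to support the `row[...]` syntax and relying on the existing `Explorer.DataFrame.pull/2` and `Explorer.Series.at/2` functions. The module name and fields are hypothetical, not part of the current API:

defmodule Explorer.DataFrame.Row do
  # Hypothetical lazy view over one row of a dataframe.
  defstruct [:df, :index]

  @behaviour Access

  @impl Access
  def fetch(%__MODULE__{df: df, index: index}, key) do
    # Accept atom or string keys; only the requested cell is converted.
    name = to_string(key)

    if name in Explorer.DataFrame.names(df) do
      value =
        df
        |> Explorer.DataFrame.pull(name)
        |> Explorer.Series.at(index)

      {:ok, value}
    else
      :error
    end
  end

  # Rows are read-only views, so the write callbacks are unsupported.
  @impl Access
  def get_and_update(_row, _key, _fun), do: raise(ArgumentError, "rows are read-only")

  @impl Access
  def pop(_row, _key), do: raise(ArgumentError, "rows are read-only")
end

# With this in place, given row = %Explorer.DataFrame.Row{df: df, index: 1},
# both row["datetime_local"] and row[:datetime_local] would work.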

josevalim commented 6 months ago

However, we should benchmark the approaches. The lazy one may end up being less efficient if we do too many trips to Rust. We should certainly have a single operation to access a given column+row.

billylanchantin commented 6 months ago

José I didn't want to use my brain today :P

> It also has the limitation that it computes only a single column.

👍 Yeah we could definitely get multiple columns with concat_columns. Great suggestion.

EDIT: removed a comment about validation.

> Does Polars guarantee constant-time access to all of its rows?

I don't think so, but I'm not sure. I couldn't find a definitive answer in the docs.

They seem to support several kinds of index-based access and I'm not sure which is the "right" one. Following some source code led me to this file:

If this is the right place, I see several references to binary searches. That makes me think it's $O(k \cdot \log n)$. Maybe they can get good amortized performance?

> However, we should benchmark the approaches. The lazy one may end up being less efficient if we do too many trips to Rust. We should certainly have a single operation to access a given column+row.

Yeah, definitely some benchmarks are in order. I suspect the most expensive part is the deserialization step required to feed the Elixir functions. I'll try your lazy approach and get back with some numbers.
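
A possible starting point for that comparison, assuming the Benchee package is available: it pits the eager `to_rows/1` path against per-cell `Series.at/2` access on a single column. The sizes and the column are arbitrary and the numbers below are not from the PR.

df = DF.new(x: Enum.to_list(1..100_000))
n = DF.n_rows(df)

Benchee.run(%{
  # Deserialize every row into an Elixir map up front.
  "eager to_rows" => fn ->
    df |> DF.to_rows() |> Enum.map(& &1["x"])
  end,
  # Pull the column once and fetch each cell individually.
  "lazy per-cell" => fn ->
    series = DF.pull(df, "x")
    Enum.map(0..(n - 1), &Explorer.Series.at(series, &1))
  end
})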

I also want to try to leverage Arrow's chunking. If deserializing a single chunk is fast, it may be worth parallelizing over chunks on the Elixir side rather than trying to trick Polars into doing what we want. IDK how easy that level of control will be, though.

josevalim commented 6 months ago

My understanding from the Rust code is that they do a binary search only if there are several chunks. What we may want to do is rechunk the dataframe before using it. Another potential concern here is doing the bounds check on every operation, but they do have an `_unchecked` version.