This operation is doing three things at the moment:
It also has a limitation: it computes only a single column. Instead, we could have, for example:
DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
  [
    datetime_utc:
      row["datetime_local"]
      |> DateTime.from_naive!(row["timezone"])
      |> DateTime.shift_zone!("Etc/UTC")
  ]
end)
We could also emit a custom row struct that accepts both string and atom keys and converts fields as necessary. For example, imagine we had a %Explorer.DataFrame.Row{index: index, df: df}. When you called row["datetime_local"], it would get that particular column and access the element at index. Does Polars guarantee constant-time access to all of its rows? If it does, then we can provide both atom/string ergonomics and only convert the necessary keys lazily.
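Something along these lines could work (a sketch, not a concrete proposal: Explorer.DataFrame.pull/2 and Explorer.DataFrame.names/1 are existing API, while Explorer.Series.at/2 is assumed here as the single-element accessor; the name may vary by version):

defmodule Explorer.DataFrame.Row do
  # A lazy view into one row of a dataframe. Implementing the Access
  # behaviour makes both row["col"] and row[:col] work.
  defstruct [:index, :df]

  @behaviour Access

  @impl Access
  def fetch(%__MODULE__{index: index, df: df}, key) do
    # Accept atom keys by normalizing to the string column name.
    name = if is_atom(key), do: Atom.to_string(key), else: key

    if name in Explorer.DataFrame.names(df) do
      # Pull only the requested column, then read the one element we need.
      value =
        df
        |> Explorer.DataFrame.pull(name)
        |> Explorer.Series.at(index)

      {:ok, value}
    else
      :error
    end
  end

  # Rows are read-only views, so the write side of Access is unsupported.
  @impl Access
  def get_and_update(_row, _key, _fun), do: raise(ArgumentError, "rows are read-only")

  @impl Access
  def pop(_row, _key), do: raise(ArgumentError, "rows are read-only")
end

With that, transform could hand the function a %Row{} and pay the deserialization cost only for the columns and rows the function actually touches.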
However, we should benchmark the approaches. The lazy one may end up being less efficient if we do too many trips to Rust. We should certainly have a single operation to access a given column+row.
José, I didn't want to use my brain today :P
> It also has a limitation: it computes only a single column.
👍 Yeah, we could definitely get multiple columns with concat_columns. Great suggestion.
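Something like this could work (a sketch: it assumes transform would return a dataframe holding just the new columns, and that concat_columns glues dataframes together side by side):

# Sketch: each key in the returned keyword list becomes one new column,
# and concat_columns joins the result back onto the original dataframe.
new_columns =
  DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
    [
      datetime_utc:
        row["datetime_local"]
        |> DateTime.from_naive!(row["timezone"])
        |> DateTime.shift_zone!("Etc/UTC"),
      date_local: NaiveDateTime.to_date(row["datetime_local"])
    ]
  end)

DF.concat_columns([df, new_columns])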
EDIT: removed a comment about validation.
> Does Polars guarantee constant-time access to all of its rows?
I don't think so, but I'm not sure. I couldn't find a definitive answer in the docs.
They seem to support several kinds of index-based access, and I'm not sure which is the "right" one. Following some source code led me to this file:
If this is the right place, I see several references to binary searches. That makes me think $k$ row accesses cost $O(k \cdot \log n)$. Maybe they can get good amortized performance?
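To make that concrete: with a column stored as several Arrow chunks, finding a global row index means locating the right chunk first, e.g. by binary-searching the cumulative chunk offsets. An illustrative Elixir sketch (not Polars' actual code):

defmodule ChunkLookup do
  # `offsets` is a tuple of cumulative chunk lengths: chunks of sizes
  # [4, 4, 2] become {4, 8, 10}. Returns {chunk_number, local_index}.
  def locate(offsets, row_index) when is_tuple(offsets) do
    find(offsets, row_index, 0, tuple_size(offsets) - 1)
  end

  # Binary search: O(log c) for c chunks, done once per row access.
  defp find(offsets, idx, lo, hi) when lo < hi do
    mid = div(lo + hi, 2)

    if idx < elem(offsets, mid) do
      find(offsets, idx, lo, mid)
    else
      find(offsets, idx, mid + 1, hi)
    end
  end

  defp find(offsets, idx, chunk, chunk) do
    offset_before = if chunk == 0, do: 0, else: elem(offsets, chunk - 1)
    {chunk, idx - offset_before}
  end
end

# ChunkLookup.locate({4, 8, 10}, 5) #=> {1, 1}: second chunk, local index 1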
> However, we should benchmark the approaches. The lazy one may end up being less efficient if we do too many trips to Rust. We should certainly have a single operation to access a given column+row.
Yeah, definitely some benchmarks are in order. I suspect the most expensive part is the deserialization step required to feed the Elixir functions. I'll try your lazy approach and get back with some numbers.
I also want to try to leverage Arrow's chunking. If deserializing a single chunk is fast, it may be worth parallelizing over chunks on the Elixir side (see the sketch below) rather than trying to trick Polars into doing what we want. IDK how easy that level of control will be, though.
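A rough shape of that idea (a sketch: to_chunks/1 is hypothetical, since Explorer doesn't necessarily expose chunk boundaries, and DF.to_rows/1 stands in for the deserialization step):

# Hypothetical chunk-parallel transform: deserialize one chunk at a time
# and run the per-row function on each chunk concurrently.
transform_fun = fn row -> String.upcase(row["name"]) end

df
|> to_chunks()
|> Task.async_stream(
  fn chunk ->
    chunk
    |> DF.to_rows()
    |> Enum.map(transform_fun)
  end,
  ordered: true
)
|> Enum.flat_map(fn {:ok, rows} -> rows end)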
My understanding from the Rust code is that they do a binary search only if there are several chunks. What we may want to do is to rechunk the dataframe before using it. Another potential concern here is doing the bounds check on every operation, but they do have an _unchecked version.
Description
Adds DF.transform/3, which is the function analogous to S.transform/2. I've needed a version of this function many times in my own work.
Example
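An illustrative call, adapted from the discussion above (the PR's exact options and column names may differ):

# Compute a single new column from two existing ones, row by row.
DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
  row["datetime_local"]
  |> DateTime.from_naive!(row["timezone"])
  |> DateTime.shift_zone!("Etc/UTC")
end)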