Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.79k stars 108 forks source link

Accessor registration #2253

Open martindurant opened 3 weeks ago

martindurant commented 3 weeks ago

Our project at https://github.com/intake/awkward-pandas is building integrations to dataframe libraries, allowing vectorised processing of nested, variable-length data structures via awkward, i.e., deeper data types than usually handles by dataframe APIs.

Awkward itself boasts, aside from vectorised kernels:

It's internal memory model is very close to arrow and interops nicely.

As the name suggests, we trialed awkward-pandas for pandas first, but now we have integrations of various completeness:

These integrations follow the accessor pattern: df.ak.* or series.ak.* gives you the awkward namespace and slicing, e.g.:

df = ...
df["col1"].ak.sum(axis=2) # performs awkward-style sum on inner axis, returning series
df.ak[..., -1] # select last item in inner most list of every column

So, integrating with daft would be the first chance to bring this style of processing to Ray. As with dask.dataframe, making calls that work row-wise or partition-wise is easy; aggregations require first intra-partition operations, followed by some tree reduction; and inter-partition operations are simply hard. (The dask-awkward project shows we are working on this too, from the other direction. This would not be where to start any cooperation!)

jaychia commented 2 weeks ago

Thanks @martindurant

I think exploring integrations on a per-row level for our List/FixedSizeList/Tensor types could be interesting. A slicing API for these types would be quite useful I think.

Does an integration require adoption of the awkward in-memory model though? That might be challenging because Daft has its own in-memory data representations as well.

martindurant commented 2 weeks ago

As long as you can pass and consume [py]arrow, all is good! That's exactly what the polars integration does, which also uses arrow as the internal representation. And this would be per-row or per-partition operations.

martindurant commented 2 weeks ago

I think exploring integrations on a per-row level for our List/FixedSizeList/Tensor

I should add, that the typical use case is for deeper nested things, containing lists and structs many levels down. You can already do some certain things with the existing list type. For example, a geometry object may be represented as a list of list of records array<[[x, y]]> or record of lists of lists array<[[x]], [[y]]>.