Open martindurant opened 6 months ago
Thanks @martindurant
I think exploring integrations on a per-row level for our List/FixedSizeList/Tensor types could be interesting. A slicing API for these types would be quite useful I think.
Does an integration require adoption of the awkward in-memory model though? That might be challenging because Daft has its own in-memory data representations as well.
As long as you can pass and consume [py]arrow, all is good! That's exactly what the polars integration does, which also uses arrow as the internal representation. And this would be per-row or per-partition operations.
> I think exploring integrations on a per-row level for our List/FixedSizeList/Tensor
I should add that the typical use case is for more deeply nested things, containing lists and structs many levels down. You can already do certain things with the existing list type. For example, a geometry object may be represented as a list of lists of records, array<[[x, y]]>, or as a record of lists of lists, array<[[x]], [[y]]>.
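To make the two geometry layouts concrete, here is a minimal pure-Python sketch, not awkward's or Daft's API. The toy coordinates and the helper names `to_record_of_lists` and `to_list_of_records` are made up for illustration; the point is only that both layouts carry the same information and can be converted into each other.

```python
# Hypothetical toy data: two geometries, each a list of rings,
# each ring a list of {x, y} points (the "list of lists of records" layout).
list_of_records = [
    [[{"x": 0.0, "y": 0.0}, {"x": 1.0, "y": 0.0}, {"x": 1.0, "y": 1.0}]],
    [[{"x": 2.0, "y": 2.0}, {"x": 3.0, "y": 2.0}], [{"x": 5.0, "y": 5.0}]],
]


def to_record_of_lists(geometries):
    """Split the records into one nested list per field ("record of lists of lists")."""
    return {
        field: [[[pt[field] for pt in ring] for ring in geom] for geom in geometries]
        for field in ("x", "y")
    }


def to_list_of_records(columns):
    """Zip the per-field nested lists back into nested lists of {x, y} records."""
    return [
        [[{"x": x, "y": y} for x, y in zip(ring_x, ring_y)]
         for ring_x, ring_y in zip(geom_x, geom_y)]
        for geom_x, geom_y in zip(columns["x"], columns["y"])
    ]


record_of_lists = to_record_of_lists(list_of_records)
print(record_of_lists["x"])
```

The nesting structure (how many rings per geometry, how many points per ring) is identical for `x` and `y`, which is what lets a columnar library store it once and share it between the fields.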
Our project at https://github.com/intake/awkward-pandas is building integrations to dataframe libraries, allowing vectorised processing of nested, variable-length data structures via awkward, i.e., deeper data types than are usually handled by dataframe APIs.
Awkward itself boasts, aside from vectorised kernels:
Its internal memory model is very close to arrow's and interoperates nicely with it.
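As a rough illustration of why the two memory models line up, both arrow's variable-size list layout and awkward's list arrays describe a column of variable-length lists with a flat content buffer plus an offsets buffer. The sketch below mimics that idea with plain Python lists; the helper names are hypothetical and this is not how either library is implemented internally.

```python
def to_offsets_content(nested):
    """Flatten a list of variable-length lists into (offsets, content)."""
    offsets, content = [0], []
    for row in nested:
        content.extend(row)
        offsets.append(len(content))  # end of this row == start of the next
    return offsets, content


def get_row(offsets, content, i):
    """Recover row i as a slice of the flat content buffer."""
    return content[offsets[i]:offsets[i + 1]]


def row_lengths(offsets):
    """Per-row lengths computed from the offsets alone, without touching content."""
    return [stop - start for start, stop in zip(offsets, offsets[1:])]


offsets, content = to_offsets_content([[1, 2, 3], [], [4, 5]])
print(offsets, content)
```

Because the layout is just flat buffers, handing data from one library to the other can often reuse those buffers rather than rebuilding row objects.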
As the name suggests, we trialed awkward-pandas for pandas first, but now we have integrations of varying completeness:
These integrations follow the accessor pattern: df.ak.* or series.ak.* gives you the awkward namespace and slicing, e.g.:
So, integrating with daft would be the first chance to bring this style of processing to Ray. As with dask.dataframe, making calls that work row-wise or partition-wise is easy; aggregations require first intra-partition operations, followed by some tree reduction; and inter-partition operations are simply hard. (The dask-awkward project shows we are working on this too, from the other direction. This would not be where to start any cooperation!)
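The aggregation pattern described above (an intra-partition step followed by a tree reduction) can be sketched in plain Python. This is only an illustration under stated assumptions: partitions are modelled as lists of variable-length rows, the function names are hypothetical, and nothing here is Daft, Ray, or dask API. It computes the mean over all leaf values and assumes at least one partition.

```python
def partial_stats(partition):
    """Intra-partition step: (total, count) over every leaf value in the nested rows."""
    total = sum(sum(row) for row in partition)
    count = sum(len(row) for row in partition)
    return total, count


def combine(a, b):
    """Merge two partial results; associative, so it can be applied in any grouping."""
    return a[0] + b[0], a[1] + b[1]


def tree_reduce(partials):
    """Combine partial results pairwise, level by level, until one remains."""
    level = list(partials)
    while len(level) > 1:
        paired = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element is carried up to the next level unchanged
            paired.append(level[-1])
        level = paired
    return level[0]


# Hypothetical data: three partitions of variable-length rows.
partitions = [
    [[1, 2, 3], []],
    [[4, 5]],
    [[6], [7, 8]],
]
total, count = tree_reduce([partial_stats(p) for p in partitions])
mean = total / count
print(total, count, mean)
```

In a distributed setting each `partial_stats` call would run where its partition lives, and only the small `(total, count)` pairs would move between workers during the reduction.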