Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.31k stars 160 forks

[FEAT] Aggregations on List Types #1977

Open samster25 opened 8 months ago

samster25 commented 8 months ago

We should support the following aggregations in the list type namespace:

col('x').list.sum()
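
For illustration only, here is a plain-Python sketch of the per-row semantics such an aggregation might have. The helper name `list_sum` is hypothetical and is not part of Daft, and the null handling (a null list stays null, null elements are skipped) is an assumption rather than anything this issue specifies.

```python
from typing import Optional, Sequence


def list_sum(values: Optional[Sequence[Optional[int]]]) -> Optional[int]:
    """Hypothetical reference semantics for summing one list-typed cell."""
    # Assumption: a null list produces a null result.
    if values is None:
        return None
    # Assumption: null elements inside the list are skipped.
    return sum(v for v in values if v is not None)


# One entry per row of a list column "x".
rows = [[1, 2, 3], [4, 5], None, [1, None, 2]]
sums = [list_sum(row) for row in rows]
print(sums)
```

Each row's list collapses to a single value, so the output column has the same number of rows as the input.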

samster25 commented 8 months ago

@nsalerni Can you let me know if I missed anything?

nsalerni commented 8 months ago

@samster25 This covers a good chunk of the use case. Two others I can think of:

  1. The one I can see missing from the above list is apply() (i.e. being able to apply some form of custom logic to a list column). It seems like that's covered by https://github.com/Eventual-Inc/Daft/issues/1976?

  2. I'm not sure if the above would implicitly allow us to support the following, but this would be another simplified example of a use case I'd like to support:

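import daft
from daft import col

# Two distinct list values, each appearing twice.
df = daft.from_pydict({
    "strings": ["a", "b", "c", "d"],
    "lists": [[1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2], [2, 2, 2]],
})

# Group by the list column itself and count the rows in each group.
df.groupby('lists').agg([
    (col("lists").alias("list_count"), 'count')
]).collect()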

I'd imagine the output of this looking something like:

lists (Int64) | list_count (UInt64)
------------- | -------------------
[2, 2, 2]     | 2
[1, 1, 1, 1]  | 2

Today this yields:

PanicException: List(Int64) not implemented
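
As a rough sketch of the grouping semantics described above, the expected counts can be reproduced in plain Python. This does not use Daft and says nothing about how Daft would implement it; the helper name `count_by_list` is hypothetical, and converting each list to a tuple is just one way to get a hashable group key.

```python
from collections import Counter


def count_by_list(lists):
    """Hypothetical helper: count how many rows share each list value."""
    # Lists are unhashable, so use tuples as group keys.
    counts = Counter(tuple(item) for item in lists)
    # Convert keys back to lists for display.
    return [(list(key), n) for key, n in counts.items()]


lists = [[1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2], [2, 2, 2]]
result = count_by_list(lists)
for key, n in result:
    print(key, n)
```

Row order in a real grouped result is not guaranteed, so only the (list, count) pairs matter here.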

kevinzwang commented 8 months ago

Hi @nsalerni ! I just made a new issue to track the work on grouping by list columns: https://github.com/Eventual-Inc/Daft/issues/1983