[FEAT] Aggregations on List Types

samster25 commented 8 months ago

We should support the following aggregations in the list type namespace:

col('x').list.sum()

[x] Sum
[x] Mean
[x] Min
[x] Max
[ ] Distinct (List of unique elements)
[ ] Count_distinct (number of unique elements)
[ ] Flatten (Flattens List[List[T]] to List[T])
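
To make the intended per-row semantics concrete, here is a minimal plain-Python sketch of what each aggregation would return for a single list value. It is not Daft code and says nothing about how Daft implements these. The helper names (`list_distinct`, `list_count_distinct`, `list_flatten`) are hypothetical, and keeping first-seen order in `list_distinct` is an assumption.

```python
from itertools import chain
from statistics import fmean


# Hypothetical helpers: plain-Python stand-ins for the proposed list
# aggregations, each applied to one row (one list value) at a time.
def list_distinct(values):
    """Unique elements, keeping first-seen order (the ordering is an assumption)."""
    return list(dict.fromkeys(values))


def list_count_distinct(values):
    """Number of unique elements in the list."""
    return len(set(values))


def list_flatten(nested):
    """Flatten List[List[T]] into List[T] by one level."""
    return list(chain.from_iterable(nested))


# Each inner list plays the role of one row of a list-typed column.
rows = [[1, 1, 2], [3], [4, 4, 4, 5]]

per_row = [
    {
        "sum": sum(r),
        "mean": fmean(r),
        "min": min(r),
        "max": max(r),
        "distinct": list_distinct(r),
        "count_distinct": list_count_distinct(r),
    }
    for r in rows
]

# One row of a List[List[Int64]] column, flattened to List[Int64].
flattened = list_flatten([[1, 2], [3], []])
```

Null handling (null lists, null elements, empty lists) is left out of this sketch and would need to be decided separately.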

samster25 commented 8 months ago

@nsalerni Can you let me know if I missed anything?

nsalerni commented 8 months ago

@samster25 This covers a good chunk of the use case. Two others I can think of:

The one I can see missing from the above list would be apply() (i.e. being able to apply some form of custom logic to a list column). It seems like that's covered by https://github.com/Eventual-Inc/Daft/issues/1976?
I'm not sure if the above would implicitly allow us to support the following, but this would be another simplified example of a use case I'd like to support:

import daft
from daft import col

df = daft.from_pydict({
    "strings": ["a", "b", "c", "d"],
    "lists": [[1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2], [2, 2, 2]],
})

df.groupby('lists').agg([
    (col("lists").alias("list_count"), 'count')
]).collect()

I'd imagine the output of this looking something like:

lists (Int64) | list_count (UInt64)
------------- | -------------------
[2, 2, 2]     | 2
[1, 1, 1, 1]  | 2

Today this yields:

PanicException: List(Int64) not implemented
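
For reference, the grouping semantics described above can be sketched in plain Python. This is not Daft code and not a proposed implementation, only an illustration of the expected counts; converting each list to a tuple so it can serve as a hashable group key is an assumption of this sketch.

```python
from collections import Counter

data = {
    "strings": ["a", "b", "c", "d"],
    "lists": [[1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2], [2, 2, 2]],
}

# Python lists are unhashable, so each list is converted to a tuple
# before being used as a group key.
counts = Counter(tuple(row) for row in data["lists"])

# Convert the keys back to lists: one (lists, list_count) pair per group.
result = [(list(key), n) for key, n in counts.items()]
```

Here `result` holds one entry per distinct list with its row count. The order of groups follows first appearance in this sketch, whereas a query engine would not necessarily guarantee any row order.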

kevinzwang commented 8 months ago

Hi @nsalerni! I just made a new issue to track the work on grouping by list columns: https://github.com/Eventual-Inc/Daft/issues/1983

[FEAT] Aggregations on List Types #1977