Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[FEAT] Basic global expressions within expression tree #1979

Open kevinzwang opened 6 months ago

kevinzwang commented 6 months ago

Support for use of global expressions in any part of the expression tree. Enables use of global expressions in all remaining dataframe operations such as select, with_column, and filter, as well as more complicated usages such as df.agg((col('x')-col('x').mean()).sum()).
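To make the target semantics concrete, here is a minimal plain-Python sketch (not Daft code) of what an expression like (col('x') - col('x').mean()).sum() has to compute when the column is split across partitions. The list-of-lists partition layout and the helper name `centered_sum` are assumptions made for illustration only.

```python
from typing import List


def centered_sum(partitions: List[List[float]]) -> float:
    """Evaluate sum(x - mean(x)) over a column stored as several partitions.

    Hypothetical helper: it only illustrates the semantics, not Daft's engine.
    """
    # Pass 1: the inner global expression mean(x) needs data from every partition.
    total = sum(sum(part) for part in partitions)
    count = sum(len(part) for part in partitions)
    global_mean = total / count  # assumes at least one row overall

    # Pass 2: the outer aggregation reuses the single global mean on each row.
    return sum(x - global_mean for part in partitions for x in part)


partitions = [[1.0, 2.0], [3.0, 6.0]]
result = centered_sum(partitions)
```

The point of the sketch is that the inner mean must be resolved over the whole column before the outer sum can be evaluated, so a single per-partition pass is not enough.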

dsaad68 commented 4 months ago

We didn't have this problem on version 0.2.19. When we tried to bump from 0.2.19 to 0.2.23 on Python 3.11.9, this error was raised:

cuallee\__init__.py:1199: in validate
    return self.compute_engine.summary(self, dataframe)
cuallee\daft_validation.py:513: in summary
    unified_results = {
cuallee\daft_validation.py:514: in <dictcomp>
    rule.key: [operator.methodcaller(rule.method, rule, dataframe)(compute)]
cuallee\daft_validation.py:98: in has_min
    return dataframe.select(perdicate).to_pandas().iloc[0, 0] == rule.value
.venv\Lib\site-packages\daft\api_annotations.py:26: in _wrap
    return timed_method(*args, **kwargs)
.venv\Lib\site-packages\daft\analytics.py:189: in tracked_method
    result = method(*args, **kwargs)
.venv\Lib\site-packages\daft\dataframe\dataframe.py:662: in select
    builder = self._builder.select(self.__column_input_to_expression(columns))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = * Source:
|   Number of partitions = 1
|   Output schema = id#Int32

to_select = [min(col(id))]

    def select(
        self,
        to_select: list[Expression],
    ) -> LogicalPlanBuilder:
        to_select_pyexprs = [expr._expr for expr in to_select]
>       builder = self._builder.select(to_select_pyexprs)
E       daft.exceptions.DaftCoreException: DaftError::ValueError Aggregation expressions are not currently supported in project: min(col(id))
E       If you would like to have this feature, please see https://github.com/Eventual-Inc/Daft/issues/1979#issue-2170913383

I have documented our case here.

kevinzwang commented 4 months ago

Hi @dsaad68! Thanks for bringing this up to us. Eventually we will support this pattern, but currently, if you use an aggregation expression outside of an agg, it may not actually have the behavior you intend, since it applies the expression to each partition individually instead of the entire column. That is why I disabled it and added that error message.
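As a rough illustration of the partitioning problem described above, the following plain-Python sketch (not Daft code) contrasts evaluating min() inside each partition with combining the partial results into one value. The partition contents and both function names are made up for the example.

```python
from typing import List


def per_partition_min(partitions: List[List[int]]) -> List[int]:
    # What a naive projection would do: evaluate min() inside each partition.
    return [min(part) for part in partitions if part]


def global_min(partitions: List[List[int]]) -> int:
    # What the user expects: combine the partial results into a single value.
    return min(per_partition_min(partitions))


partitions = [[5, 7], [2, 9], [4]]
print(per_partition_min(partitions))  # [5, 2, 4]
print(global_min(partitions))         # 2
```

With a single partition the two results coincide, which is presumably why the pattern can appear to work on small inputs.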

A valid way to do something like computing the min of a column would be either dataframe.select(predicate).min() as you mentioned, or if you still want to use aggregation expressions, you can put them in an agg, like

dataframe.agg(predicate.min())

Hope that helps, and glad to see that there is interest in support for this kind of operation!

universalmind303 commented 1 month ago

couldn't we evaluate the aggregate, then repeat the value over the other column's length?

from daft import col
import daft

df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})

df = df.select(
  col('a').sum().alias('sum'),
  col('a')
).collect()

print(df)
┌─────┬─────┐
│ sum ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 6   ┆ 1   │
│ 6   ┆ 2   │
│ 6   ┆ 3   │
└─────┴─────┘

jaychia commented 1 month ago

> couldn't we evaluate the aggregate, then repeat the value over the other column's length?

Yes, we could, but I think it would require a bit of logical plan rewriting and maybe a broadcast logical plan:

Project([col('a').sum().alias('sum'), col("a")])

Project([col("a")]) ------------------------------------>  
Project([col('a').sum().alias('sum')]) --- broadcast -----/

We don't quite have this machinery yet, but I believe it shouldn't be difficult to build.
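A minimal plain-Python sketch of the rewrite idea, assuming partitions are modelled as dicts of column lists: one branch aggregates the column across all partitions, the other keeps the rows, and the scalar is then broadcast to each partition's length. The function name `select_with_broadcast_sum` and the data layout are hypothetical and do not reflect Daft's actual plan nodes.

```python
from typing import Dict, List

Partition = Dict[str, List[int]]


def select_with_broadcast_sum(partitions: List[Partition], column: str) -> List[Partition]:
    # Branch 1: aggregate the column across all partitions
    # (a partial sum per partition, then a final sum of the partials).
    total = sum(sum(part[column]) for part in partitions)

    # Branch 2 plus broadcast: keep the original rows and repeat the
    # scalar so it matches the row count of each partition.
    return [
        {"sum": [total] * len(part[column]), column: list(part[column])}
        for part in partitions
    ]


partitions = [{"a": [1, 2]}, {"a": [3]}]
result = select_with_broadcast_sum(partitions, "a")
# result == [{"sum": [6, 6], "a": [1, 2]}, {"sum": [6], "a": [3]}]
```

Every partition ends up with the same global value, which matches the table in the example above regardless of how the rows are split.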