apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: DataFrame `groupby` does not support named aggregation #27278

Open robmoore opened 1 year ago

robmoore commented 1 year ago

What happened?

Attempting to use named aggregation in a groupby raises a TypeError (TypeError: DeferredGroupBy.agg() missing 1 required positional argument: 'fn').

Example case:

# Same error occurs when using explicit pd.NamedAggs instead of tuples
df.groupby(['quarter', 'program']).agg(total_spend=('revenue', 'sum'), avg_spend=('revenue', 'mean'))
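For comparison, here is a minimal sketch of the same call in plain pandas, where named aggregation is supported. The sample data is hypothetical; only the column names come from the report above. It shows the output this issue would presumably expect from the Beam DataFrame API.

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the report.
df = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "program": ["a", "a", "a", "b"],
    "revenue": [10.0, 30.0, 5.0, 7.0],
})

# Named aggregation: each keyword becomes an output column,
# and each value is a (source column, aggregation) pair.
result = df.groupby(["quarter", "program"]).agg(
    total_spend=("revenue", "sum"),
    avg_spend=("revenue", "mean"),
)

print(result)
```

The result is indexed by (quarter, program) and has exactly two columns, total_spend and avg_spend.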

Issue Priority

Priority: 3 (minor)

tvalentyn commented 11 months ago

Thanks for reporting, it should be possible to support this - would you be interested in taking a closer look and contributing a PR?

SiddharthJadhav99 commented 10 months ago

@tvalentyn can you assign this issue to me? I'll be able to support this.

tvalentyn commented 10 months ago

hi @SiddharthJadhav99 just checking if you have any questions or need help.

SiddharthJadhav99 commented 10 months ago

Hey @tvalentyn & @robmoore, it would be really helpful if you could send sample code that replicates the error, or elaborate a little on this bug.

robmoore commented 10 months ago

@SiddharthJadhav99 Please see the example in the "pd.NamedAgg examples for Beam issue 27278" Colab notebook. The error is replicated in the section entitled "Example using Beam Interactive".

SiddharthJadhav99 commented 9 months ago

Hey @tvalentyn, I tried to solve this issue but was unable to do so. You may unassign me from this issue. Thanks @robmoore for your cooperation and help!

artemyushko commented 8 months ago

.take-issue

vineetg3 commented 6 months ago

Hi @artemyushko, are you working on this issue as of today?

tvalentyn commented 6 months ago

Given that we haven't heard from @artemyushko for a while I'll go ahead and unassign the issue. @artemyushko please don't hesitate to take it again if/when you plan to continue working on this.

artemyushko commented 6 months ago

Hi @tvalentyn, I looked into this issue a while ago, and it turns out that DataFrameGroupBy.groupby does not really support tuples the same way NamedAgg in pandas does, and I haven't figured out a solution to that. I have been trying to make DataFrameGroupBy represent the SQL call of f(column) as my_column_name, but I had no success. If anybody is willing to take this further, please feel free to!
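To illustrate the "f(column) as my_column_name" reading of named aggregation, here is a minimal sketch in plain pandas that expands each name=(column, func) pair into a single-column aggregation and reassembles the results under the requested names. The helper named_agg_via_concat is hypothetical and is not part of pandas or Beam. Whether each of these steps works on Beam's deferred dataframes is an assumption that has not been verified in this thread.

```python
import pandas as pd


def named_agg_via_concat(df, by, **named):
    """Emulate groupby(by).agg(name=(column, func)) one output column at a time.

    Hypothetical helper for illustration only (plain pandas, not Beam).
    """
    grouped = df.groupby(by)
    # Each entry plays the role of SQL's "func(column) AS out_name".
    pieces = {
        out_name: grouped[column].agg(func)
        for out_name, (column, func) in named.items()
    }
    # The dict keys become the output column names.
    return pd.concat(pieces, axis=1)


# Hypothetical sample data; only the column names mirror the report.
df = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "program": ["a", "a", "a", "b"],
    "revenue": [10.0, 30.0, 5.0, 7.0],
})

out = named_agg_via_concat(
    df,
    ["quarter", "program"],
    total_spend=("revenue", "sum"),
    avg_spend=("revenue", "mean"),
)
print(out)
```

Because pd.NamedAgg is a namedtuple of (column, aggfunc), the same unpacking also accepts explicit pd.NamedAgg values in place of plain tuples.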

vineetg3 commented 6 months ago

Hi @tvalentyn, do you think this should still be labelled good-first-issue? I am planning to take it up, but it seems like a tough one.

tvalentyn commented 6 months ago

Thanks all. Yes, it might be a bit more involved, although I haven't looked very closely. At minimum, we should probably defer this until we finish adding pandas 2 support, which is the work @caneff is doing now.