coiled / dask-community

Issue tracker for the Dask community team

[Discourse] How to get groupby group names with Dask.Dataframes #397

Open github-actions[bot] opened 2 years ago

github-actions[bot] commented 2 years ago

I am looking either to iterate over groups to get the name of each group, or to have an accessor for the group name in a `groupby.apply()`.

The first goal would be an answer like this: [python - Using groupby group names in function - Stack Overflow](https://stacko…)


Would you like to know more?

Read the full article on the following website:

https://dask.discourse.group/t/how-to-get-groupby-group-names-with-dask-dataframes/298
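For context, the pandas behaviour being asked about looks roughly like this (a minimal sketch, not taken from the thread): inside `GroupBy.apply`, pandas exposes the group key as the `.name` attribute of each group.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2] * 5, 'y': range(10)})

# In plain pandas, each group passed to apply carries its key as `.name`
result = df.groupby('x').apply(lambda g: g.name)
print(result)  # Series mapping each group key to itself
```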

pavithraes commented 2 years ago

@scharlottej13 Hi! Continuing our discussion from #398

> @scharlottej13: An interesting observation here: if the number of partitions of the original dataframe is less than or equal to the number of groups, then `x.name` works...?

I was just coming to nearly the same conclusion! I wasn't able to break your snippet though, so maybe there's something else going on? I tried:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2] * 5, 'y': range(10)})  # only two distinct 'x' values but three partitions below, the case where x.name fails
ddf = dd.from_pandas(df, npartitions=3)
ddf.groupby('x').apply(lambda x: x.name, meta=('x', 'int64')).compute()

I also tried this example with strings to see if that made a difference:

import pandas as pd
import dask.dataframe as dd

n = 20
df = pd.DataFrame({'user': ['a', 'b', 'c'] * n,
                   'value1': [1, 2, 1] * n,
                   'value2': [20, 10, 20] * n})
ddf = dd.from_pandas(df, npartitions=4)
ddf.groupby('user').apply(lambda x: x.name, meta=('user_result', str)).compute()

Interesting, both of these examples fail on my machine with `AttributeError: 'DataFrame' object has no attribute 'name'` 😕 (Dask version: 2022.01.1, pandas version: 1.4.0).
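As an aside, one way to sidestep the missing `.name` attribute (just a sketch, not something suggested in the thread) is to read the grouping key from the group's own column inside `apply`, since that column is constant within each group:

```python
import pandas as pd
import dask.dataframe as dd

n = 20
df = pd.DataFrame({'user': ['a', 'b', 'c'] * n,
                   'value1': [1, 2, 1] * n,
                   'value2': [20, 10, 20] * n})
ddf = dd.from_pandas(df, npartitions=4)

# Each group handed to apply still contains the 'user' column,
# so its first value is the group name, regardless of npartitions
ddf.groupby('user').apply(
    lambda g: g['user'].iloc[0],
    meta=('user', object),
).compute()
```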

scharlottej13 commented 2 years ago

The question poster has updated their example, and I don't think our original suggestion will work. A feature request for pandas-style `groups` (or `__iter__`) support might be the best solution.
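For reference, the pandas API that such a feature request would mirror is roughly this (a sketch in plain pandas): `GroupBy.groups` maps each group name to its row labels, and iterating over a `GroupBy` object yields `(name, group)` pairs.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3] * 10, 'y': range(30)})
gb = df.groupby('x')

# Mapping of group name -> row labels belonging to that group
print(gb.groups)

# GroupBy.__iter__ yields (name, group) pairs
for name, group in gb:
    print(name, len(group))
```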

The options for a workaround are a bit limited since `set_index` and `sort_values` only accept a single column (which is why, e.g., `map_partitions` won't work). Here's a (not very good) snippet I came up with that returns something close to what they want:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3] * 10, 'y': range(30), 'z': ['a', 'b', 'c'] * 10})
ddf = dd.from_pandas(df, npartitions=4)
# sum is just an arbitrary function
test = ddf.groupby(['x', 'z']).aggregate(sum)
# what we really want is these index values
test['new_col'] = test.index.values
test.compute()
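A small variation on the same idea (again just a sketch, not part of the original reply): `reset_index()` moves those index values back into ordinary columns, which may be closer to what the poster is after.

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3] * 10, 'y': range(30), 'z': ['a', 'b', 'c'] * 10})
ddf = dd.from_pandas(df, npartitions=4)

# Aggregate as above, then turn the (x, z) group labels back into columns
result = ddf.groupby(['x', 'z']).aggregate('sum').reset_index()
print(result.compute())
```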

@pavithraes do you have any thoughts on this?

scharlottej13 commented 2 years ago

Updated my reply; using the index seems better, but I also encouraged them to submit a feature request.