Open github-actions[bot] opened 2 years ago
@scharlottej13 Hi! Continuing our discussion from #398
@scharlottej13 An interesting observation here, if the number of partitions of the original dataframe is less than or equal to the number of groups, then `x.name` works...?
I was just coming to nearly the same conclusion! I wasn't able to break your snippet though, so maybe there's something else going on? I tried:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2] * 5, 'y': range(10)})
# change '3' to '2' in the list and the code will fail
ddf = dd.from_pandas(df, npartitions=3)
ddf.groupby('x').apply(lambda x: x.name, meta=('x', 'int64')).compute()
I was also using this example with strings instead to see if that made a difference:
import pandas as pd
import dask.dataframe as dd
n = 20
df = pd.DataFrame({'user': ['a', 'b', 'c'] * n,
                   'value1': [1, 2, 1] * n,
                   'value2': [20, 10, 20] * n})
ddf = dd.from_pandas(df, 4)
ddf.groupby('user').apply(lambda x: x.name, meta=('user_result', str)).compute()
Interesting, both of these examples fail on my machine with AttributeError: 'DataFrame' object has no attribute 'name' 😕
dask version: 2022.01.1
pandas version: 1.4.0
The question poster has updated their example, and I don't think our original suggestion will work. A feature request for Pandas `groups` (or `__iter__`) might be the best solution.
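For context, here's roughly what `groups` and iteration give you in plain pandas. This is a minimal sketch on a made-up in-memory frame (the column names are only for illustration); it shows the pandas behaviour being asked for, not something the Dask groupby in this thread provides:

```python
import pandas as pd

# Illustrative data: three groups keyed by the (x, z) pair
df = pd.DataFrame({'x': [1, 2, 3] * 10, 'y': range(30), 'z': ['a', 'b', 'c'] * 10})
grouped = df.groupby(['x', 'z'])

# .groups maps each group name (a tuple here, since there are two keys)
# to the row labels that belong to that group
names_from_groups = sorted(grouped.groups.keys())

# Iterating over the groupby yields (name, sub-frame) pairs
names_from_iter = [name for name, _ in grouped]

print(names_from_groups)
print(names_from_iter)
```

Both lists hold the same three `(x, z)` tuples, which is the "group name" information the original question is after.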
The options for a workaround are a bit limited since `set_index` and `sort_values` only accept a single column (which is why `map_partitions`, e.g., won't work). Here's a (not very good) snippet I came up with that returns close to what they want:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3] * 10, 'y': range(30), 'z': ['a', 'b', 'c'] * 10})
ddf = dd.from_pandas(df, npartitions=4)
# sum is just an arbitrary function
test = ddf.groupby(['x', 'z']).aggregate('sum')
# what we really want is these index values
test['new_col'] = test.index.values
test.compute()
@pavithraes do you have any thoughts on this?
Updated my reply; using `index` seems better, but I also encouraged them to submit a feature request.
I am looking to either iterate over groups to get names of each group or to have an accessor to the group name in a groupby.apply().
First goal would be an answer like this: [python - Using groupby group names in function - Stack Overflow](https://stacko…
https://dask.discourse.group/t/how-to-get-groupby-group-names-with-dask-dataframes/298