dask / dask-tutorial

Dask tutorial
https://tutorial.dask.org
BSD 3-Clause "New" or "Revised" License

01_dask_delayed groupby question #137

Closed wonderfreda closed 4 years ago

wonderfreda commented 4 years ago

While working on the exercise in this notebook on NYC airport average delays, I ran into a question about how to correctly "marry" delayed with the groupby function. If, instead of declaring a "groupby" variable in the code, I use delayed directly on the groupby:


sums = []
counts= []

for fn in filenames:
    df = delayed(pd.read_csv)(fn)

    total = delayed(sum)(df.groupby('Origin').DepDelay)
    count = delayed(count)(df.groupby('Origin').DepDelay)

    sums.append(total)
    counts.append(count)

sums, counts = compute(sums, counts)

Then this block of code gives me the error message:

TypeError: 'Series' object is not callable

This is again probably not directly related to the delayed function here but what would be the correct way to layer "delayed" function on top of groupby? Thank you very much!

mrocklin commented 4 years ago

You have a variable named count. Do you also have a function named count? Maybe having both of these with the same name is causing some confusion.
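For illustration, here is a minimal sketch of how name shadowing can produce exactly this message. It assumes (hypothetically, since the rest of the notebook is not shown) that an earlier cell left `count` bound to a pandas Series rather than to a function:

```python
import pandas as pd

# Hypothetical leftover from an earlier notebook cell:
# `count` is bound to a Series, not to a function.
count = pd.Series([1, 2, 3])

try:
    # Calling the Series as if it were a function fails.
    count([4, 5, 6])
except TypeError as exc:
    message = str(exc)

print(message)  # 'Series' object is not callable
```

Renaming the variable, or restarting the kernel so that stale names disappear, would rule this cause out.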


wonderfreda commented 4 years ago

Thanks for looking into this, but I don't think that's causing the error. I removed the "count" line and kept only the sum call, and still got an error when trying to compute the delayed output.

The code I ran is:

from dask import compute

sums = []

for fn in filenames:
    df = delayed(pd.read_csv)(fn)
    total = delayed(sum)(df.groupby('Origin').DepDelay)
    sums.append(total)

sums

And the output sums are indeed delayed (see the screenshot of the output below):

[screenshot: the sums list, showing Delayed objects]

However, when I try to "compute" the sums in the next step by invoking compute(sums), I get this error message:

[screenshot: error traceback]

Any suggestion on how to fix this (short of declaring a groupby variable first as shown in the solution) is greatly appreciated!

TomAugspurger commented 4 years ago

This looks a bit strange:

>>>  total = delayed(sum)(df.groupby('Origin').DepDelay)

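That calls the built-in sum function on a pandas SeriesGroupBy object, which doesn't support that. Iterating over a groupby yields (name, group) tuples, so sum ends up trying to add a tuple to its integer start value of 0: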

In [8]: import dask.delayed

In [9]: import pandas as pd

In [10]: df = pd.DataFrame({"A":  [1, 2, 3, 4], "B": [0, 0, 1, 1]})

In [11]: sum(df.groupby("B").A)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-b8b47de6ecd0> in <module>
----> 1 sum(df.groupby("B").A)

TypeError: unsupported operand type(s) for +: 'int' and 'tuple'

Let me know if that explanation doesn't make sense.
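One way around this is to call the groupby's own `.sum()` and `.count()` methods on the delayed object, rather than passing the groupby to the built-in `sum`. The following is a minimal sketch, not the tutorial's official solution. The small in-memory tables are hypothetical stand-ins for the CSV files, and `delayed(pd.DataFrame)(data)` takes the place of `delayed(pd.read_csv)(fn)`:

```python
import pandas as pd
from dask import compute, delayed

# Hypothetical stand-ins for the per-file CSV contents.
tables = [
    {"Origin": ["JFK", "LGA", "JFK"], "DepDelay": [10.0, 5.0, 20.0]},
    {"Origin": ["LGA", "JFK"], "DepDelay": [15.0, 30.0]},
]

sums = []
counts = []

for data in tables:
    # In the notebook this would be delayed(pd.read_csv)(fn).
    df = delayed(pd.DataFrame)(data)

    # Attribute access and method calls on a Delayed object are lazy too,
    # so the pandas groupby methods can be chained directly.
    by_origin = df.groupby("Origin").DepDelay
    sums.append(by_origin.sum())
    counts.append(by_origin.count())

# Run everything at once; each list now holds one pandas Series per table.
sums, counts = compute(sums, counts)

# Here the built-in sum is fine: it adds Series, aligned on the Origin index.
mean = sum(sums) / sum(counts)
print(mean)
```

Because the groupby is built inside the loop without a separate delayed variable, this stays close to the original attempt. Only the final aggregation step changes.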