Closed wonderfreda closed 4 years ago
You have a variable named count. Do you also have a function named count? Maybe having both of these with the same name is causing some confusion.
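To illustrate the kind of name clash being suggested here, the following is a minimal sketch in plain Python with no dask involved. The `count` helper and the input batches are hypothetical stand-ins, not code from the notebook. It only shows how rebinding a function's name inside a loop leads to a "not callable" error on the next iteration; whether this is what happened in the reported code is an open question.

```python
def count(values):
    """Hypothetical helper standing in for a user-defined `count` function."""
    return len(values)

results = []
error = None
for batch in ([1, 2], [3, 4, 5]):
    try:
        # After the first pass, the name `count` refers to an int, not the function.
        count = count(batch)
        results.append(count)
    except TypeError as exc:
        # The second pass tries to call that int and fails.
        error = exc
        break

print(results)
print(error)
```

Using a different name for the result (for example `n` instead of `count`) avoids the rebinding.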
On Tue, Nov 12, 2019 at 7:34 AM wonderfreda notifications@github.com wrote:
While working on the exercise on this note book on NYC airport average delays, I ran into a question on how to correctly "marry" delayed with groupby function. If, instead of declaring a "groupby" variable in the code, I use delayed directly on the groupby:
sums = []
counts = []
for fn in filenames:
    df = delayed(pd.read_csv)(fn)
    total = delayed(sum)(df.groupby('Origin').DepDelay)
    count = delayed(count)(df.groupby('Origin').DepDelay)
    sums.append(total)
    counts.append(count)

sums, counts = compute(sums, counts)
Then this block of code gives me the error message:
TypeError: 'Series' object is not callable
This is again probably not directly related to the delayed function here but what would be the correct way to layer "delayed" function on top of groupby? Thank you very much!
Thanks for looking into this, but I don't think that's causing the error. I removed the "count"-related line, kept only the sum function, and still got an error when trying to compute the delayed output.
The code I ran is:
from dask import compute
sums = []
for fn in filenames:
    df = delayed(pd.read_csv)(fn)
    total = delayed(sum)(df.groupby('Origin').DepDelay)
    sums.append(total)
sums
And the output sums are indeed delayed (see the screen shot of output below):
However, when I try to "compute" the sums in the next step by invoking compute(sums), I got the error message:
Any suggestion on how to fix this (short of declaring a groupby variable first as shown in the solution) is greatly appreciated!
This looks a bit strange
>>> total = delayed(sum)(df.groupby('Origin').DepDelay)
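That calls the built-in sum function on a pandas SeriesGroupBy object, which doesn't support that. Iterating over a SeriesGroupBy yields (group key, Series) tuples, so sum ends up trying to add 0 to a tuple: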
In [8]: import dask.delayed
In [9]: import pandas as pd
In [10]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 0, 1, 1]})
In [11]: sum(df.groupby("B").A)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-b8b47de6ecd0> in <module>
----> 1 sum(df.groupby("B").A)
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
Let me know if that explanation doesn't make sense.
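As a possible way to layer delayed on top of groupby without a separate groupby variable, here is a minimal sketch. It relies on the fact that attribute access and method calls on a Delayed object are themselves lazy, so the pandas `.sum()` and `.count()` methods can be chained directly instead of wrapping the Python built-in `sum`. The `load_frame` function and its data are hypothetical stand-ins for `pd.read_csv(fn)` so the example runs without the tutorial's CSV files.

```python
import pandas as pd
from dask import compute, delayed


def load_frame(i):
    """Hypothetical stand-in for pd.read_csv(fn); returns a tiny made-up frame."""
    return pd.DataFrame(
        {
            "Origin": ["EWR", "JFK", "EWR", "LGA"],
            "DepDelay": [10.0 + i, 5.0, 20.0, 0.0],
        }
    )


sums = []
counts = []
for i in range(3):
    df = delayed(load_frame)(i)
    # Method calls on a Delayed object are recorded lazily, so the pandas
    # groupby methods are used here rather than the built-in sum().
    total = df.groupby("Origin").DepDelay.sum()
    n = df.groupby("Origin").DepDelay.count()
    sums.append(total)
    counts.append(n)

# Run everything in one pass; each list now holds one pandas Series per frame.
sums, counts = compute(sums, counts)

# Adding Series aligns on the Origin index, so the built-in sum() works here.
total_delays = sum(sums)
n_flights = sum(counts)
mean = total_delays / n_flights
print(mean)
```

With real data, `delayed(load_frame)(i)` would be replaced by `delayed(pd.read_csv)(fn)` as in the code above. Calling `df.groupby("Origin")` twice per file repeats the grouping step, which is presumably why the tutorial solution stores it in a variable first.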