dask / dask-tutorial

Dask tutorial
https://tutorial.dask.org
BSD 3-Clause "New" or "Revised" License

02_bag groupby vs foldby example fails due to known bug #213

Closed gmiretti closed 2 years ago

gmiretti commented 3 years ago

What happened:

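In the notebook 02_bags.ipynb, in the groupby vs foldby example that uses the account data, groupby does not collect all records for a name into a single group. The same name appears several times with partial counts, so the result differs from the equivalent foldby code.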

MVCE:

%run prep.py -d accounts
import json
import os
from dask.distributed import Client
import dask.bag as db

client = Client(n_workers=4)
filename = os.path.join('data', 'accounts.*.json.gz')
lines = db.read_text(filename)
js = lines.map(json.loads)  # parse each JSON line into a dict
# Warning, this one takes a while...
result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()
print(sorted(result))

Shows

[('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311), ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234), ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238), .... , ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284)]

What you expected to happen:

It should show the same output as this code using foldby:

# This one is comparatively fast and produces the same result.
from operator import add
def incr(tot, _):
    return tot + 1

result = js.foldby(key='name', 
                   binop=incr, 
                   initial=0, 
                   combine=add, 
                   combine_initial=0).compute()
print(sorted(result))

Output:

[('Alice', 1191), ('Bob', 900), ('Charlie', 910), ...., ('Zelda', 1083)]
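
As a quick sanity check on the two outputs, the partial counts that groupby reports for each name add up to the foldby totals, at least for the four names visible in the truncated listings. That suggests the groups are being split rather than records being lost, though this does not by itself identify the cause. A minimal sketch in plain Python (no dask required; `partial_counts` holds only the pairs copied from the groupby output above, and `merged` is a name chosen for illustration):

```python
from collections import Counter

# Visible (name, count) pairs copied from the groupby output above.
# The elided middle of that list is not reproduced here.
partial_counts = [
    ('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311),
    ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234),
    ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238),
    ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284),
]

# Sum the partial counts per name.
merged = Counter()
for name, count in partial_counts:
    merged[name] += count

print(sorted(merged.items()))
# [('Alice', 1191), ('Bob', 900), ('Charlie', 910), ('Zelda', 1083)]
```

The merged totals match the foldby output for these four names.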

Anything else we need to know?:

This appears to have been a known issue for a while. It was first reported in dask 2.20 and fixed in dask/dask#6640, but it is still open in dask/dask#6723.

I am filing the bug report here because the problem has been around for a while, and it took me some time to find these issues. As for what to do next, I would like to hear the maintainers' opinions. The options I see are:

Environment:

Also tested using Binder from the link in the README.

jsignell commented 2 years ago

This notebook has been removed from the most recent version of the tutorial, so I'll go ahead and close this issue. But thank you for reporting it!