In the notebook 02_bags.ipynb, in the groupby vs foldby example that uses the account data, groupby doesn't group all of the data and returns a different result than the equivalent foldby code.
MVCE:
%run prep.py -d accounts
import json
import os
from dask.distributed import Client
import dask.bag as db
client = Client(n_workers=4)
filename = os.path.join('data', 'accounts.*.json.gz')
lines = db.read_text(filename)
js = lines.map(json.loads)  # parse each line into a dict
# Warning, this one takes a while...
result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()
print(sorted(result))
It should show the same output as this code using foldby:
# This one is comparatively fast and produces the same result.
from operator import add
def incr(tot, _):
    return tot + 1
result = js.foldby(key='name',
                   binop=incr,
                   initial=0,
                   combine=add,
                   combine_initial=0).compute()
print(sorted(result))
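To check which of the two results is the correct one, a reference count that does not go through dask at all may help. The following is only a sketch and is not part of the notebook: `reference_counts` is a hypothetical helper name, and it assumes that each file matched by the pattern is gzip-compressed text with one JSON record per line and a `name` field, which is how the MVCE above reads the data.

```python
import glob
import gzip
import json
from collections import Counter


def reference_counts(pattern):
    """Count records per 'name' with plain Python, without dask.

    Hypothetical helper: assumes gzip-compressed files with one JSON
    record per line, each record having a 'name' field.
    """
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    counts[json.loads(line)["name"]] += 1
    return counts
```

Calling `sorted(reference_counts(filename).items())` with the same `filename` as above gives a list of `(name, count)` pairs that can be compared directly with the sorted groupby and foldby results.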
This appears to have been a known issue for a while. It was first reported against dask 2.20 and fixed in dask/dask#6640, but it is still open in dask/dask#6723.
I am filing the bug report here because the problem has been around for a while, and it took me some time to find these issues.
Regarding what to do next, I would like to know the maintainers' opinions. The options I see are:
wait for the fix, keeping this issue open until then
also reference this issue in the tutorial
change the tutorial to group by numerics
Environment:
Dask version: 2.20 and 2021.05.0
Python version: 3.8.10
Operating System: Ubuntu 18.04.5
Install method (conda): conda env create -f binder/environment.yml for 2.20 and conda update dask for 2021.05.0
Also tested using Binder from the link in the README