So first, the lack of shuffling on a single-partition dataframe is expected. There is only one partition, so everything can be done within that one pandas dataframe.
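As a rough illustration (a sketch with made-up toy data, not from the original report), a single-partition groupby-apply compiles to only a handful of tasks, so there is nothing to shuffle in the first place:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'uid': ['a', 'a', 'b'], 'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})
one = dd.from_pandas(pdf, npartitions=1)

result = one.groupby('uid').apply(lambda g: g[['x', 'y']].sum(), meta={'x': 'f8', 'y': 'f8'})
print(len(result.dask))   # only a handful of tasks; the whole thing runs inside one pandas call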
I'm going to walk through this problem narrating my thoughts. Hopefully this helps in future debugging.
One problem here is that the first call to read_parquet doesn't seem to collect its min/max index values, which is unfortunate (cc @martindurant):
In [12]: ddf = dd.read_parquet('/tmp/test.parquet')
In [13]: ddf.divisions # None means unknown here
Out[13]: (None, None)
And so when we repartition we get another dataframe, also with unknown partitions
In [14]: ddf2 = ddf.repartition(npartitions=100)
In [15]: ddf2.divisions[:10]
Out[15]: (None, None, None, None, None, None, None, None, None, None)
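(A quick aside: dask also exposes a boolean shortcut for this check, assuming the same ddf/ddf2 as above.)

print(ddf.known_divisions)    # False: read_parquet gave us no usable min/max
print(ddf2.known_divisions)   # still False after the repartition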
Of course, we would still expect the eventual set_index operation to check the values for sortedness, and a quick check shows that yes, the data is sorted.
In [16]: s = ddf2.uid.compute()
In [17]: s.is_monotonic_increasing
Out[17]: True
When I call the set_index operation it does appear to do the right thing, and the dataframe with the index set has only a linear number of extra tasks (3x in this case), not quadratic or n*log(n):
In [18]: ddf3 = ddf2.set_index('uid')
In [19]: ddf3
Out[19]:
Dask DataFrame Structure:
x y
npartitions=100
0008cc68-408d-4359-876b-d24bc7bbf124 float64 float64
02f51302-785d-460d-b6f9-5038c50e7e14 ... ...
... ... ...
fd6a1eeb-5613-40d1-9e4b-0b9fc75e8eda ... ...
fffa93f8-4802-4194-bc1d-76cf0372949d ... ...
Dask Name: sort_index, 302 tasks # <<------- there are hundreds of tasks here, not thousands
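As an aside (a sketch, not something from the session above): if you already know the column is sorted you can say so explicitly, and dask will skip the sortedness/quantile machinery and only compute the new divisions.

# Assuming the same ddf2 as above, with 'uid' known to be monotonically increasing:
ddf3 = ddf2.set_index('uid', sorted=True)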
Similarly for the groupby-apply:
In [23]: ddf3.groupby('uid').apply(sum)
/home/mrocklin/Software/anaconda/bin/ipython:1: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
#!/home/mrocklin/Software/anaconda/bin/python
Out[23]:
Dask DataFrame Structure:
x y
npartitions=100
0008cc68-408d-4359-876b-d24bc7bbf124 float64 float64
02f51302-785d-460d-b6f9-5038c50e7e14 ... ...
... ... ...
fd6a1eeb-5613-40d1-9e4b-0b9fc75e8eda ... ...
fffa93f8-4802-4194-bc1d-76cf0372949d ... ...
Dask Name: _groupby_slice_apply, 402 tasks
So in short, I can't reproduce. But hopefully by going through the process above you can identify where our paths diverge.
In [25]: c.get_versions()['client']
Out[25]:
{'host': [('python', '3.6.2.final.0'),
('python-bits', 64),
('OS', 'Linux'),
('OS-release', '4.13.0-26-generic'),
('machine', 'x86_64'),
('processor', 'x86_64'),
('byteorder', 'little'),
('LC_ALL', 'None'),
('LANG', 'en_US.UTF-8'),
('LOCALE', 'en_US.UTF-8')],
'packages': {'optional': [('numpy', '1.14.0'),
('pandas', '0.22.0'),
('bokeh', '0.12.14rc1'),
('lz4', '0.10.1'),
('blosc', '1.5.1')],
'required': [('dask', '0.16.1+42.gc736c53'),
('distributed', '1.20.2+60.gfd9a68c'),
('msgpack', '0.4.8'),
('cloudpickle', '0.5.2'),
('tornado', '4.5.2'),
('toolz', '0.9.0')]}}
Think I may have (partially) cracked it. I tried the steps you posted above and I see:
In [10]: ddf = dd.read_parquet('/tmp/fastparquet.parquet')
In [11]: print(ddf.divisions)
(0, 499999)
In[12]: ddf2 = ddf.repartition(npartitions=100)
In[13]: print(ddf2.divisions[:10])
(0, 4999, 9999, 14999, 19999, 24999, 29999, 34999, 39999, 44999)
In[14]: ddf3 = ddf2.set_index('uid')
In[15]: ddf3.groupby('uid').apply(sum)
Out[15]:
Dask DataFrame Structure:
x y
npartitions=100
0000ea10-7114-4f33-8559-676920266aca float64 float64
02b24265-fe22-4cfc-aa6f-88caa0cba5d1 ... ...
... ... ...
fd971227-cdac-4b7d-ad74-5550a230166b ... ...
fffa32cc-6cee-47e8-81a3-91cf873234f7 ... ...
Dask Name: _groupby_slice_apply, 3401 tasks
Quite different from your result, even though my package versions seem comparable (np==1.14, dask==0.16.1, distributed==1.20.2). But I realized I didn't specify the parquet engine in my example; this was using fastparquet, so I tried again with pyarrow:
In[23]: dd.read_parquet('/tmp/pyarrow.parquet').divisions
Out[23]: (None, None) # etc.
So I guess it's related to the storage engine? Not sure if that's expected behavior or not (also not sure if using compression would affect anything, but I specified None in my example).
Generally, fastparquet does compute the max/min of every column (although you can choose not to, at least in the direct fastparquet API), and if you use a column with ordered max/min as the index on load, then you should get known divisions.
I don't know about arrow.
I would have thought that having known divisions should, if anything, reduce the number of tasks rather than increase them. Certainly compression and other encoding options will have no effect on the graph planning, only on how long one given data-loading task might take.
parquet-cpp also computes min/max on every column by default. Probably we need to add more Python interfaces to access these values, but in general the divisions "discovery" should be engine-independent code.
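(Side note, hedged: in more recent pyarrow releases those per-row-group statistics are reachable from Python, roughly like this, reusing the path from the example above.)

import pyarrow.parquet as pq

md = pq.ParquetFile('/tmp/pyarrow.parquet').metadata
col_stats = md.row_group(0).column(0).statistics   # first column of the first row group
print(col_stats.min, col_stats.max)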
You can use ddf.persist() rather than c.persist(ddf) if desired.
OK, I'm explicitly specifying engine='fastparquet' now in both the to_parquet and read_parquet operations.
In [12]: ddf.divisions
Out[12]: (0, 499999)
In [13]: ddf = c.persist(ddf.set_index('uid'))
In [14]: ddf.divisions
Out[14]:
('00013b78-2281-4eba-8433-1319150a2248',
'fffc1e9a-4674-4a6f-9f18-8b62a51b87e9')
OK, so now we repartition to 100 partitions. Typically here we would just interpolate between the existing division values. Unfortunately, we don't have nice code to do interpolation on strings, and so we give up and resort to unknown divisions.
In [15]: ddf = c.persist(ddf.repartition(npartitions=100)) # no shuffle if we d
In [16]: ddf.divisions[:10]
Out[16]: (None, None, None, None, None, None, None, None, None, None)
Any groupby-apply on this will result in a full shuffle.
In [27]: ddf.groupby('uid').apply(sum)
Dask DataFrame Structure:
x y
npartitions=100
float64 float64
... ...
... ... ...
... ...
... ...
Dask Name: _groupby_slice_apply, 3400 tasks # <<-- note thousands of tasks
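One way to keep known divisions through the repartition is to compute explicit string boundaries ourselves and pass them in. A rough sketch, assuming the indexed, persisted frame from In [13] is still around under a (hypothetical) name like ddf_indexed; note that materializing the index does pull all of its values to the client, so this is only a sketch of the idea:

import numpy as np

idx = ddf_indexed.index.compute()                      # the sorted uuid strings
cuts = np.linspace(0, len(idx) - 1, 101).astype(int)   # 101 boundaries -> 100 partitions
ddf100 = ddf_indexed.repartition(divisions=list(idx[cuts]))

print(ddf100.known_divisions)   # True, so a groupby('uid').apply(...) on it can skip the shuffle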
Ironically we could avoid this by resetting and then re-setting the index. This would force Dask to look at the values in the column.
In [30]: ddf2 = ddf.reset_index()
...: ddf2 = ddf2.set_index('uid')
...: ddf2
Dask DataFrame Structure:
x y
npartitions=100
00013b78-2281-4eba-8433-1319150a2248 float64 float64
026170b9-b54a-4838-b295-4f27c12babf7 ... ...
... ... ...
fd818866-8b2c-43af-ab20-b4df229ee9ac ... ...
fffc1e9a-4674-4a6f-9f18-8b62a51b87e9 ... ...
Dask Name: sort_index, 400 tasks
We could develop code to interpolate between string index values. See code here.
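For concreteness, a toy illustration (plain Python, not dask code) of what interpolating between two string boundaries could look like: treat a fixed-width prefix over a restricted alphabet as a base-N number, interpolate numerically, and map back.

def interpolate_strings(lo, hi, n, width=8, alphabet='0123456789abcdef'):
    """Return n+1 evenly spaced strings between lo and hi (fixed-width prefixes only)."""
    base = len(alphabet)

    def to_int(s):
        s = (s + alphabet[0] * width)[:width]          # right-pad / truncate to fixed width
        return sum(alphabet.index(c) * base ** (width - 1 - i) for i, c in enumerate(s))

    def to_str(v):
        digits = []
        for _ in range(width):
            v, r = divmod(v, base)
            digits.append(alphabet[r])
        return ''.join(reversed(digits))

    a, b = to_int(lo), to_int(hi)
    return [to_str(a + (b - a) * i // n) for i in range(n + 1)]

print(interpolate_strings('00013b78', 'fffc1e9a', 4))   # 5 boundaries between the uuid prefixes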
Following up on https://stackoverflow.com/questions/48592049/dask-dataframe-groupby-apply-efficiency/48592529 with an example.
Read data w/ a sorted index column and perform a groupby; shouldn't require a shuffle:
Looks good! But re-running with the repartition uncommented:
Based on my reading of the docs it doesn't seem like repartition should be so problematic if my index is sorted; am I just misinterpreting, or is this unexpected behavior?