Open bsesar opened 2 years ago
fjetter commented: Thanks for this example. I'm currently trying to reproduce, but a few notes up front:

left_ddf = left_ddf.repartition(npartitions=40)

(40 is kind of random; you might even be better off with a smaller value since there are fewer rows left.)

bsesar replied: Thank you for the comments, @fjetter. My answers are below:
shuffle='tasks' does not seem to have a problem with them (i.e., merging takes about 2 minutes in that case). That being said, reducing the number of partitions in the left dataframe by half does reduce the merging time by half. So it seems that shuffle='disk' is quite sensitive to the number of partitions, while shuffle='tasks' is not as sensitive (or is simply too fast for the difference to be noticeable).

I used Nrows_right = 1038, Nrows_left = 99, and N_partitions_left = 600 in the MVE code. Merging with shuffle='tasks' takes 0.4 min, and merging with shuffle='disk' takes 2.4 minutes, a factor of 6 difference.

In the end, my point is that on the same machine with the same data and disk, merging with shuffle='tasks' is much faster than merging with shuffle='disk'.
What happened: When using shuffle='disk', merging took 50 minutes, compared to 2 minutes when using shuffle='tasks'. Also, the Dask dashboard was showing very low CPU utilization when using shuffle='disk'.

What you expected to happen: I expected merging with shuffle='disk' to be faster than merging with shuffle='tasks', since the load on the scheduler is supposed to be lower in that case.

Minimal Complete Verifiable Example:
EDIT: If the above MVE is too much for your machine, use Nrows_right = 1038, Nrows_left = 99, and N_partitions_left = 600. Merging with shuffle='tasks' then takes 0.4 min, and merging with shuffle='disk' takes 2.4 minutes, a factor of 6 difference (with n_workers=14 and threads_per_worker=1).

Environment: