When using from_delayed, we generate a single expression per Delayed object. This then generated a compound key of the Delayed object's key and a hard-coded 0: (self.obj.key, 0)
Due to this compound key, distributed picks the Delayeds key as the task group, which generates a single task group per Delayed object.
import dask.dataframe as dd
import pandas as pd
import pprint
from dask import delayed
from distributed import Client
def foo(x):
return pd.DataFrame({"x": [x]})
delayeds = [delayed(foo)(i) for i in range(10)]
df = dd.from_delayed(delayeds)
with Client(set_as_default=True) as client:
result = df.compute()
pprint.pprint(client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.task_groups))
Problem
When using
from_delayed
, we generate a single expression perDelayed
object. This then generated a compound key of theDelayed
object's key and a hard-coded0
:(self.obj.key, 0)
Due to this compound key,
distributed
picks theDelayed
s key as the task group, which generates a single task group perDelayed
object.Impact
As described in https://github.com/dask/distributed/issues/8677, this can overwhelm the scheduler of a
distributed
cluster.