dask-contrib / dask-histogram

Histograms with task scheduling.
https://dask-histogram.readthedocs.io
BSD 3-Clause "New" or "Revised" License
23 stars 4 forks source link

dask_histogram creates MaterializedLayers which are slow to optimize over multiple partitions #72

Closed lgray closed 1 year ago

lgray commented 1 year ago

Long story short:

https://github.com/dask-contrib/dask-histogram/blob/main/src/dask_histogram/core.py#L728-L766 and https://github.com/dask-contrib/dask-histogram/blob/main/src/dask_histogram/core.py#L1019-L1049

create MaterializedLayers because we're passing dicts to HighLevelGraph.from_collections, which are then immediately turned into MaterializedLayers. MaterializedLayers cause optimization to scale with the number of input partitions, which is untenable for realistic dataset chunk multiplicities.

I have tried to use https://github.com/dask/dask/blob/main/dask/layers.py#L1090 (DataFrameTreeReduction) but that's resulted in some fairly odd outcomes that I don't quite understand.

This must be fixed before we even begin to think about telling people they should use this package for HEP analysis.

lgray commented 1 year ago

Closed via #74