feat: lazy taskgraph generation, multifills for dask-boost-histograms

lgray commented 7 months ago

This is mostly just logging for posterity, since it shows there is at least one solution to the issue. I'll get the problematic code to @martindurant as well so that we can properly characterize it.

So far:

- Build task graphs for dask-boost-histograms only when asked for, and cache the result (delicate!).
- Prototype passing tuples of arguments to dask-boost-histogram fills, so that multiple fills happen in a single staged layer.

This appears to have some nice scaling benefits, but we are still figuring out why.

I'm largely posting this PR to demonstrate what solves the memory and task-graph problems when approaching O(50k) fills. It is not a real solution yet.
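For readers skimming this later, the "build only when asked for, caching result" idea can be sketched roughly as below. This is a toy in plain Python, not the code in this PR: `LazyHistogram`, `sum_weights`, and the `builds` counter are hypothetical names, and the graph is just a dict of key to `(function, *args)` tuples, loosely modeled on a dask-style task dict. The "delicate" part is presumably the cache invalidation: any new fill has to drop the cached graph.

```python
def sum_weights(data, weight):
    """Hypothetical per-fill task: total weight contributed by one fill."""
    if weight is None:
        return float(len(data))
    return float(sum(weight))


class LazyHistogram:
    """Toy stand-in: records staged fills, builds the task graph on demand."""

    def __init__(self):
        self._staged = []    # arguments of every fill staged so far
        self._graph = None   # cached graph; None means "not built yet"
        self.builds = 0      # how many times the graph was actually built

    def fill(self, data, weight=None):
        # Staging is cheap: no graph work happens here.
        self._staged.append((data, weight))
        # Invalidate the cache, otherwise a stale graph would be returned.
        self._graph = None
        return self

    @property
    def graph(self):
        # Build only when someone asks, then reuse the cached result.
        if self._graph is None:
            self.builds += 1
            self._graph = {
                ("fill", i): (sum_weights, data, weight)
                for i, (data, weight) in enumerate(self._staged)
            }
        return self._graph


h = LazyHistogram()
h.fill([1, 2]).fill([3], weight=[0.5])
builds_before_access = h.builds   # nothing built while only staging
first = h.graph                   # first access builds the graph
second = h.graph                  # second access hits the cache
builds_after_two_reads = h.builds
h.fill([4])                       # new fill invalidates the cache
third = h.graph
```

With many thousands of staged fills, the point of this pattern is that the graph is assembled once at the end rather than being rebuilt or extended on every `fill` call.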

lgray commented 7 months ago

Example of multi-fill syntax:

axes_fill_info_dict = {
    dense_axis_name : dense_variables_array_with_cuts["lep_chan_lst"][sr_cat][dense_axis_name],
    "weight"        : tuple(masked_weights),
    "process"       : histAxisName,
    "category"      : sr_cat,
    "systematic"    : tuple(wgt_var_lst),
}
hout[dense_axis_name].fill(**axes_fill_info_dict)

This shows a fill where we pass multiple weights corresponding to systematic variations. It takes a task graph that was ending with 6GB memory usage (per dataset) and brings it to O(1GB), and the time to build the task graph is significantly reduced as well. The number of fill calls drops from ~41k to ~1600, with many fewer layers, etc.
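To make the semantics concrete, here is a rough pure-Python sketch of what a tuple-valued fill is meant to do: one call, one staged layer, several weight variations. `ToyHist`, `n_layers`, and the unit-width binning are assumptions for illustration only and are not the dask-histogram implementation.

```python
from collections import Counter


class ToyHist:
    """Toy histogram keyed by (systematic, bin); counts staged layers."""

    def __init__(self):
        self.counts = Counter()
        self.n_layers = 0   # one "layer" per fill call

    def fill(self, values, weight, systematic):
        # Accept either a single variation or tuples of variations.
        if not isinstance(systematic, tuple):
            weight, systematic = (weight,), (systematic,)
        if len(weight) != len(systematic):
            raise ValueError("need one weight array per systematic")
        self.n_layers += 1
        for w_arr, syst in zip(weight, systematic):
            for v, w in zip(values, w_arr):
                # Unit-width bins: the bin index is the integer part.
                self.counts[(syst, int(v))] += w


values = [0.5, 1.5, 1.7]
weights = {
    "nominal": [1.0, 1.0, 1.0],
    "up": [2.0, 2.0, 2.0],
    "down": [0.5, 0.5, 0.5],
}

# One fill call per systematic variation: three layers.
separate = ToyHist()
for name, w in weights.items():
    separate.fill(values, weight=w, systematic=name)

# A single multi-fill call with tuples: one layer, same contents.
multi = ToyHist()
multi.fill(
    values,
    weight=tuple(weights.values()),
    systematic=tuple(weights.keys()),
)
```

Both histograms end up with identical contents, but the multi-fill version stages a single layer instead of one per variation, which is the kind of reduction in fill calls and layers described above.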

lgray commented 7 months ago

Multifill moved to #126, which supersedes this PR.

dask-contrib / dask-histogram

feat: lazy taskgraph generation, multifills for dask-boost-histograms #125