This appears to be somewhat the case, given that a fill that looks like this:
```python
import awkward as ak
import dask_awkward as dak
import dask_histogram as dhist
import dask_histogram.boost as dhb

dahist = dhb.Histogram(
    dhist.axis.StrCategory(
        ["central"]
        + [f"{unc}_up" for unc in jet_factory.uncertainties()]
        + [f"{unc}_down" for unc in jet_factory.uncertainties()],
        growth=True,
    ),
    dhist.axis.Regular(40, 0, 400),
    storage=dhist.storage.Weight(),
)

dahist.fill(dak.from_awkward(ak.Array(["central"] * len(ak.flatten(jets))), 1), dak.flatten(corrected_jets.pt))
for unc in jet_factory.uncertainties():
    dahist.fill(dak.from_awkward(ak.Array([f"{unc}_up"] * len(ak.flatten(jets))), 1), dak.flatten(corrected_jets[unc].up.pt))
    dahist.fill(dak.from_awkward(ak.Array([f"{unc}_down"] * len(ak.flatten(jets))), 1), dak.flatten(corrected_jets[unc].down.pt))
```
yields a task graph like the following (reduced to 11 variations, otherwise graphviz has an aneurysm):

The full version executes very quickly, as expected from this topology, and that seems largely good to go! Though I am not sure whether, in its current form, this will fix the memory-spiking issue, since for a normal growth axis the memory use would increase multiplicatively with each tree-reduce layer.
However, running this brought up some other issues. In particular, I don't seem to be able to use syntax like

```python
histogram.fill("some-category", some_array)
```

as is possible with boost-histogram, which leads to some ugly idioms when filling histograms, as you can see in the code above. Similarly, dask_histogram `StrCategory` axes with `growth=True` do not behave as they do in boost-histogram, and that behavior is badly needed.
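For reference, this is the kind of fill that does work with plain boost-histogram, where a scalar category string is broadcast across the array of dense values (a minimal sketch with a stand-in `pt` array):

```python
import numpy as np
import boost_histogram as bh

h = bh.Histogram(
    bh.axis.StrCategory([], growth=True),  # categories are added as they are filled
    bh.axis.Regular(40, 0, 400),
    storage=bh.storage.Weight(),
)

pt = np.random.exponential(50.0, size=1000)  # stand-in for a flattened jet-pt array
h.fill("central", pt)        # the scalar string is broadcast against pt
h.fill("jes_up", pt * 1.05)  # the growth axis picks up the new category
```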
All follow-on issues here have been addressed!
It's often the case in HEP analysis that scientists prefer to make a very large multidimensional histogram (6 dense dimensions, 20-30 bins per dense dimension) to store their reduced data, and this histogram will often have a large number of categories (between datasets and systematic variations). The full size of such a histogram often exceeds 2 GB and can easily reach 8 GB in practice, depending on the analysis! This creates a clear resource problem when the standard batch slot size at LHC experiments is 2 GB/core, and the most freely available batch slot is a single core with 2 GB. Larger allocations are usually penalized heavily in terms of normal users' batch priority and/or require waiting for node drainage to schedule.
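To make those numbers concrete, a rough back-of-the-envelope estimate (not a measurement) for the dense part alone with Weight() storage:

```python
# Weight() storage keeps a (sum of weights, sum of squared weights) pair per bin,
# i.e. 16 bytes per bin.
bins_per_axis = 20                  # 20-30 dense bins per axis in practice
dense_bins = bins_per_axis ** 6     # 6 dense dimensions
dense_gb = dense_bins * 16 / 1e9
print(f"{dense_gb:.2f} GB per dense block")  # ~1 GB at 20 bins per axis, ~12 GB at 30
# multiplying by the number of (dataset, systematic, ...) categories only makes it worse
```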
An immediate solution to this would be the ability to scatter the histogram across the aggregate compute resources, so that the complete histogram with all its categories is never materialized until it reaches the node the user submitted the task from in the first place (which tends to have tens of GB of memory available). Alternatively, the histogram parts could be written immediately to disk (possibly as a dask dataset as well).
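A minimal sketch of the write-to-disk variant, assuming plain dask.delayed and boost_histogram (the chunking, file names, and `fill_and_dump` helper are hypothetical, not an existing dask_histogram feature):

```python
import numpy as np
import dask
import boost_histogram as bh

@dask.delayed
def fill_and_dump(pt_chunk, path):
    # fill a partial histogram for this chunk only and persist its dense bins
    h = bh.Histogram(bh.axis.Regular(40, 0, 400), storage=bh.storage.Weight())
    h.fill(pt_chunk)
    np.save(path, h.view(flow=True))
    return path

chunks = [np.random.exponential(50.0, size=100_000) for _ in range(4)]  # stand-in data
paths = dask.compute(*[fill_and_dump(c, f"part-{i}.npy") for i, c in enumerate(chunks)])
# the submitting node (or a follow-up job) sums the saved views into the final histogram
```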
Since histograms distributed over their dense dimensions would be comparatively harder to reassemble, perhaps a first place to start would be allowing a histogram to be distributed over its categorical dimensions. That is, each tuple of categories `(dataset, systematic1, systematic2, ...)` would be a node in the dask task graph, and we send data to that node to fill the histogram instance for that category tuple. This would restrict the typical memory need to at most ~1 GB per histogram part (i.e. just the dense bins), which fits stably in these slots. The user is then free to collect those histogram parts on their local machine and merge them into the final data product (a rough sketch of what this could look like is at the end of this post).

I'm happy to discuss this further; this is really just writing down thoughts about what would grab the attention of many HEP physicists actively doing analysis (giant multi-dimensional histograms using too much memory is the most common mode of failure for awkward-based analyses). Structurally I don't think it would be too bad to generate such a structure in dask. I don't know the specifics of really pulling it off, but could likely do so with guidance. We could also probably hide it all beneath the current, fairly pleasant interface!
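As mentioned above, here is a minimal sketch of the per-category-tuple idea, using plain dask.delayed and boost_histogram rather than anything dask_histogram provides today; `datasets`, `systematics`, and the random stand-in data are placeholders:

```python
import itertools
import numpy as np
import dask
import boost_histogram as bh

datasets = ["ttbar", "wjets"]                    # stand-in category values
systematics = ["central", "jes_up", "jes_down"]

@dask.delayed
def fill_category(dataset, systematic):
    # one small single-category histogram per (dataset, systematic) tuple,
    # so only its dense bins are ever resident on a worker
    h = bh.Histogram(bh.axis.Regular(40, 0, 400), storage=bh.storage.Weight())
    pt = np.random.exponential(50.0, size=100_000)  # placeholder for the real jet pt
    h.fill(pt)
    return h

parts = {
    (d, s): fill_category(d, s)
    for d, s in itertools.product(datasets, systematics)
}
# gather the per-category pieces on the submitting node and merge them there
results = dask.compute(parts)[0]
```

The merge step on the submitting node is then just inserting each dense block into the corresponding categorical slice of the full histogram, which never has to exist on any worker.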