If you then add on MET corrections:
from coffea.jetmet_tools import CorrectedMETFactory

# extend the jet name_map with the MET fields
name_map["METpt"] = "pt"
name_map["METphi"] = "phi"
name_map["JetPhi"] = "phi"
name_map["UnClusteredEnergyDeltaX"] = "MetUnclustEnUpDeltaX"
name_map["UnClusteredEnergyDeltaY"] = "MetUnclustEnUpDeltaY"

met = events.MET
met_factory = CorrectedMETFactory(name_map)
corrected_met = met_factory.build(met, corrected_jets)
You'll find a 4655-node graph:
%time corrected_met.JES_AbsoluteStat.up.pt.compute(optimize_graph=False)
CPU times: user 2.24 s, sys: 86.7 ms, total: 2.33 s
Wall time: 2.28 s
vs.
%time corrected_met.JES_AbsoluteStat.up.pt.compute(optimize_graph=True)
CPU times: user 13.9 s, sys: 41.4 ms, total: 14 s
Wall time: 14 s
In both cases, when optimization is invoked, it collapses the graph to a single node.
Just chiming in whilst people are on holiday: the optimisation strategy currently includes an approach that reduces unneeded disk access by pruning unused inputs. The current approach (which I believe involves a brute-force optimisation strategy) has limitations that make it both slow and redundant if you're using most of the inputs. My guess is that this brute-forcing is what you're seeing here. You could test this by manually invoking the optimisation step on the graph and seeing if it hangs (a sketch follows after this comment).
It might also be that execution is actually slower, i.e. the optimised graph has longer wall-time. That's a separate kettle of fish.
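A minimal way to do that (a sketch, assuming the corrected_met collection from above; dask.optimize applies each collection's registered optimizations):

import dask

result = corrected_met.JES_AbsoluteStat.up.pt
print(len(result.__dask_graph__()))  # ~4655 keys before optimization
%time optimized = dask.optimize(result)[0]  # time the optimisation step alone
print(len(optimized.__dask_graph__()))  # a single node afterwards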
There's no disk access in this case since dak is reading from awkward arrays in memory. But... perhaps in January.
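For concreteness, a minimal sketch of what reading from awkward arrays in memory looks like (the toy array is illustrative, standing in for the real events):

import awkward as ak
import dask_awkward as dak

# a toy in-memory record array; dak.from_awkward wraps it as a
# dask-awkward collection without touching disk
arr = ak.Array({"pt": [[10.0, 20.0], [30.0]], "phi": [[0.1, 0.2], [0.3]]})
events = dak.from_awkward(arr, npartitions=1)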
I see that for the optimized graph, execution time is 500ms, so it is faster to run.
Profiling the optimization, all the time is taken by optimize_blockwise, a dask builtin, and not by the brute-force method mentioned by @agoose77 (which isn't on by default atm). In particular, dask.blockwise.subs is called 7.6M times. I would care to bet that it has always been "fast enough" so far, but that the rewriting strategy has N**2 (or worse) characteristics, and this is the first time anyone has worked with a high-level graph with 745 layers.
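A hedged sketch of how such a profile can be reproduced (optimize_blockwise lives in dask.blockwise; corrected_met is the collection from above):

import cProfile
from dask.blockwise import optimize_blockwise

result = corrected_met.JES_AbsoluteStat.up.pt
hlg = result.__dask_graph__()  # HighLevelGraph with ~745 layers
keys = result.__dask_keys__()
# most of the cumulative time should show up under dask.blockwise.subs
cProfile.runctx("optimize_blockwise(hlg, keys=keys)",
                globals(), locals(), sort="cumtime")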
I will see what can be done, short of turning off optimization altogether.
Side note: HLG optimization is particularly necessary when working with a distributed scheduler. With the threaded scheduler there is no serialization/transmit cost, and the overhead per task is very small: if you just convert the high-level graph to a low-level one, you get about the same execution time with or without further optimization. Of course, IO optimization (column pruning) would still make a big difference in many workloads.
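For example (a sketch, assuming the same collection as above), skipping graph optimization entirely under the threaded scheduler:

import dask

with dask.config.set(scheduler="threads"):
    corrected_met.JES_AbsoluteStat.up.pt.compute(optimize_graph=False)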
A further comment: although I have no idea what the example code is doing, I suspect it could be done in a single call to map_partitions, with a function that does all of that work on one piece of the input ak array; I don't immediately see any aggregation steps.
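Something along these lines (a sketch; correct_everything is a hypothetical stand-in for the correction logic, not an existing coffea function):

import dask_awkward as dak

def correct_everything(events):
    # hypothetical: apply the jet corrections and rebuild the MET on one
    # concrete awkward-array chunk, returning the corrected events
    return events  # placeholder; the real per-chunk logic would go here

corrected = dak.map_partitions(correct_everything, events)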
Ok that's interesting. I'm surprised it's so much slower in my case (I'm running on osx/arm64).
I also agree this is likely very easy to optimize with fusing! The 700 nodes were the outcome of doing a straight port from the awkward.virtual-based awkward1 implementation.
> I'm surprised it's so much slower in my case (I'm running on osx/arm64).
You misunderstand me: I had the same wall time as you; I was simply saying that the compute part (after optimizing) was faster as a result of optimizing, as one would hope. If only the optimization hadn't outweighed the benefit.
Ah, sorry, I understand now. That is good. We'll get it shipshape in no time, I am sure. :-)
So, after thinking about this a bit, what I hope to achieve optimization-wise is that, even after zipping the products together, when filling (systematic variations of) histograms we avoid recalculating other corrections and avoid fusing too many histogram-fill operations onto a single node.
I should also point out that this is strictly a subgraph of an analysis, not a complete one, and it's fundamentally a one-to-many mapping from uncorrected jets to O(10-50) variations of corrected jets. All the corrected jets come from the same set of initial corrections, which can certainly be fused calculation-wise (hopefully that result can be reused automatically). The variations are trivially parallelized starting from this base corrected jet. For the user interface we zip them all back together into a central estimate with a list of cloned, partially substituted (i.e. kinematically varied) jet variants (roughly the pattern sketched after this comment).
The final zip takes this rather splayed graph and pulls it all into one node (for a decent user interface, which we would like to preserve). If these varied 'leaves' each go into a histogram, is dask smart enough to parallelize the calculation of the variants and the filling of the histograms, starting from a common corrected-jet node?
I suppose one possible outcome is that just replicating the whole correction-and-variation stack is faster at scale, but we know for sure that locally caching the 'central' corrected jet is incredibly efficient.
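Roughly the zip pattern described above (a sketch; the variation fields assume the structure the corrected-jets factory produces, as used elsewhere in this thread):

import dask_awkward as dak

# one shared corrected-jet node feeding several varied leaves...
variations = {
    "nominal": corrected_jets,
    "JES_AbsoluteStat_up": corrected_jets.JES_AbsoluteStat.up,
    "JES_AbsoluteStat_down": corrected_jets.JES_AbsoluteStat.down,
}
# ...zipped back into a single user-facing collection
jets = dak.zip(variations, depth_limit=1)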
Good news: dask_histogram appears to have enough smarts to do this out of the box. A 111-category histogram fill also seems to spend most of its time in optimization and is otherwise mega-speedy. Getting it to yield a task-graph visualization seems to take a while, though.
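For instance (a hedged sketch, assuming a hist version with dask support built on dask_histogram; the axis and variation names are illustrative):

import hist.dask as hda
import dask_awkward as dak

h = (
    hda.Hist.new
    .StrCat(["nominal", "JES_AbsoluteStat_up"], name="variation")
    .Reg(50, 0, 500, name="pt")
    .Double()
)
# each fill reuses the common corrected-jet node in the task graph
h.fill(variation="JES_AbsoluteStat_up",
       pt=dak.flatten(corrected_jets.JES_AbsoluteStat.up.pt))
result = h.compute()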
I'll condense the graph these corrections generate so it's less sprawling; that should help many things.
Though there are some other things, over in the dask-histogram repository, that will need to be worked out (UI quirks, growth axes not working as expected).
For a limited set of systematics we get quite a nice condensed graph (it isn't very indicative of what's being filled, but that's OK; I checked that the resulting histogram is correct). So we are indeed optimizing as I would hope!
Starting a new topic since there is a concrete example:
You'll need to install coffea from this branch: https://github.com/CoffeaTeam/coffea/tree/awkward2_dev (pip install -e '.[dev]')
You'll need to install dask_awkward from this branch: https://github.com/lgray/dask-awkward/tree/feat_dak_with_field
Once you're there, go to the coffea install directory, spin up a Jupyter Lab Python 3 notebook, and paste the following into a cell:
and you'll get this rather amazing task graph with 743 keys. What's even more interesting is that the graph executes faster with optimization turned off!
optimization off:
optimization on:
Both results are correct! I'm sure we just need to think about when fuse and inline operations are allowed in awkward task graphs, especially for highly structured data, since the data's form implies where caching/inlining are helpful. Anyway, this gets us super close to having realistic analysis use cases already. The other major bit is the dak.from_buffers-style interface to relabel data and mutate interfaces.