
Efficient cloning of taskgraphs on top of new input partitions? #11435

Open lgray opened 4 days ago

lgray commented 4 days ago

Hi!

I'm a high energy particle physicist using dask, dask-awkward (https://github.com/dask-contrib/dask-awkward), and dask-histogram (https://github.com/dask-contrib/dask-histogram) to compute complex analyses over billions of rows, with many variations of systematic uncertainties, to create rich statistical models of our collider data.

The task graphs that we generate are often rather complex and can easily reach thousands of nodes before optimization for a single dataset, and we then construct the task graph for multiple datasets. This can be time consuming (tens of minutes for ~200 datasets!), resulting in a sluggish user experience that is largely due to repeated work. The structure of the task graph likely only needs to be calculated 2-3 times at most, depending on whether the dataset is signal simulation, background simulation, or experiment data.

Would it be possible to instead calculate the structure, which is the expensive part, once and then re-key the graph for new input partitions (rekeying all dependents of the inputs as well)? This should be much faster, since it's essentially walking the task graph, calculating some hashes, and creating the appropriate number of new partitions. This approach may even work for tree-reduction or repartitioning layers (in modes where you don't need to know the partition structure a priori), since these are often calculated only when the low level graph is finally materialized.
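
For concreteness, here is a rough sketch of the kind of re-keying I have in mind, written against a plain materialized {key: task} dict. The function and argument names (`rekey_graph`, `new_inputs`, `salt`) are made up for illustration, and a real version would also have to handle HighLevelGraph layers, differing partition counts, and metadata:

```python
# Illustrative sketch only -- not an existing dask API.
# `graph` is a materialized {key: task} dict, `new_inputs` maps the original
# input (leaf) keys to replacement tasks for the new dataset, and `salt`
# makes the cloned keys distinct from the original ones.
from dask.base import tokenize
from dask.core import get_dependencies, subs


def rekey_graph(graph, new_inputs, salt):
    def rename(key):
        name, *index = key if isinstance(key, tuple) else (key,)
        new_name = f"{name}-{tokenize(name, salt)}"
        return (new_name, *index) if index else new_name

    # one new name per old key; assumes the clone has the same partition counts
    translate = {key: rename(key) for key in graph}

    out = {}
    for key, task in graph.items():
        if key in new_inputs:
            # swap in the new input partition, keep the downstream structure
            task = new_inputs[key]
        else:
            # rewrite references to old keys so dependents point at the clone
            for dep in get_dependencies(graph, key):
                task = subs(task, dep, translate[dep])
        out[translate[key]] = task
    return out
```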

Has this operation already been implemented somewhere (I couldn't find anything)? Is this a reasonable thing to do? Is dask missing any components necessary to implement a strategy like this?

Thanks in advance!

mrocklin commented 4 days ago

My first suggestion would be to profile graph construction. That might give you a better understanding of what's slowing things down and give you ideas on where to focus effort. It could be that what you suggest is the best course of action, but if you haven't yet gone through a round of profiling and optimization I wouldn't be surprised if there was a 10x cost reduction in there somewhere.
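
For example, something like this (standard-library cProfile; `build_graph_for_dataset` is just a stand-in for whatever code builds your collections) usually shows pretty quickly where the time is going:

```python
import cProfile
import pstats

# Profile only the graph-construction step, no compute.
with cProfile.Profile() as prof:
    collections = build_graph_for_dataset(dataset)  # stand-in for your builder

# Look at the top cumulative-time entries: are the hot spots in core dask,
# or in dask-awkward / dask-histogram layer construction?
stats = pstats.Stats(prof)
stats.sort_stats("cumulative").print_stats(30)
```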


lgray commented 3 days ago

@mrocklin Thanks for the reply - we have already done one pass of extensive profile-guided optimization, which brought this process down from > 1 hour to the ~10 minutes quoted above.

mrocklin commented 3 days ago

I think the next question would then be "what is the slow part?"

At the end of the day, making layers of a Dask graph is just constructing Python dictionaries. How to make a Python dictionary quickly isn't something that we really control; it has much more to do with the code in the libraries you mention than with anything in the core dask library itself. If you think you can make a dictionary more quickly by creating a template first, then that makes sense to pursue, but it isn't something we have much say in.
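
As a rough illustration (the exact numbers will vary by machine), materializing even a very large layer as a plain dict is fast, so if construction takes minutes the time is almost certainly spent computing what goes *into* the dict rather than building the dict itself:

```python
import timeit

def build_layer(n, name="layer-0123abcd"):
    # a blockwise-style layer: one task per partition, referencing an upstream layer
    return {(name, i): (len, ("upstream-feedbeef", i)) for i in range(n)}

# constructing a million-key layer typically takes well under a second
print(timeit.timeit(lambda: build_layer(1_000_000), number=1))
```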

If there is some function in the Dask library that is particularly slow then we can look at that function. My guess though is that you'll find that most of the issue here is downstream of the core dask library.

lgray commented 3 days ago

I can pull up the profiles in a bit for you to see (I'm not at the right computer), but generally it boils down to this: we are doing a very large number of small operations, so building that dictionary and predicting the output types (a la dask-array) takes quite some time. More succinctly: our task graphs are just big, and that's kind of awkward to deal with.

Largely, it's the output type prediction that takes a significant amount of time, which is what gave me the idea for the proposal above and my interest in corresponding tools. We've already optimized the type tracing/prediction to the point of selectively removing correctness guarantees when we know they are met for other reasons. However, since we don't necessarily need to build the task graph again for the same data flow, it seemed prudent to instead rekey a graph that we know will be the same for a given dataset.
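
To make that concrete, what I'm imagining on our side is something like the following, where the expensive typetraced template is built once per dataflow category and then re-keyed per dataset (`template_for`, `build_template_graph`, and `new_inputs_for` are hypothetical helpers, and `rekey_graph` is the sketch from my first post):

```python
import functools

@functools.lru_cache(maxsize=None)
def template_for(category):
    # runs the full (slow) type tracing / graph construction once per
    # dataflow category: signal MC, background MC, or data
    return build_template_graph(category)  # hypothetical builder

def graph_for(dataset):
    template = template_for(dataset.category)
    # cheap: walk the template, hash new key names, swap in the new inputs
    return rekey_graph(template, new_inputs_for(dataset), salt=dataset.name)
```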

Joining all the datasets that share the same dataflow into one graph creates its own problems, since it's easy to end up in a situation where you have a few billion individual tasks even after graph optimization.

Revisiting the joining just now brings up another possible idea: if we did the joining only as a "logical" step and then "unjoined" the task graph at the end for actual submission, that could be a way around it. I can imagine it being a bit clunky for users, but that's more of an engineering problem than a fundamental one.