dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

an example that shows the need for memory backpressure #2602

Closed rabernat closed 3 years ago

rabernat commented 5 years ago

In my work with large climate datasets, I often concoct calculations that cause my dask workers to run out of memory, start dumping to disk, and eventually grind my computation to a halt. There are many ways to mitigate this by e.g. using more workers, more memory, better disk-spilling settings, simpler jobs, etc. and these have all been tried over the years with some degree of success. But in this issue, I would like to address what I believe is the root of my problems within the dask scheduler algorithms.

The core problem is that the tasks early in my graph generate data faster than it can be consumed downstream, causing data to pile up, eventually overwhelming my workers. Here is a self contained example:

import math

import dask.array as dsa

# create some random data
# assume chunk structure is not under my control, because it originates
# from the way the data is laid out in the underlying files
shape = (500000, 100, 500)
chunks = (100, 100, 500)
data = dsa.random.random(shape, chunks=chunks)

# now rechunk the data to permit me to do some computations along different axes
# this aggregates chunks along axis 0 and dis-aggregates along axis 1
data_rc = data.rechunk((1000, 1, 500))
FACTOR = 15

def my_custom_function(f):
    # a pretend custom function that would do a bunch of stuff along
    # axis 0 and 2 and then reduce the data heavily
    return f.ravel()[::15][None, :]

# apply that function to each chunk
# c0 is the total number of blocks in data_rc, so c1 is the per-block output length
c0 = data_rc.numblocks[0] * data_rc.numblocks[1] * data_rc.numblocks[2]
c1 = math.ceil(data_rc.ravel()[::FACTOR].size / c0)
res = data_rc.map_blocks(my_custom_function, dtype=data.dtype,
                         drop_axis=[1, 2], new_axis=[1], chunks=(1, c1))

res.compute()

(Perhaps this could be simplified further, but I have done my best to preserve the basic structure of my real problem.)

When I watch this execute on my dashboard, I see the workers just keep generating data until they reach their memory thresholds, at which point they start writing data to disk, before my_custom_function ever gets called to relieve the memory buildup. Depending on the size of the problem and the speed of the disks where they are spilling, sometimes we can recover and manage to finish after a very long time. Usually the workers just stop working.

This fail case is frustrating, because often I can achieve a reasonable result by just doing the naive thing:

for n in range(500):
    res[n].compute()

and evaluating my computation in serial.

I wish the dask scheduler knew to stop generating new data before the downstream data could be consumed. I am not an expert, but I believe the term for this is backpressure. I see this term has come up in https://github.com/dask/distributed/issues/641, and also in this blog post by @mrocklin regarding streaming data.

I have a hunch that resolving this problem would resolve many of the pervasive but hard-to-diagnose problems we have in the xarray / pangeo sphere. But I also suspect it is not easy and requires major changes to core algorithms.

Dask version 1.1.4

TomAugspurger commented 5 years ago

Sorry, I was off last week. Getting back to it this week. But a short status update.

This branch does two things

  1. Gradually learns the likely net memory usage of completing a task, by observing the memory use of the output less the memory use of all freed dependencies.
  2. Adjusts the Worker scheduling logic to prioritize memory-freeing tasks when we notice that we're under memory pressure.

Part 1 works well enough for now, I think (task annotations may help make things more exact in the future).
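
To make part 1 concrete, here is a rough sketch of the kind of bookkeeping described above (hypothetical code, not what the branch actually does): keep a smoothed estimate of each task prefix's net memory effect, which part 2 could then consult when a worker is under pressure.

# Hypothetical sketch of the part-1 idea, not the branch's actual code:
# learn each task prefix's likely net memory effect from observed output
# sizes minus the sizes of dependencies freed when the task completed.
from collections import defaultdict

class NetMemoryEstimator:
    def __init__(self, alpha=0.5):
        self.alpha = alpha                   # smoothing factor for the running average
        self.net_bytes = defaultdict(float)  # task prefix -> estimated net memory change

    def observe(self, prefix, output_nbytes, freed_dep_nbytes):
        net = output_nbytes - freed_dep_nbytes   # < 0 means the task frees memory overall
        self.net_bytes[prefix] += self.alpha * (net - self.net_bytes[prefix])

    def frees_memory(self, prefix):
        return self.net_bytes[prefix] < 0

# Part 2 (the hard part) would then, under memory pressure, prefer ready
# tasks whose prefix is estimated to free memory.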

I'm struggling with part 2. I'm having trouble finding an acceptable scheduling policy that doesn't totally kill performance in some workloads.

@MikeAlopis if you have additional examples where you're unexpectedly running into worker memory errors, I'd be happy to see them. I think I'll want to test this out on a diverse set of workloads.

mrocklin commented 5 years ago

Would it make sense to merge in part 1 before part 2?

TomAugspurger commented 5 years ago

Would it make sense to merge in part 1 before part 2?

For ease of review, yes: https://github.com/dask/distributed/pull/2847

TomAugspurger commented 5 years ago

I've been poking at this again. In the examples I've been using, the root cause of the memory blowing up is communication between workers. I see a few ways to address this

  1. Avoid the communication in the first place. This can (sometimes) be done by adjusting the scheduling. Dask's depth-first scheduling isn't the best for work patterns like
        a-1   a-2    a-3  a-4
          \   /       \   /
           b-1         b-2
            |           |
           c-1         c-2

    With two machines, we might schedule a-1 on machine 1 and a-2 on machine 2. Then we'll need to move the result of one to the other machine. We (maybe) would want to schedule a-1 and a-2 on the same machine. I have some thoughts on an API here, but haven't coded anything up yet. This wouldn't be a general solution, since many problems require some communication.

  2. Reduce (or disable) overlapping communication & compute, at least when memory-use is high. This is essentially the backpressure request, limited to "don't schedule new tasks when we're communicating".
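
A tiny sketch of what idea 2 could look like on the worker side (hypothetical helper, not distributed's real API): hold off launching new tasks while the worker is both high on memory and busy transferring data from peers.

# Hypothetical worker-side gate, assuming we know the current memory fraction
# and the number of in-flight transfers; thresholds are made up for illustration.
def should_start_new_task(memory_fraction, in_flight_transfers,
                          memory_threshold=0.7):
    if memory_fraction > memory_threshold and in_flight_transfers > 0:
        return False   # let the in-flight communication drain first
    return True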

mrocklin commented 5 years ago

We (maybe) would want to schedule a-1 and a-2 on the same machine

Indeed. This would be great, and would help to accelerate many workloads if we could make this general.

I have some thoughts on an API here, but haven't coded anything up yet.

I'm mildly against a short term solution here that accepts hints from a user. Historically we've solved all of these scheduling problems at the core so that they affect all uses, not a special few. This approach has been, so far, effective.

If we can find a way to make it cheap enough, something might find its way into the worker objective:

https://github.com/dask/distributed/blob/e02cc4409e352e40e4128fc542b8aaed51b5a01f/distributed/scheduler.py#L4560-L4575

There are things to balance here. It's an interesting problem.

TomAugspurger commented 5 years ago

worker_objective might be too late, at least for the kinds of problems I'm playing with right now.

For problems that start with many inputs, which can be processed (mostly) in parallel, we go down the not-ideal path in the elif self.idle branch here:

https://github.com/dask/distributed/blob/e02cc4409e352e40e4128fc542b8aaed51b5a01f/distributed/scheduler.py#L3612-L3630

Naively that makes sense. But for certain problems we're willing to pay a bit of time to find a better worker for each task.

@mrocklin do you know if we have any scheduling logic that looks at the "siblings" of a task? If I have (top to bottom)

        a-1   a-2  a-3   a-4  a-5  a-6
          \    |    /     \    |    /
              b-1             b-2

When I'm scheduling a-2, I'd like to look at where a-1 and a-3 are (or are likely to be) scheduled to run. In testing on a toy problem, assigning the task to the worker holding the most sibling tasks works.
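
A toy sketch of that heuristic (hypothetical helper, not scheduler code): given the graph's dependent relationships and the assignments made so far, pick the worker that already holds the most siblings of the task being scheduled.

from collections import Counter

def pick_worker_by_siblings(task, dependents, assignments, workers):
    """dependents: task key -> set of downstream task keys.
    assignments: already-scheduled task key -> worker name."""
    siblings = set()
    for downstream in dependents.get(task, ()):
        # every other input of the same downstream task is a "sibling"
        siblings |= {t for t, deps in dependents.items()
                     if downstream in deps and t != task}
    counts = Counter(assignments[s] for s in siblings if s in assignments)
    if counts:
        return counts.most_common(1)[0][0]  # worker holding the most siblings
    return workers[0]                       # no scheduled siblings yet

dependents = {"a-1": {"b-1"}, "a-2": {"b-1"}, "a-3": {"b-1"},
              "a-4": {"b-2"}, "a-5": {"b-2"}, "a-6": {"b-2"}}
assignments = {"a-1": "worker-1", "a-3": "worker-1", "a-4": "worker-2"}
pick_worker_by_siblings("a-2", dependents, assignments, ["worker-1", "worker-2"])
# -> "worker-1", keeping b-1's inputs together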

mrocklin commented 5 years ago

It gets even stranger, right? Not just siblings, but cousins of various levels of removed-ness :) And at some point you're going to want to cut the family tree.

I don't know of any scheduling logic that we have currently. If I were to start solving this problem I would start with pen and paper, write down a variety of situations, figure out how I would go about scheduling them in a few situations, and hopefully build out some intuition for the problem.

dougiesquire commented 5 years ago

Any updates or progress here? I’ve been experiencing symptoms that may be related to the same issue.

When running a groupby.mean operation on a relatively large dataset using a dask cluster the memory usage of each worker increases very rapidly, to values substantially larger than what I would expect based on the chunking. The issue is particularly severe when accessing and processing data stored on an ultra-fast parallel file storage system.

My example also uses xarray: [dashboard screenshot]

Interestingly, the amount of memory used per worker is a function of the number of workers I provide. For a single worker, the maximum memory load is about what I would expect for this persist (~45GB). For two workers, each worker maxes out at 80GB before the calculation finishes. For three workers, they start to spill to disk before the calculation completes (120GB per worker is the maximum I can request on the system I am using).

rabernat commented 5 years ago

Just chiming in to say this remains a persistent and serious problem for many pangeo users. We are keen to help resolve it but not sure the best way to engage.

mrocklin commented 5 years ago

We are keen to help resolve it but not sure the best way to engage.

I get the sense that at this point the right way to engage is to find people who are interested in low memory task scheduling and have them dive into Dask's heuristics.

@TomAugspurger you looked at this problem most recently. Are you able to summarize the current state of things and avenues for work as you see them?

TomAugspurger commented 5 years ago

I'm still poking at it, but all my attempts (https://github.com/dask/distributed/pull/2847, https://github.com/dask/distributed/pull/2940, and others that haven't made it to PR stage) have failed.

Thus far, it's seemed like the most common issues have been with communication-heavy workloads. So the two solutions I've attempted are better scheduling, so that less communication is needed, and better handling of communication when we are already high on memory.

Collecting workloads that fail is still helpful.

TomAugspurger commented 5 years ago

Turning to @dougiesquire's example (thanks for posting that), I'm roughly trying to replicate it with the following random data. For debugging, the arrays and the number of chunks are much smaller.

>>> import dask.array as da
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>>
>>> size = (3, 24, 96, 5, 9, 10)
>>> chunks = (1, 1, 96, 5, 9, 10)
>>> arr = da.random.random(size, chunks=chunks)
>>>
>>> items = dict(
...     ensemble=np.arange(size[0]),
...     init_date=pd.date_range(start='1960', periods=size[1]),
...     lat=np.arange(size[2]).astype(float),
...     lead_time=np.arange(size[3]),
...     level=np.arange(size[4]).astype(float),
...     lon=np.arange(size[5]).astype(float),
... )
>>> dims, coords = zip(*list(items.items()))
>>>
>>> array = xr.DataArray(arr, coords=coords, dims=dims)
>>> dset = xr.Dataset({'data': array})
>>> dset
<xarray.Dataset>
Dimensions:    (ensemble: 3, init_date: 24, lat: 96, lead_time: 5, level: 9, lon: 10)
Coordinates:
  * ensemble   (ensemble) int64 0 1 2
  * init_date  (init_date) datetime64[ns] 1960-01-01 1960-01-02 ... 1960-01-24
  * lat        (lat) float64 0.0 1.0 2.0 3.0 4.0 ... 91.0 92.0 93.0 94.0 95.0
  * lead_time  (lead_time) int64 0 1 2 3 4
  * level      (level) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
  * lon        (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
Data variables:
    data       (ensemble, init_date, lat, lead_time, level, lon) float64 dask.array<chunksize=(1, 1, 96, 5, 9, 10), meta=np.ndarray>

I see two issues. First, fuse_ave_width is again a very important configuration parameter to tune, and second, the presence of a single root task that's a dependency of all the open_zarr tasks causes issues with optimization. Consider the following four task graphs visualizing dset['data'].groupby("init_date.month").mean(dim="init_date"):

  1. From open_zarr, default fusion: [task graph: zarr-unfused]
  2. From open_zarr, with fuse_ave_width=200 (some large number): [task graph: zarr-fused]
  3. The random data (no to / from Zarr), default fusion: [task graph: random-unfused]
  4. The random data, with fusion: [task graph: random-fused]

The Zarr vs. random issue looks like https://github.com/dask/distributed/issues/3032. I'll see what xarray is doing in open_zarr. If it's using from_array I'll see what things look like with concat.

I'm not sure yet how that scales up to the larger datasets @dougiesquire is working with. Presumably too much fusing there would mean chunks that are far too large, but I haven't done the math yet.

See https://nbviewer.jupyter.org/gist/TomAugspurger/8c487aeef83fc018f6fed2b345b366c9

dougiesquire commented 5 years ago

Stupid question: My solution to issues of this type at the moment is to use a single, multi-core worker. This works great for fixing the memory issue, but obviously means I'm limited by the size of a single node. Is there a way to "turn off" inter-worker communication?

TomAugspurger commented 5 years ago

For now, your best bet may be to set the distributed.worker.connections.outgoing config value to some small number (3-10?).

https://github.com/dask/distributed/pull/3071 will effectively do that automatically when the worker is under high memory pressure, so that too much communication won't end up killing a high-memory worker, but you still get the benefit of communicating with many workers when memory is normal.
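
For anyone wanting to try this, one way to set that config value (assuming the key name above and dask's standard config mechanisms; the setting must be in place where the workers start so they pick it up):

import dask

# set the limit before creating the cluster / workers
dask.config.set({"distributed.worker.connections.outgoing": 4})

# or export it so spawned workers inherit it, e.g. in the shell:
#   export DASK_DISTRIBUTED__WORKER__CONNECTIONS__OUTGOING=4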

dougiesquire commented 5 years ago

For now, your best bet may be to set the distributed.worker.connections.outgoing config value to some small number (3-10?)

Thanks for this, @TomAugspurger. Unfortunately tweaking distributed.worker.connections.outgoing doesn't seem to make much difference at all.

TomAugspurger commented 5 years ago

@dougiesquire what does the dashboard show just before the workers are dying? High memory usage and a lot of communication (overlaid in red)?

dougiesquire commented 5 years ago

@dougiesquire what does the dashboard show just before the workers are dying? High memory usage and a lot of communication (overlaid in red)?

Yup - this is a screenshot of the dashboard as the workers start to die: [dashboard screenshot]

rabernat commented 5 years ago

the presence of a single root task that's the dependency of all the open_zarr tasks causes issues with optimization

What is this "root task"? Is it something we could eliminate on the xarray (or dask) side?

mrocklin commented 5 years ago

What is this "root task"?

By this I mean a task on which many other things depend. It is probably a reference to the Zarr file. You could probably optionally recreate the Zarr file in every task. In some cases that might slow things down. I'm not sure.

mrocklin commented 5 years ago

Folks here might want to take a look at https://github.com/dask/dask/pull/5451

Tom and I were playing with high level fusion, and this was an interesting outcome.

Thomas-Moore-Creative commented 5 years ago

I believe @dougiesquire has provided some example(s) but is there anything additional someone facing this issue could log or document that could assist with a solution?

mrocklin commented 5 years ago

Nothing comes to mind at the moment. This probably needs someone to think hard about the problem and general task scheduling policies for a while. Most of the people who do that regularly are pretty backed up these days.

TomAugspurger commented 4 years ago

Short summary a call with @dougiesquire and @Thomas-Moore-Creative yesterday, looking at the problem described in https://github.com/dask/distributed/issues/2602#issuecomment-534795074.

This is a common

  1. read from zarr
  2. group by month
  3. reduce

workload.

We eventually got things working well by setting fuse_ave_width to a sufficiently large value for the workload. We tried with Dask 2.9 and unfortunately the root fusion wasn't enough to fuse things correctly for this workload. I'm going to explore that a bit more today.
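
For reference, a hedged sketch of what "setting fuse_ave_width to a sufficiently large value" looks like for a workload shaped like the one above (the store path is a placeholder, and the config key spelling depends on the dask version: newer releases use the nested optimization.fuse.ave-width name, while older ones read a flat fuse_ave_width setting):

import dask
import xarray as xr

dset = xr.open_zarr("path/to/store.zarr")  # placeholder path

# widen fusion while the graph is optimized at compute time
with dask.config.set({"optimization.fuse.ave-width": 200}):
    monthly = dset["data"].groupby("init_date.month").mean(dim="init_date")
    result = monthly.compute()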

Thomas-Moore-Creative commented 4 years ago

@TomAugspurger - thank you for taking the time to look at the problem with us. Greatly appreciated and part of why we love this Pangeo community.

eriknw commented 4 years ago

I can narrowly comment on @TomAugspurger's example of "open_zarr with fuse_ave_width=200" that resulted in the zarr-fused task graph above.

Specifically, this doesn't fuse down to something like the "random data with fusion" (random-fused) graph because of the hold_keys optimization introduced in https://github.com/dask/dask/pull/2080.

Without hold_keys, the optimized DAG looks like this: [task graph]

To be honest, I still don't fully grok what we want to accomplish with hold_keys, if it's still necessary, or if it can be improved.

mrocklin commented 4 years ago

To be honest, I still don't fully grok what we want to accomplish with hold_keys, if it's still necessary, or if it can be improved.

Often we don't want to inline the array object into every task. This is the case with numpy arrays (they're faster to serialize on their own, and we don't want to copy the full array for every individual task), and for objects like h5py.Datasets which don't work well with pickle. It may be that with Zarr we do want to include the Zarr object within every task individually. This would be the case if it was very cheap to serialize.

eriknw commented 4 years ago

I don't see why hold_keys is necessary for any of what you just said. This may mean I don't understand what you just said, nor do I understand the previous discussions around hold_keys. I understand what the code hold_keys is doing, but I don't get why.

Furthermore, I suspect the graph from open_zarr with fuse_ave_width=200 likely represents a failure of graph optimization, and, for this, hold_keys is the culprit. Whatever the goal of hold_keys, I suspect it's better done as part of fuse. I would prefer to fix this optimization as a more general solution instead of repeating open_zarr in every task.

open_zarr above won't get fused, and it wouldn't get fused if it were a numpy array either.

It seems there are a couple considerations regarding what hold_keys tries to do with getters such as getitem.

  1. getters often return much smaller objects, so we'd sometimes rather send around the object after the getters are applied than before, and
  2. the objects getters are applied to may not be serializable, so, once again, we'd rather send the object after the getters are applied.

Is this correct? Is (2) a primary concern? The consequence of being unserializable is that all direct dependents must be executed on the same worker, right?

I believe the open_zarr example above is a failure case, because it results in sending the original objects to many, many tasks, and then the results of those tasks need to be sent around as part of a reduction. If the user knows their workflow and sets fuse_ave_width very high, then dask should respect that and use the graph with three branches that I shared in my previous post.

I think fuse can be improved if it's given two sets of keys that:

  1. are likely to reduce the size of data, and
  2. are unable to be serialized.

Thoughts?

mrocklin commented 4 years ago

Whatever the goal of hold_keys, I suspect it's better done as part of fuse

That could be. Looking at the docstring of hold_keys, one of the main objectives was to avoid fusing things like "x" below:

{"x": np.array([1, 2, 3, ...]),
 "y": (inc, "x"),
 "z": (dec, "x"),
  ...
}

We don't want "x" to be fused anywhere because it is easier to serialize as a raw value, rather than tucked away into a task.

Furthermore, I suspect the graph from open_zarr with fuse_ave_width=200 likely represents a failure of graph optimization, and, for this, hold_keys is the culprit.

I definitely agree that a width of 200 is a fail case. What are we trying to achieve with that example? I'm not sure that there is an optimization to be made?

For root fusion (I'm not sure that this is what we're talking about here or not) we would want to benefit in cases like the following:

x = da.open_zarr(...)
y = da.open_zarr(...)
z = x + y
z.sum().compute()

Ideally the matching chunks from x and y would be loaded on the same machine. This is easy to guarantee if they are part of the same task. Root fusion does this only when the tasks have no dependencies; in this case they do, which blocks things. I'm not sure that any of the problems presented above have this structure though, so that might not be what we're talking about here.

So, do we know what the ask is here? Do we have confidence that the problem presented above is likely to be resolved by improved optimization?

Is (2) a primary concern? The consequence of being unserializable is that all direct dependents must be executed on the same worker, right?

Not quite. We can serialize these things, but only if they're bare tasks, because then our custom serialization can take over, rather than having to rely on pickle.

takluyver commented 4 years ago

We've been seeing a number of weird errors (like killed workers) for which I think this problem might be the root cause.

We create dask arrays from HDF5 files, stack/concatenate them together to make one big array, and then typically do some slicing and dicing before a reduction step. We're using dask_jobqueue to launch workers through Slurm. As the data involved gets bigger, I believe it's trying to load data too quickly rather than pushing it through to the reduction step. What puzzles me is: what are we doing that's strange? It seems like exactly the kind of use case that dask.array is meant to handle, and clearly not everyone has this problem.

rabernat commented 4 years ago

We create dask arrays from HDF5 files, stack/concatenate them together to make one big array, and then typically do some slicing and dicing before a reduction step.

This is essentially the same pattern as climate science.

It seems like exactly the kind of use case that dask.array is meant to handle, and clearly not everyone has this problem.

Many of us do indeed have this problem. Users have developed several workarounds, but none are very satisfactory.

Yesterday @jcrist and I had an interesting chat about an idea that might help out. The idea was to have an explicit "fuse all upstream tasks" function. Maybe that will help us out a bit.

At this point, several institutions have significant amounts of money we could potentially throw at this problem, which has emerged as a serious roadblock for scaling dask. I would welcome ideas from the dask devs about how that money could be effectively deployed to fix this issue at its root cause.

takluyver commented 4 years ago

Thanks! I have observed that throwing more resources at the problem sometimes makes it go away. I'll explore the other approaches you mention.

I also work at such an institution - European XFEL - and my immediate boss is a big fan of supporting the ecosystem of open source tools. So if there were a concrete plan for how money could be applied to make a substantial improvement in this area, I'd push for EuXFEL to contribute. (Obviously this is just me talking, and I can't promise anything)

dougiesquire commented 4 years ago

@takluyver, depending on your HPC system and workflows, another potential (but also not very satisfactory) workaround may be to run your jobs using a single (multi-core) worker. This has helped me get through problematic jobs in the past.

mrocklin commented 4 years ago

At this point, several institutions have significant amounts of money we could potentially throw at this problem, which has emerged as a serious roadblock for scaling dask. I would welcome ideas from the dask devs about how that money could be effectively deployed to fix this issue at its root cause.

I apologize generally for not being able to resolve all of these problems recently. As Dask grows it gets harder to track everything personally. Money definitely helps to prioritize things though :) Thank you @rabernat for bringing this up. I'm starting to take on support contracts at Coiled, and hire some people to help work on them. Money goes a long way to giving us more bandwidth, especially on problems like this one that require some dedicated thinking. @rabernat @takluyver I'll reach out by e-mail.

rabernat commented 4 years ago

I thought I would mention that this sort of problem is currently being discussed on the Pangeo discourse forum: https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/9 Folks may find some relevant points in that thread.

itamarst commented 3 years ago

This seems like a good place to bring up a new option that I haven't seen discussed: Linux memory pressure statistics (https://www.kernel.org/doc/html/latest/accounting/psi.html). This is a fairly new metric, and it allows a fundamentally different approach to memory pressure.

Basically, it tells you "this is how much time you've spent waiting for the memory subsystem". So as memory resources get low, you start getting pressure feedback as swapping increases.

This means you don't have to try as hard to come up with estimates of memory usage in advance; you can be reactive to actual usage patterns as the code runs. You can imagine everything from "we should drop this cached data and recalculate" to load shedding to dynamically adjusting chunk sizes, based not just on guesses but on an actual measurement of the performance impact of current memory usage.
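
For concreteness, a minimal sketch of reading the system-wide PSI file (Linux 4.20+; the per-cgroup files mentioned below expose the same format):

def read_memory_pressure(path="/proc/pressure/memory"):
    # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    pressure = {}
    with open(path) as f:
        for line in f:
            kind, rest = line.split(None, 1)
            fields = dict(item.split("=") for item in rest.split())
            pressure[kind] = {key: float(value) for key, value in fields.items()}
    return pressure

# e.g. a worker could hold off scheduling new tasks while
# read_memory_pressure()["some"]["avg10"] stays above some threshold.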

https://www.youtube.com/watch?v=cSJFLBJusVY is a good talk about how Facebook is using it, at a different level of abstraction than Distributed would, but the same approach (relying less on overprovisioning and prediction and more on reacting to actual usage) would likely be quite useful in Distributed as well.

Other notes: there's also CPU and I/O (i.e. disk, I think?) backpressure info, and you can get the info per cgroup (==container).

apatlpo commented 3 years ago

Hi all, thank you very much for exposing this challenge @rabernat ! The issue was very helpful to our team.

I am not sure this is the right place for it, but I don't know where else to post it; apologies if it isn't.

We have been using for some time a helper method to sequentially process large datasets along embarrassingly parallel dimensions, and I'd like to share it: here

The method has proven very valuable over time, and the hope is that it can help other people. There are most likely other approaches to this, and I would very much like to hear about them.

gjoseph92 commented 3 years ago

So #4892 helps with this a lot. On my laptop (8-core Intel MBP, 32GB memory), @rabernat's original example at the top of this issue takes:

main (performance report): [dashboard screenshot]

#4892 (performance report): [dashboard screenshot]

I want to give a big shout-out to @JSKenyon, who so clearly identified the problem in #4864, and @fjetter who started us towards a solution. Dask had been scheduling the initial data-creation tasks in a very bad way, leading to huge amounts of unnecessary data transfer, and therefore data duplication, and therefore ballooning memory usage.

I made two changes before running these trials:

  1. Added a .copy() at the end of the "pretend memory-reducing function":

    def my_custom_function(f):
        # a pretend custom function that would do a bunch of stuff along
        # axis 0 and 2 and then reduce the data heavily
        return f.ravel()[::15][None, :].copy()  # <-- key: copy the slice to release `f`

    This is essential, because the small slice of f we were returning was just a view of f's memory. And that meant all of f's memory had to stick around—so the memory-reducing function was not reducing memory at all! The NumPy docs even mention this:

    Care must be taken when extracting a small portion from a large array which becomes useless after the extraction, because the small portion extracted contains a reference to the large original array whose memory will not be released until all arrays derived from it are garbage-collected. In such cases an explicit copy() is recommended.

    This is unintuitive behavior, but I wonder how often it happens unknowingly with Dask, and if it plays a role in real-world cases. I wonder if we could add logic to Dask to warn you of this situation, where the memory backing an array result is, say, 2x larger than the memory needed for the number of elements in that array, and suggest you make a copy. (A small NumPy illustration of this view-vs-copy behavior follows after this list.)

  2. Ran under jemalloc to avoid the unreleased memory issue (@crusaderky has explained this clearly in the Dask docs):

    DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python backpressure.py

    It probably would have still worked without this, but would have spilled more and been messier.
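
As promised above, a small NumPy-only illustration of the view-vs-copy trap (and of the kind of check such a warning could use):

import numpy as np

f = np.random.random((1000, 500))        # stand-in for one large chunk (~4 MB)
view = f.ravel()[::15][None, :]          # "reduced" result, but still a view of f
copy = view.copy()                       # explicit copy lets f's buffer be freed

owner = view
while owner.base is not None:            # walk to the array that owns the memory
    owner = owner.base
print(owner is f)                        # True: the view pins all of f's 4 MB
print(view.nbytes, f.nbytes)             # ~0.27 MB reported vs. 4 MB actually held
print(copy.base is None)                 # True: the copy owns its own small buffer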

It's also interesting that memory backpressure wouldn't have helped with this exact example snippet. Because of the view-vs-copy issue, that code was essentially trying to compute all 93 GiB of the original data into memory. No matter your order through the graph, that end result's just not possible.

Not to say that we won't ever need memory backpressure (xref #4891). But I wonder how many other cases that feel like Dask is producing more data than it can consume actually involve other sly bugs like the ones here. (Both the view-vs-copy and the poor ordering of root tasks were sly bugs, on behalf of a user and of dask.) And if there are tools/visualizations to help identify these issues that would serve us better than adding more complexity (memory backpressure) to try to work around them.

I'd love it if folks who have also run into these memory problems could retry their workloads with this update and report back on how things change.

itamarst commented 3 years ago

But I wonder how many other cases that feel like Dask is producing more data than it can consume actually involve other sly bugs like the ones here. ... And if there are tools/visualizations to help identify these issues that would serve us better than adding more complexity (memory backpressure) to try to work around them.

For issues that involve allocating too much memory or holding on to too much memory, you can figure out the sources of peak memory usage using the Fil memory profiler. Note that for now you'll need to use the threaded or single-threaded Dask schedulers; it doesn't support subprocesses yet.

gjoseph92 commented 3 years ago

I've tried rerunning a couple examples from this thread under the new PR. Performance is greatly improved.

@rabernat's canonical anomaly-mean example from https://github.com/dask/distributed/issues/2602#issuecomment-498718651

20-worker cluster with 2-CPU, 10GiB workers

@mrocklin originally said about this example:

I don't think that there is a way to compute that example in small memory without loading data twice. You have to load all of the data in order to compute clim. And then you need that data again to compute anom_mean.

The scheduling ask here is "be willing to delete and recompute data" while the scheduling ask before was "prefer tasks that release memory, and be willing to hold off a bit on tasks that increase net data".

This is still true. However in this example, data was only 200GiB, which is a reasonable size for a cluster (mine had exactly 200GiB total memory). So this still won't work on your laptop. But on a cluster, when we place the initial chunks of data well so they don't have to be shuffled around later, we can actually compute this quite efficiently.

@dougiesquire's climatological mean example from https://github.com/dask/distributed/issues/2602#issuecomment-535009454

20-worker cluster with 2-CPU, 20GiB workers

[dashboard screenshot]

FWIW, even with the colocation PR, this struggled when workers only had 8GiB of memory. Bumping to 20GiB smoothed things out (well, not for main). It would still be really cool if dask/distributed could tell you what amount of memory per worker was necessary to compute a given graph.

dougiesquire commented 3 years ago

Wow, this is really terrific @gjoseph92 ! Thanks heaps. Looking forward to giving it a spin.

Thomas-Moore-Creative commented 3 years ago

@gjoseph92 - that left-hand dashboard looks sadly familiar but the one on the right looks really exciting! Fantastic.

It would still be really cool if dask/distributed could tell you what amount of memory per worker was necessary to compute a given graph.

+1 @mrocklin how much crowd-sourcing do you think we need to make that happen? ;-)

RichardScottOZ commented 3 years ago

So a 200GiB problem required 400GiB of cluster memory to run smoothly? That seems consistent with what happens generally, in my experience.

It would be interesting to try that same test in the single-large-machine case - e.g. run the above on a 512GB machine and see what happens.

mrocklin commented 3 years ago

@mrocklin how much crowd-sourcing do you think we need to make that happen? ;-)

Have you filled out the dask survey yet? dask.org/survey

A lot of what we're working on now is motivated by responses there. Although this work itself comes from generous sponsorship from @rabernat and crew.

Thomas-Moore-Creative commented 3 years ago

It was (hopefully obviously) a cheeky comment, @mrocklin - but your response got me thinking about how other orgs (beyond @rabernat et al.) might try to help resource these efforts.

Thanks for all the hard work from all contributors. Going to hit that survey now.

mrocklin commented 3 years ago

Tell all your friends!

sjperkins commented 3 years ago

Really well done by all involved!

JSKenyon commented 3 years ago

This does look like a huge improvement! I look forward to trying it out on my problems.

dcherian commented 3 years ago

Exciting!

This is essential, because the small slice of f we were returning was just a view of f's memory. And that meant all of f's memory had to stick around—so the memory-reducing function was not reducing memory at all! This is unintuitive behavior, but I wonder how often it happens unknowingly with Dask, and if it plays a role in real-world cases. I wonder if we could add logic to Dask to warn you of this situation, where the memory backing an array result is, say, 2x larger than memory needed for the number of elements in that array, and suggest you make a copy.

@gjoseph92 This happens all the time! See https://github.com/dask/dask/issues/3595 (especially https://github.com/dask/dask/issues/3595#issuecomment-449544992), where @mrocklin proposed a solution. I use the map_blocks(np.copy) trick frequently.
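
For readers who haven't seen it, the trick looks roughly like this (sketch with made-up array sizes):

import numpy as np
import dask.array as da

data = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
small = data[::15]                     # each output block may be a view of a big input block
small = small.map_blocks(np.copy)      # copy the small blocks so the big ones can be released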

gjoseph92 commented 3 years ago

@dcherian thanks for the xref! https://github.com/dask/dask/issues/3595#issuecomment-451355007 is exactly what I was thinking. (It wouldn't help with user code in map_blocks like this case specifically, but since that's advanced, maybe that's okay.) I might like to take this on.

mrocklin commented 3 years ago

From a Coiled resourcing perspective, +1 on spending time on the getitem fix. That seems cheap to implement and medium-high value.
