an example that shows the need for memory backpressure

rabernat commented 5 years ago

In my work with large climate datasets, I often concoct calculations that cause my dask workers to run out of memory, start dumping to disk, and eventually grind my computation to a halt. There are many ways to mitigate this by e.g. using more workers, more memory, better disk-spilling settings, simpler jobs, etc. and these have all been tried over the years with some degree of success. But in this issue, I would like to address what I believe is the root of my problems within the dask scheduler algorithms.

The core problem is that the tasks early in my graph generate data faster than it can be consumed downstream, causing data to pile up, eventually overwhelming my workers. Here is a self contained example:

import dask.array as dsa

# create some random data
# assume chunk structure is not under my control, because it originates
# from the way the data is laid out in the underlying files
shape = (500000, 100, 500)
chunks = (100, 100, 500)
data = dsa.random.random(shape, chunks=chunks)

# now rechunk the data to permit me to do some computations along different axes
# this aggregates chunks along axis 0 and dis-aggregates along axis 1
data_rc = data.rechunk((1000, 1, 500))
FACTOR = 15

def my_custom_function(f):
    # a pretend custom function that would do a bunch of stuff along
    # axis 0 and 2 and then reduce the data heavily
    return f.ravel()[::15][None, :]

# apply that function to each chunk
c1 = math.ceil(data_rc.ravel()[::FACTOR].size / c0)
res = data_rc.map_blocks(my_custom_function, dtype=data.dtype,
                         drop_axis=[1, 2], new_axis=[1], chunks=(1, c1))

res.compute()

(Perhaps this could be simplified further, but I have done my best to preserve the basic structure of my real problem.)

When I watch this execute on my dashboard, I see the workers just keep generating data until they reach their memory thresholds, at which point they start writing data to disk, before my_custom_function ever gets called to relieve the memory buildup. Depending on the size of the problem and the speed of the disks where they are spilling, sometimes we can recover and manage to finish after a very long time. Usually the workers just stop working.

This fail case is frustrating, because often I can achieve a reasonable result by just doing the naive thing:

for n in range(500):
    res[n].compute()

and evaluating my computation in serial.

I wish the dask scheduler knew to stop generating new data before the downstream data could be consumed. I am not an expert, but I believe the term for this is backpressure. I see this term has come up in https://github.com/dask/distributed/issues/641, and also in this blog post by @mrocklin regarding streaming data.

I have a hunch that resolving this problem would resolve many of the pervasive but hard-to-diagnose problems we have in the xarray / pangeo sphere. But I also suspect it is not easy and requires major changes to core algorithms.

Dask version 1.1.4

mrocklin commented 3 years ago

I'm closing this out. If folks have other issues here though then I'm happy to reopen.

@rabernat , thank you for opening this originally. My apologies that it took so long to come to a good solution.

rabernat commented 3 years ago

Which dask release do users need to be running to get this feature? My impression is 2021.07.0, correct?

mrocklin commented 3 years ago

I think anything 2021.07 or greater should work. We're also releasing 2021.07.2 today if you wanted to wait a day :)

On Fri, Jul 30, 2021 at 6:12 AM Ryan Abernathey @.***> wrote:

Which dask release do users need to be running to get this feature?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/dask/distributed/issues/2602#issuecomment-889821511, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTCMKTQFEGPRUEWISCTT2KCJRANCNFSM4HEYFY2Q .

rabernat commented 3 years ago

I am working on a blog post to advertise the exciting progress we have made on this problem. As part of this, I have created an example notebook that can be run in Pangeo Binder (including with Dask Gateway) with two different versions of Dask which attempts to reproduce the issue and its resolution.

New Dask version (2021.07.1): https://binder.pangeo.io/v2/gh/pangeo-gallery/default-binder/39f7202?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgist.github.com%252Frabernat%252F39d8b6a396e076d168c24167b8871c4b%26urlpath%3Dtree%252F39d8b6a396e076d168c24167b8871c4b%252Fanomaly_std.ipynb%26branch%3Dmaster

Older Dask version (2020.12.1): https://binder.pangeo.io/v2/gh/pangeo-gallery/default-binder/b8d1c53?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgist.github.com%252Frabernat%252F39d8b6a396e076d168c24167b8871c4b%26urlpath%3Dtree%252F39d8b6a396e076d168c24167b8871c4b%252Fanomaly_std.ipynb%26branch%3Dmaster

The notebook uses the following cluster settings, which is the biggest cluster we can reasonably share publicly on Pangeo Binder:

nworkers = 30
worker_memory = 8
worker_cores = 1

It includes both the "canonical anomaly-mean example example" with synthetic data from https://github.com/dask/distributed/issues/2602#issuecomment-498718651 (referenced by @gjoseph92 in https://github.com/dask/distributed/issues/2602#issuecomment-870930134) as well as a similar real-world example that uses cloud data.

I have found that the critical performance issues are largely resolved even in Dask 2020.12.1 when I use options.environment = {"MALLOC_TRIM_THRESHOLD_": "0"}. For the synthetic data example, the older Dask version (pre #4967) is just a few s slower. For the real-world data example, the new version is significantly faster (2 min vs. 3 min), but nowhere close to the 6x increases reported above. But if I don't set MALLOC_TRIM_THRESHOLD_, both clusters crash. This leads me to conclude that, for these workloads, MALLOC_TRIM_THRESHOLD_ is much more important than #4967.

My questions for the group are:

Is this interpretation correct? Or am I missing something?
If it is correct that MALLOC_TRIM_THRESHOLD_ makes the biggest differences for these climatology-anomaly-style workloads, is there an alternative workflow that would better highlight the improvements of #4967 for the blog post?
Is there a downside to setting MALLOC_TRIM_THRESHOLD_=0 on all of our clusters?

(edit below)

@dougiesquire's climactic mean example from #2602 (comment)

20-worker cluster with 2-CPU, 20GiB workers

main + MALLOC_TRIM_THRESHOLD_=0: gave up after 30min and 1.25TiB spilled to disk (!!)

Co-assign root-ish tasks #4967 + MALLOC_TRIM_THRESHOLD_=0: 230s

I am now trying to run this example, which should presumably reproduce the issue better.

rabernat commented 3 years ago

Latest edit:

I ran a slightly modified version of the "dougiesquire's climactic mean example" and added it to the gist. The only real change I made was to also reduce over the ensemble dimension in order to reduce the total size of the final result--otherwise you end up with a 20GB array that can't fit into the notebook memory.

climactic mean example code

```python import pandas as pd size = (28, 237, 48, 21, 90, 144) chunks = (1, 1, 48, 21, 90, 144) arr = dsa.random.random(size, chunks=chunks) arr items = dict( ensemble = np.arange(size[0]), init_date = pd.date_range(start='1960', periods=size[1]), lat = np.arange(size[2]).astype(float), lead_time = np.arange(size[3]), level = np.arange(size[4]).astype(float), lon = np.arange(size[5]).astype(float), ) dims, coords = zip(*list(items.items())) array = xr.DataArray(arr, coords=coords, dims=dims) dset = xr.Dataset({'data': array}) display(dset) # reduce of "init_date" and "ensemble" to reduce final memory requirements. result = dset['data'].groupby("init_date.month").mean(dim=["init_date", "ensemble"]) %time result.compute() ```

Using MALLOC_TRIM_THRESHOLD_=0 and comparing Dask 2020.12.0 vs. 2021.07.1, I found 6min 49s vs. 4min 13s. This is definitely an improvement, but very different from @gjoseph92's results ("main + MALLOC_TRIMTHRESHOLD=0: gave up after 30min and 1.25TiB spilled to disk (!!)")

So I can definitely see evidence of an incremental improvement, but I feel like I'm still missing something.

RichardScottOZ commented 3 years ago

Thanks Ryan. Have not tested this on anything seriously large, but hopefully will soon.

dcherian commented 3 years ago

but I feel like I'm still missing something.

This thread is a confusing mix of many issues. The slice vs copy issue is real, but doesn't affect groupby problems since xarray indexes out groups with a list-of-ints (always a copy) [except for resampling which uses slices].

result = dset['data'].groupby("init_date.month")

This example (from @dougiesquire I think) has chunksize=1 along init_date which is daily frequency. Xarray's groupby construct a nice graph for this case so it executes well (it'll pick out all chunks with january data and apply dask.array.mean; this is embarrassingly parallel for different groups). If it doesn't execute well, that's up to dask+distributed to schedule it properly. Note that the writeup in https://github.com/ocean-transport/coiled_collaboration/issues/17 only discusses this example.

The earlier example:

cat = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml')
ds = cat.GODAS.to_dask()
print(ds)
salt_clim = ds.salt.groupby('time.month').mean(dim='time')

`ds.salt` has chunksize=4 along time with a monthly mean in each timestep so there are 4 months in a block Array	Chunk
11.31 GB	96.08 MB
(471, 40, 417, 360)	(4, 40, 417, 360)
119 Tasks	118 Chunks
float32	numpy.ndarray

So groupby("time.month") splits every block into 4 and we end up with a quadratic-ish shuffling-type workload that tends to work poorly (though I haven't tried this on latest dask)

The best way to reduce the GODAS dataset should be this strategy: https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts i.e. we index out months 1-4 which always occur together and map-reduce that. Repeat for all other cohorts (months 5-8, 9-12) and then concatenate for the final result.

One could test this with flox.xarray.xarray_reduce(ds.salt, ds.time.dt.month, func="mean", method="cohorts") and potentially throw in engine="numba" for extra fun =) Who wants to try it?!

gjoseph92 commented 2 years ago

Though this issue is closed, I imagine that most of you following it are still interested in memory usage in dask, and might still be having problems with it. It's possible that closing this with https://github.com/dask/distributed/pull/4967 was premature, but I'm hoping that https://github.com/dask/distributed/pull/6614 actually addresses the core problem in this issue.

If you are (or aren't) having problems with workloads running out of memory, please try out the new worker-saturation config option!

Information on how to set it is here. And please comment on the discussion to share how it goes:

https://github.com/dask/distributed/discussions/7128

These are benchmarking results showing significant reductions in peak memory use:

@rabernat's original example from this issue uses 50% less memory (but is slower)
@TomNicholas's https://github.com/dask/distributed/issues/6571 uses 50% less memory (and is faster since it doesn't spill)
@JSKenyon's example requiring co-assignment uses 30% less memory. But since this loses co-assignment, it also is slower.

We're especially interested in hearing how the runtime-vs-memory tradeoff feels to people. So please try it out and report back!

Lastly, thank you all for your collaboration and persistence in working on these issues. It's frustrating when you need to get something done, and dask isn't working for you. The examples everyone has shared here have been invaluable in working towards a solution, so thanks to everyone who's taken their time to keep engaging on this.

RichardScottOZ commented 1 year ago

Thanks @gjoseph92

dask / distributed

an example that shows the need for memory backpressure #2602

@dougiesquire's climactic mean example from #2602 (comment)