reillyja opened 5 months ago
Thanks for opening this issue @reillyja. I'll have a look later today. Meanwhile: @pearseb I'm not sure but this might relate to the work you are doing today?
A bit more info:
Conclusion: Doubling the CPUs more than halved the computation time (surprising and good news!)
Remaining Question: Does the array size limit scale linearly with CPUs? i.e., is the max array size for 8 CPUs 8 times 5.19MB?
@reillyja - this is likely an aside but from that quick chat.
The default local cluster on an ARE XX-Large session would give me 7 x 32GB workers. The following code gives me 4 x 56GB workers on the `normalbw` queue XX-Large:
```python
import dask
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)
client = Client(cluster)
```
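As a sanity check on those worker sizes: `LocalCluster` splits the available memory evenly among `n_workers`, so the 7 x 32GB and 4 x 56GB configurations both come out of the same ~224GB pool. A minimal stdlib sketch of that arithmetic (the 224GB total is inferred from the two configurations above, not a queried value):

```python
def worker_memory_gb(total_gb: float, n_workers: int) -> float:
    """Even split of node memory among dask workers -- how LocalCluster
    divides memory when no explicit memory_limit is given."""
    return total_gb / n_workers

# Both configurations above draw from the same ~224GB pool:
total = 224  # GB, inferred: 7 x 32GB = 4 x 56GB = 224GB
print(worker_memory_gb(total, 7))  # 32.0 (default ARE XX-Large layout)
print(worker_memory_gb(total, 4))  # 56.0 (n_workers=4 on normalbw)
```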
Attempting to simply plug a 4D ocean temperature array into `xmhw` will likely not work. This package is designed to "daskify" the MHW analysis by unravelling a 2D/3D dataset, distributing the grid cells among the workers, and computing the climatology/threshold on as many grid cells as can fit into the job's memory at any one time (as I understand it). However, when an array exceeds a certain size threshold (~100MB in the demo notebook on 16 CPUs), memory leaks / unmanaged-memory issues appear, workers are killed, and the job crashes. There is a workaround, shown in the mhw_pbs.py script: divide the array into chunks (say < 100MB) to avoid the memory leaks, then loop through those chunks. However, this turns an "embarrassingly parallel" job back into an iterative process.
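The outer loop in that workaround boils down to covering the unravelled grid with fixed-size index ranges. A small sketch of that chunking step (names are illustrative, not mhw_pbs.py's actual API):

```python
def iter_chunks(n_cells: int, chunk_size: int):
    """Yield (start, stop) index ranges covering n_cells grid cells,
    each at most chunk_size cells long -- the outer loop a
    mhw_pbs.py-style script runs to keep each piece of the array
    under the memory threshold."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)

# e.g. an unravelled 10,000-cell grid processed 3,000 cells at a time:
list(iter_chunks(10_000, 3_000))
# -> [(0, 3000), (3000, 6000), (6000, 9000), (9000, 10000)]
```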
Question: Is it possible we could use another `dask.delayed` approach to parallelise this outer for loop? I should note that `xmhw` already uses `dask.delayed` to distribute the tasks, and I believe I read somewhere that nesting `delayed` functions isn't best practice. But to me it makes some sense.

I think I'll try to determine the threshold array size for a single CPU that avoids memory issues - probably somewhere < 10MB - which to me is surprising given the 4GB of memory available per CPU on Gadi. (How can a computation like this blow up from 10MB to approach 4GB, especially when chunking is optimised (I think)?)
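On the nested-`delayed` point: the usual advice is to build one flat task list rather than wrapping one parallel layer inside another. A toy illustration of the flattening idea, using stdlib `concurrent.futures` as a stand-in so it runs without dask (with dask this would be a single list of `dask.delayed` calls handed to `dask.compute`; `analyse_cell` is a hypothetical placeholder for the per-cell work):

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_cell(cell_id: int) -> int:
    """Stand-in for the per-grid-cell climatology/threshold work."""
    return cell_id * cell_id  # placeholder computation

def run_flat(chunks):
    """Instead of nesting parallel layers (an outer loop over chunks,
    inner delayed tasks per chunk), flatten every cell from every
    chunk into one task list and submit it once, so the scheduler
    sees the whole embarrassingly parallel job at a time."""
    all_cells = [cell for chunk in chunks for cell in chunk]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(analyse_cell, all_cells))
```

Whether this helps with the memory blow-up is a separate question, but it avoids the nested-`delayed` pattern while keeping the chunked layout of the data.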