reillyja opened 5 months ago
Thanks for opening this issue @reillyja. I'll have a look later today. Meanwhile: @pearseb I'm not sure but this might relate to the work you are doing today?
A bit more info:
Conclusion: Doubling the CPUs more than halved the computation time (surprising and good news!)
Remaining Question: Does the array size limit scale linearly with CPUs? i.e., is the max array size for 8 CPUs 8 times 5.19MB?
@reillyja - this is likely an aside but from that quick chat.
The default local cluster on an ARE XX-Large session would give me 7 x 32GB workers. The following code gives me 4 x 56GB workers on the `normalbw` queue XX-Large:
```python
import dask
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)
client = Client(cluster)
```
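As a sanity check on those worker sizes: `LocalCluster` splits the available memory evenly among `n_workers`, so the 7 x 32GB and 4 x 56GB configurations both come out of the same ~224GB pool. A minimal stdlib sketch of that arithmetic (the 224GB total is inferred from the two configurations above, not a queried value):

```python
def worker_memory_gb(total_gb: float, n_workers: int) -> float:
    """Even split of node memory among dask workers -- how LocalCluster
    divides memory when no explicit memory_limit is given."""
    return total_gb / n_workers

# Both configurations above draw from the same ~224GB pool:
total = 224  # GB, inferred: 7 x 32GB = 4 x 56GB = 224GB
print(worker_memory_gb(total, 7))  # 32.0 (default ARE XX-Large layout)
print(worker_memory_gb(total, 4))  # 56.0 (n_workers=4 on normalbw)
```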
Attempting to simply plug a 4D ocean temperature array into `xmhw` will likely not work. This package is designed to "daskify" the MHW analysis by unravelling a 2D/3D dataset, distributing the grid cells among the workers, and computing the climatology/threshold on as many grid cells as can fit into the job's memory at any one time (as I understand it). However, when an array exceeds a certain size threshold (~100MB in the demo notebook on 16 CPUs), memory leaks / unmanaged-memory issues appear, workers are killed, and the job crashes. There is a workaround, shown in the mhw_pbs.py script: divide the array into chunks (say < 100MB) to avoid the memory leaks, then loop through those chunks. However, this turns an "embarrassingly parallel" job back into an iterative process.
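The outer loop in that workaround boils down to covering the unravelled grid with fixed-size index ranges. A small sketch of that chunking step (names are illustrative, not mhw_pbs.py's actual API):

```python
def iter_chunks(n_cells: int, chunk_size: int):
    """Yield (start, stop) index ranges covering n_cells grid cells,
    each at most chunk_size cells long -- the outer loop a
    mhw_pbs.py-style script runs to keep each piece of the array
    under the memory threshold."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)

# e.g. an unravelled 10,000-cell grid processed 3,000 cells at a time:
list(iter_chunks(10_000, 3_000))
# -> [(0, 3000), (3000, 6000), (6000, 9000), (9000, 10000)]
```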
Question: Is it possible we could use another `dask.delayed` approach to parallelise this outer for loop? I should note that `xmhw` already uses `dask.delayed` to distribute the tasks, and I believe I read somewhere that nesting `delayed` functions isn't best practice. But to me it makes some sense.

I think I'll try to determine the threshold array size for a single CPU that avoids memory issues - probably somewhere < 10MB - which to me is surprising given the 4GB of memory available per CPU on Gadi. (How can a computation like this blow up from 10MB to approach 4GB, especially when chunking is optimised (I think)?)
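On the nested-`delayed` point: the usual advice is to build one flat task list rather than wrapping one parallel layer inside another. A toy illustration of the flattening idea, using stdlib `concurrent.futures` as a stand-in so it runs without dask (with dask this would be a single list of `dask.delayed` calls handed to `dask.compute`; `analyse_cell` is a hypothetical placeholder for the per-cell work):

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_cell(cell_id: int) -> int:
    """Stand-in for the per-grid-cell climatology/threshold work."""
    return cell_id * cell_id  # placeholder computation

def run_flat(chunks):
    """Instead of nesting parallel layers (an outer loop over chunks,
    inner delayed tasks per chunk), flatten every cell from every
    chunk into one task list and submit it once, so the scheduler
    sees the whole embarrassingly parallel job at a time."""
    all_cells = [cell for chunk in chunks for cell in chunk]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(analyse_cell, all_cells))
```

Whether this helps with the memory blow-up is a separate question, but it avoids the nested-`delayed` pattern while keeping the chunked layout of the data.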