Closed · rsignell closed this issue 4 months ago
Someone other than me would need to look to give you a full answer, but I do see that memory util was very high on a lot of the workers, and this can make many things behave poorly. I see a lot of worker restarts, probably caused by very high memory.
@rsignell: Thanks for reporting this problem! I'm looking into it and will get back to you soon.
@ntabris, ah fascinating! Yes, I'm sure that's it -- I was experimenting with making the chunks bigger and bigger, which of course uses more memory. So there is a nice sweet spot here -- using 20 time steps in a chunk performs about the same as 30 time steps per chunk, but of course has a smaller memory footprint, small enough to fit comfortably within the small cheap instance type. Very cool!
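For context, a quick back-of-the-envelope calculation shows why larger time chunks inflate worker memory. This is a sketch only; the grid dimensions below are hypothetical, not taken from the actual dataset:

```python
# Rough memory estimate for a single dask chunk of a float64 variable.
# Grid dimensions (ny, nx) are made up for illustration; substitute your
# dataset's actual shape.
def chunk_nbytes(time_steps, ny, nx, itemsize=8):
    """Bytes held in worker memory for one (time, y, x) chunk."""
    return time_steps * ny * nx * itemsize

# e.g. on a hypothetical 2000 x 4000 grid:
for nt in (20, 30):
    mb = chunk_nbytes(nt, 2000, 4000) / 1e6
    print(f"{nt} time steps/chunk -> {mb:.0f} MB per chunk")
```

Since several chunks are typically resident per worker at once, the difference between 20 and 30 time steps per chunk compounds quickly against a small instance's memory.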
@rsignell: There are two issues with this cluster; the underlying problem was fixed in 2024.6.1. Please upgrade to dask>=2024.6.1 to avoid this in the future.[^1]

[^1]: I'm adding your failure scenario to our test suite to make sure this continues to work in future versions: https://github.com/dask/distributed/pull/8769
Thanks @hendrikmakait! This was a great learning experience for me. I've updated the libraries on my client side to upgrade to dask>=2024.7.0. (It will also be nice to stop getting the version-mismatch warnings.)
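A minimal sketch of the version-floor check implied above (in practice Coiled's own version-mismatch warnings compare the client and cluster environments for you; this just illustrates the comparison for dask's `YYYY.M.P` scheme):

```python
# Compare CalVer-style version strings like "2024.7.0" by tuple ordering.
# The floor "2024.7.0" matches the client-side upgrade mentioned above.
def version_tuple(v):
    return tuple(int(part) for part in v.split(".")[:3])

def meets_floor(installed, floor="2024.7.0"):
    """True if the installed version satisfies dask>=floor."""
    return version_tuple(installed) >= version_tuple(floor)

print(meets_floor("2024.6.1"))  # False: below the floor
print(meets_floor("2024.7.0"))  # True
```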
I was testing loading data into xarray on Coiled with different dask chunk sizes specified
https://cloud.coiled.io/clusters/525992/account/esip-lab/information?workspace=esip-lab
and for some reason the cluster just stopped working partway through the notebook -- it didn't die, but just stopped processing tasks. The cluster stayed alive for an hour before I realized it wasn't doing anything.
I killed the cluster, but I'd love to know why it stopped working -- to hopefully avoid this situation again!
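For reference, the chunking experiments described above trade task count against per-chunk memory: fewer, larger time chunks mean fewer dask tasks per variable, but each task holds more data on a worker. A minimal sketch of that bookkeeping (the total time-step count here is a made-up example, not from the actual dataset):

```python
import math

# Number of dask chunks along the time dimension for a given chunk size.
# total_steps is hypothetical; use your dataset's actual time length.
def n_time_chunks(total_steps, steps_per_chunk):
    return math.ceil(total_steps / steps_per_chunk)

total_steps = 600
for nt in (10, 20, 30):
    print(f"{nt} steps/chunk -> {n_time_chunks(total_steps, nt)} chunks")
```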