coiled / feedback

A place to provide Coiled feedback

Sleuthing why Cluster stopped working #287

Closed rsignell closed 1 month ago

rsignell commented 1 month ago

I was doing some testing of loading data using Coiled into xarray with different dask chunking specified

https://cloud.coiled.io/clusters/525992/account/esip-lab/information?workspace=esip-lab

and for some reason the cluster just stopped working partway through the notebook -- it didn't die, but it stopped processing tasks. The cluster stayed alive for an hour before I realized it wasn't doing anything:

[Screenshot 2024-07-15 085723: cluster dashboard showing no task activity]

I killed the cluster, but I'd love to know why it stopped working -- to hopefully avoid this situation again!
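For context, the notebook was doing something like this (a sketch; the cluster options, dataset path, variable name, and chunk sizes below are hypothetical stand-ins, not the actual values):

```python
import coiled
import xarray as xr

# Hypothetical cluster configuration for the experiment.
cluster = coiled.Cluster(name="esip-lab-chunk-test", n_workers=10)
client = cluster.get_client()

# Open the dataset lazily with a particular dask chunking along time.
ds = xr.open_dataset(
    "s3://example-bucket/example.zarr",  # hypothetical dataset location
    engine="zarr",
    chunks={"time": 20},  # varied across runs: 10, 20, 30, ...
)

# Trigger a computation so the workers actually load the chunks.
ds["temperature"].mean(dim="time").compute()  # hypothetical variable name
```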

ntabris commented 1 month ago

Someone other than me would need to look to give you a full answer, but I do see that memory utilization was very high on many of the workers, and that can make many things behave poorly. I also see a lot of worker restarts, likely caused by the high memory pressure.

hendrikmakait commented 1 month ago

@rsignell: Thanks for reporting this problem! I'm looking into it and will get back to you soon.

rsignell commented 1 month ago

@ntabris, ah fascinating! Yes, I'm sure that's it -- I was experimenting with making the chunks bigger and bigger, which of course uses more memory. So there is a nice sweet spot here: using 20 time steps per chunk performs about the same as 30 time steps per chunk, but has a smaller memory footprint -- small enough to fit comfortably within the small, cheap instance type. Very cool!
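Back-of-envelope, chunk memory scales linearly with the number of time steps (the grid dimensions below are made up, just to illustrate the arithmetic):

```python
import numpy as np

# Hypothetical spatial dimensions for one 2-D field; the real grid differs.
ny, nx = 2000, 4000
itemsize = np.dtype("float64").itemsize  # 8 bytes per value

for time_steps in (10, 20, 30):
    chunk_bytes = time_steps * ny * nx * itemsize
    print(f"{time_steps} time steps/chunk -> {chunk_bytes / 2**20:.0f} MiB per chunk")
```

Since each worker holds several chunks in memory at once (plus overhead), chunks need to fit comfortably, not just barely.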

hendrikmakait commented 1 month ago

@rsignell: There are two issues with this cluster:

  1. As @ntabris already pointed out, your workers ran into memory issues in your later experiments. This caused workers to be repeatedly paused or killed because they ran completely out of memory. That slows down your computations and can eventually fail them if chunks deterministically don't fit into memory, but it shouldn't deadlock. To avoid this, I'd recommend choosing chunk sizes that fit comfortably into memory.
  2. However, the worker restarts triggered a known deadlock (https://github.com/dask/distributed/issues/8702) that has been fixed in 2024.6.1. Please upgrade Dask to >=2024.6.1 to avoid this in the future (a quick version check is sketched below). [^1]

[^1]: I'm adding your failure scenario to our test suite to make sure this continues to work in future versions: https://github.com/dask/distributed/pull/8769
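A minimal way to verify the installed versions before rerunning (the pip pin just mirrors the fix release mentioned above):

```python
import dask
import distributed

# The deadlock fix (dask/distributed#8702) landed in 2024.6.1;
# anything at or above avoids it. To upgrade, e.g.:
#   pip install "dask[complete]>=2024.6.1" "distributed>=2024.6.1"
print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
```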

rsignell commented 1 month ago

Thanks, @hendrikmakait! This was a great learning experience for me. I've also updated the libraries on my client side to dask>=2024.7.0 (it will be nice to no longer see the version-mismatch warnings, too).
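For anyone finding this later, those mismatch warnings can also be surfaced explicitly; a minimal sketch (assumes a client connected to the cluster):

```python
from dask.distributed import Client

client = Client()  # or cluster.get_client() on a Coiled cluster

# Compares package versions across client, scheduler, and workers;
# with check=True it raises on a mismatch instead of just warning.
client.get_versions(check=True)
```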