Look into unresponsive event loop error in Dask

We are seeing CancelledError exceptions in our Dask workflows that are halting execution of some processes. When these occur, we find messages of the following sort in the worker logs:

2022-08-08 18:03:16,912 - distributed.core - INFO - Event loop was unresponsive in Worker for 7.53s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
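To get a sense of how often and how badly the loop stalls, the worker logs can be scanned for this message. The following is a minimal sketch, assuming the log lines keep the layout shown above. The `find_stalls` helper and the `STALL_PATTERN` regex are hypothetical names written for this issue, not part of Dask.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout, taken from the sample log line in this issue:
# "<date> <time> - distributed.core - <LEVEL> - Event loop was unresponsive in <Role> for <N>s."
STALL_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) - distributed\.core - \w+ - "
    r"Event loop was unresponsive in (?P<role>\w+) for (?P<seconds>\d+(?:\.\d+)?)s"
)


def find_stalls(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Return (timestamp, role, stall duration in seconds) for each matching log line."""
    stalls = []
    for line in lines:
        match = STALL_PATTERN.match(line)
        if match:
            stalls.append(
                (match["timestamp"], match["role"], float(match["seconds"]))
            )
    return stalls


sample_log = [
    "2022-08-08 18:03:16,912 - distributed.core - INFO - Event loop was "
    "unresponsive in Worker for 7.53s.  This is often caused by long-running "
    "GIL-holding functions or moving large chunks of data. This can cause "
    "timeouts and instability.",
    "some unrelated log line",
]

stalls = find_stalls(sample_log)
for timestamp, role, seconds in stalls:
    print(f"{timestamp}: {role} stalled for {seconds:.2f}s")
```

Lining the stall timestamps up against the times at which CancelledError exceptions appear may help narrow down which tasks are responsible.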

Stack Overflow posts on this warning suggest that there may be a timeout threshold that can be adjusted to alleviate (though surely not solve) the problem.

An immediate fix is suggested by the dask.distributed settings, namely distributed.admin.tick.limit (also settable as tick-maximum-delay in the Dask config.yaml). However, since these settings are not obviously exposed through dask-gateway, some investigation is needed to set the threshold correctly and to confirm that the setting has actually taken effect on the workers, in case the root problem continues.
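One possible route, sketched here under assumptions, is to pass the setting to the workers as an environment variable. Dask reads `DASK_`-prefixed variables into its config, with double underscores marking nesting. Whether our dask-gateway deployment lets us set worker environment variables is not confirmed. The `dask_env_var` and `read_tick_limit` helpers are hypothetical names, and `30s` is an arbitrary illustrative value.

```python
TICK_LIMIT_KEY = "distributed.admin.tick.limit"


def dask_env_var(key: str) -> str:
    """Translate a dotted Dask config key into its DASK_* environment variable name."""
    return "DASK_" + key.upper().replace(".", "__").replace("-", "_")


def read_tick_limit():
    """Report the tick limit seen by whichever process runs this function."""
    import dask  # imported here so the function can be shipped to workers

    return dask.config.get(TICK_LIMIT_KEY)


# Environment that would need to reach the worker processes.
# The value "30s" is an assumption, not a recommendation.
worker_env = {dask_env_var(TICK_LIMIT_KEY): "30s"}
print(worker_env)
```

With a connected `distributed.Client` named `client`, calling `client.run(read_tick_limit)` should return a dict mapping each worker address to the value that worker actually sees, which is one way to check that the setting was applied.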

These fixes address only the proximate symptoms, however. The true purpose of this issue is to identify which of our operations cause the CancelledError exceptions in the first place.

[ ] Provide an interim fix for the CancelledError exceptions
[ ] Determine the root cause of these exceptions

azavea / noaa-hydro-data

Look into unresponsive event loop error in Dask #91