azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Look into unresponsive event loop error in Dask #91

Closed jpolchlo closed 2 years ago

jpolchlo commented 2 years ago

We are seeing CancelledError exceptions in our Dask workflows that are halting execution of some processes. When these occur, we find messages of the following sort in the worker logs:

2022-08-08 18:03:16,912 - distributed.core - INFO - Event loop was unresponsive in Worker for 7.53s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
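To get a sense of how often and for how long the loop stalls, the durations can be pulled out of the worker logs. This is only a minimal sketch: it assumes the logs are available as plain text lines in the format shown above, and the function name `unresponsive_durations` is hypothetical rather than part of Dask.

```python
import re

# Matches the duration in messages such as
# "Event loop was unresponsive in Worker for 7.53s."
_UNRESPONSIVE = re.compile(
    r"Event loop was unresponsive in \w+ for (\d+(?:\.\d+)?)s"
)


def unresponsive_durations(log_lines):
    """Return the stall durations (in seconds) found in an iterable of log lines."""
    durations = []
    for line in log_lines:
        match = _UNRESPONSIVE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations


sample = [
    "2022-08-08 18:03:16,912 - distributed.core - INFO - Event loop was "
    "unresponsive in Worker for 7.53s.  This is often caused by long-running "
    "GIL-holding functions or moving large chunks of data.",
]
stalls = unresponsive_durations(sample)
```

Sorting or bucketing the returned values per worker would show whether the stalls cluster around particular tasks.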

Stack Overflow posts like this and this suggest that there is a timeout threshold that can be adjusted to alleviate (though, I'm sure, not solve) the problem.

The dask.distributed settings suggest an immediate workaround, namely distributed.admin.tick.limit (also settable as tick-maximum-delay in the Dask config.yaml). However, since these settings are not obviously exposed by dask-gateway, some investigation will be needed to set the threshold correctly (and to confirm that the setting has actually taken effect, in case the root problem continues).
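One possible route, sketched here under assumptions, is to pass the setting to workers as an environment variable, since Dask reads config from variables prefixed with `DASK_`, with `__` separating nesting levels. The helper name `dask_env_var` and the value `"30s"` are illustrative only, and whether dask-gateway lets us set worker environment variables depends on how the deployment is configured.

```python
def dask_env_var(key: str) -> str:
    """Translate a dotted Dask config key into its environment-variable name.

    Dask maps DASK_FOO__BAR_BAZ to the config key foo.bar-baz
    (hyphens and underscores are treated as equivalent).
    """
    return "DASK_" + key.upper().replace("-", "_").replace(".", "__")


# Illustrative threshold; the right value for our workloads is not known.
TICK_LIMIT = "30s"

# Environment that would need to reach the worker processes.
worker_env = {dask_env_var("distributed.admin.tick.limit"): TICK_LIMIT}
```

On a cluster where we control the worker process directly, calling `dask.config.set({"distributed.admin.tick.limit": "30s"})` before the workers start should have the same effect. Either way, it would be worth reading the value back on a worker with `dask.config.get` to be sure it was applied.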

These workarounds only address the proximate symptom, however; the true purpose of this issue is to investigate which of our operations are causing the initial CancelledError exceptions.

jpolchlo commented 2 years ago

This is most likely the result of insufficient RAM leading to unhelpful error messages. Unless and until we see CancelledErrors coming up again, we can close this.