douglatornell opened this issue 8 months ago (status: Open)
This appears to be due to the way the netCDF4 library behaves when dask workers use multiple threads. It has been described as a thread-safety issue (https://github.com/xCDAT/xcdat/issues/561#issuecomment-1969470260) and a file locking issue (https://github.com/pydata/xarray/discussions/8925#discussioncomment-9139316). In any case, the solution is to set the number of threads per dask worker to 1.
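The common thread in both descriptions is that the netCDF4 layer cannot tolerate concurrent callers, so access to it has to be serialized; running each dask worker with a single thread achieves that. A minimal stdlib sketch of the serialization idea (the `FakeFileWriter` class is hypothetical, standing in for the non-thread-safe netCDF4 layer; it is not real netCDF4 or dask API):

```python
import threading


class FakeFileWriter:
    """Stand-in for a non-thread-safe writer (hypothetical; illustrates the
    kind of unsynchronized shared state that makes concurrent calls unsafe)."""

    def __init__(self):
        self.records = []
        self._busy = False

    def write(self, record):
        # Unsynchronized check-then-act: can raise under concurrent callers
        if self._busy:
            raise RuntimeError("concurrent write detected")
        self._busy = True
        self.records.append(record)
        self._busy = False


lock = threading.Lock()
writer = FakeFileWriter()


def safe_write(record):
    # Serialize access with a lock -- the same effect, in miniature, as
    # giving each dask worker only one thread (--nthreads=1)
    with lock:
        writer.write(record)


threads = [threading.Thread(target=safe_write, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(writer.records))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

With the lock removed, the same program can intermittently raise `RuntimeError`, which mirrors the intermittent nature of the failures seen in the worker.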
When I changed the persistent dask cluster on salish to use `dask worker ... --nthreads=1 ...`, the problem disappeared for the SalishSeaNowcast `make_averaged_dataset` worker.
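For reference, the general shape of that worker launch is sketched below; `SCHEDULER_ADDRESS` is a placeholder for the salish cluster's actual scheduler address, and the other options elided above are site-specific:

```shell
# Launch a dask worker that joins the scheduler with a single thread,
# avoiding concurrent netCDF4 calls within the worker process.
dask worker SCHEDULER_ADDRESS --nthreads=1
```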
Intermittently, the SalishSeaNowcast `make_averaged_dataset` worker fails with a `KeyError` for one of the model variables in the dataset it is writing; an example traceback is below.

I suspect that this might be related to an issue I read about a while ago, in which opened dataset files are suspected of being closed before dask is finished with them (perhaps due to calling `xarray.open_mfdataset()` in a context manager?).

If this can't be resolved in Reshapr, it should at least be handled as a critical error in the `make_averaged_dataset` worker, so that the worker logs a critical message instead of failing.

Example traceback: