“Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.”
and then:
https://github.com/pydata/xarray/issues/3961
```python
import os
import uuid

import xarray as xr

# If you set lock=False then this runs fine every time.
# Setting lock=None causes it to intermittently hang on mfd.to_netcdf
with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
    p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
    print(f"Writing data to {p}")
    mfd.to_netcdf(p)
    print("complete")
```
If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However, if I set lock=False, it runs fine every time.
This seems related to a discussion about whether HDF5 is or is not thread-safe, and, accordingly, whether locking is or is not necessary.
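To illustrate what the lock is for, here is a minimal stdlib-only sketch of serializing access to a non-thread-safe library behind a single lock, which is conceptually what xarray's locking does around HDF5 calls. write_block and _hdf5_lock are hypothetical names, not from xarray or HDF5.

```python
import threading

# Hypothetical stand-in for a lock guarding an HDF5 handle. HDF5 is not
# thread-safe in typical builds, so callers serialize access behind one lock.
_hdf5_lock = threading.Lock()
results = []

def write_block(block_id):
    # Only one thread touches the (not thread-safe) library at a time --
    # conceptually what xarray's default locking aims to guarantee.
    with _hdf5_lock:
        results.append(block_id)

threads = [threading.Thread(target=write_block, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))
```

The bug under discussion is not that this pattern is wrong, but that the default lock appears to intermittently deadlock, whereas lock=False skips it entirely.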
Many claim that explicitly setting lock=False works around the hang. An occasional error may still be thrown (better than hanging forever), and some mitigate that by adding a one-second sleep somewhere, but that could add hours to the processing of each dataset.
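Another defensive pattern (a sketch of mine, not something proposed in the thread) is to run the write in a child process with a timeout, so a silent hang becomes a bounded, retryable failure. write_once is a hypothetical stand-in for the real mfd.to_netcdf(path) call.

```python
import multiprocessing as mp

def write_once(path):
    # Hypothetical stand-in for the real to_netcdf(path) call.
    # Must be a module-level function so it works with the spawn start method.
    return path

def write_with_timeout(path, timeout=60, retries=3):
    """Run a possibly-hanging write in a child process.

    If the child does not finish within `timeout` seconds it is killed
    and the write is retried, so a hang cannot stall the whole pipeline.
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, retries + 1):
        proc = mp.Process(target=write_once, args=(path,))
        proc.start()
        proc.join(timeout)
        if not proc.is_alive():
            return attempt  # child exited: write finished (or errored)
        proc.terminate()    # hung: kill the child and retry
        proc.join()
    raise TimeoutError(f"write of {path} hung {retries} times")

if __name__ == "__main__":
    print(write_with_timeout("/tmp/out.nc", timeout=10))
```

This trades process-spawn overhead for a hard upper bound on how long any single write can stall, which may be preferable to an unconditional sleep.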
Description

to_netcdf() hanging. Refer to the "Related Info" section below for more info.

Checklist

If applicable:
Related Info
Related lines of code (note how the dask config is set to "threads"): https://github.com/E3SM-Project/e3sm_to_cmip/blob/56b2d40b928f7f6fdc59ef6709813bfc05ba862b/e3sm_to_cmip/mpas.py#L306-L325
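Since the linked code uses the threaded dask scheduler, one diagnostic step (my suggestion, not from the linked code) is to force dask's single-threaded scheduler and see whether the hang disappears; if it does, the deadlock involves contention between worker threads and the HDF5 lock.

```python
import dask

# Diagnostic sketch: the "synchronous" scheduler runs every task in the
# main thread, so no worker threads can contend for the HDF5 lock.
dask.config.set(scheduler="synchronous")
# ...then run the open_mfdataset / to_netcdf pipeline as before...
```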
From @TonyB9000's email, 3/13/24 at 1:07 PM:
Hangs here: