E3SM-Project / e3sm_to_cmip

Tools to CMORize E3SM output
https://e3sm-to-cmip.readthedocs.io/en/latest/
MIT License

Update `mpas.open_mfdataset()` to use `lock=False` #249

Closed tomvothecoder closed 8 months ago

tomvothecoder commented 8 months ago

Related Info

Related lines of code (note that the dask config is set to threads): https://github.com/E3SM-Project/e3sm_to_cmip/blob/56b2d40b928f7f6fdc59ef6709813bfc05ba862b/e3sm_to_cmip/mpas.py#L306-L325

From @TonyB9000 email, 3/13/24 at 1:07PM:

Hangs here:

2024-03-12 23:46:18,545 [INFO]: siv.py(handle:48) >> Starting siv
2024-03-12 23:46:18,545 [INFO]: siv.py(handle:48) >> Starting siv
2024-03-12 23:46:18,545_545:INFO:handle:Starting siv
2024-03-12 23:47:29,946 [INFO]: siv.py(handle:72) >> Calling mpas.remap for siv
2024-03-12 23:47:29,946 [INFO]: siv.py(handle:72) >> Calling mpas.remap for siv
2024-03-12 23:47:29,946_946:INFO:handle:Calling mpas.remap for siv
2024-03-12 23:47:29,947 [INFO]: mpas.py(remap:83) >> DBG: mpas.py: entered remap()
2024-03-12 23:47:29,947 [INFO]: mpas.py(remap:83) >> DBG: mpas.py: entered remap()
2024-03-12 23:47:29,947_947:INFO:remap:DBG: mpas.py: entered remap()
2024-03-12 23:47:29,952 [INFO]: mpas.py(remap:93) >> DBG: mpas.py: remap() calling write_netcdf()
2024-03-12 23:47:29,952 [INFO]: mpas.py(remap:93) >> DBG: mpas.py: remap() calling write_netcdf()
2024-03-12 23:47:29,952_952:INFO:remap:DBG: mpas.py: remap() calling write_netcdf()

Googling “xarray dataset to_netcdf hangs randomly” leads to

https://github.com/pydata/xarray/issues/4710

“Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.”

and then:

https://github.com/pydata/xarray/issues/3961

import os
import uuid

import xarray as xr

# If you set lock=False then this runs fine every time.
# Setting lock=None causes it to intermittently hang on mfd.to_netcdf
with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
    p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
    print(f"Writing data to {p}")
    mfd.to_netcdf(p)
    print("complete")

If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However if I set lock=False then it runs fine every time.
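Because the hang is intermittent, it can help to run each attempt in its own process with a time limit, so a stalled `to_netcdf` shows up as a timeout instead of blocking forever. The following is a minimal sketch using only the standard library. The helper names `attempt_completes` and `count_hangs` and the timeout values are hypothetical, not part of xarray or e3sm_to_cmip; the `script` argument would be the reproduction snippet above, passed as a string.

```python
import subprocess
import sys


def attempt_completes(script: str, timeout_s: float) -> bool:
    """Run `script` in a fresh Python process; return False if it times out."""
    try:
        # check=True lets a genuine failure (non-zero exit) raise
        # CalledProcessError instead of being mistaken for success.
        subprocess.run([sys.executable, "-c", script], check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child once the limit is exceeded;
        # treat that attempt as a hang.
        return False
    return True


def count_hangs(script: str, runs: int, timeout_s: float) -> int:
    """Repeat the script `runs` times and count how many attempts timed out."""
    return sum(not attempt_completes(script, timeout_s) for _ in range(runs))
```

Comparing `count_hangs` for the same script with `lock=None` and with `lock=False` would give a rough measure of whether the setting changes the hang rate on a given machine.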

It seems related to a discussion about whether HDF5 is thread-safe and, correspondingly, whether locking is necessary.

Many claim that explicitly setting `lock=False` will work. An occasional error may be thrown (better than hanging forever), and some mitigate this by adding a one-second sleep somewhere (but that could add hours to the processing of each dataset).
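A rough sketch of what the proposed change could look like follows. It assumes `lock` is forwarded by `xr.open_mfdataset()` to the backend as a keyword argument, as in the xarray snippet quoted above. The helpers `open_unlocked` and `call_with_retry` are hypothetical names, not the actual e3sm_to_cmip implementation, and the `opener` parameter exists only so the wrapper can be exercised without real NetCDF files. The retry helper sleeps only after a failure, which avoids the cost of an unconditional sleep on every write.

```python
import time


def open_unlocked(paths, opener=None, **kwargs):
    """Open files with lock=False; `opener` defaults to xarray.open_mfdataset."""
    if opener is None:
        # Imported lazily so the helper can be tested with a stand-in opener.
        import xarray as xr

        opener = xr.open_mfdataset
    # Force lock=False regardless of what the caller passed.
    kwargs["lock"] = False
    return opener(paths, **kwargs)


def call_with_retry(fn, attempts=3, delay_s=1.0):
    """Call fn(); on an exception, sleep and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            # Assumption: the occasional error is transient. The broad
            # except is illustrative; a real fix would catch a narrower type.
            if attempt == attempts:
                raise
            time.sleep(delay_s)
```

A hypothetical call site would be `ds = open_unlocked(file_names, combine="by_coords")` followed by `call_with_retry(lambda: ds.to_netcdf(out_path))`, where `file_names` and `out_path` are placeholders.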