E3SM-Project / e3sm_to_cmip

Tools to CMORize E3SM output
https://e3sm-to-cmip.readthedocs.io/en/latest/
MIT License

[Bug]: Intermittent hang in "to_netcdf" call from write_netcdf. #250

Closed TonyB9000 closed 6 months ago

TonyB9000 commented 8 months ago

What happened?

Occasional hang-forever on "to_netcdf". No errors or exceptions raised.

What did you expect to happen? Are there any possible answers you came across?

Similar behaviors have been noted in:

https://github.com/pydata/xarray/issues/4710

“Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.”

and then:

https://github.com/pydata/xarray/issues/3961

# If you set lock=False then this runs fine every time.
# Setting lock=None causes it to intermittently hang on mfd.to_netcdf
with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
     p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
     print(f"Writing data to {p}")
     mfd.to_netcdf(p)
     print("complete")

If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However if I set lock=False then it runs fine every time.

It seems related to a discussion about whether HDF5 is thread-safe and, correspondingly, whether locking is necessary.

Minimal Complete Verifiable Example (MVCE)

Good luck with that.

Relevant log output

(duplicates manually elided for ease of reading):

2024-03-13 18:05:11,989 [INFO]: siv.py(handle:48) >> Starting siv
2024-03-13 18:05:11,989_989:INFO:handle:Starting siv
2024-03-13 18:06:25,504 [INFO]: mpas.py(remap:83) >> DBG: mpas.py: entered remap()
2024-03-13 18:06:25,504_504:INFO:remap:DBG: mpas.py: entered remap()
2024-03-13 18:06:25,508 [INFO]: mpas.py(remap:93) >> DBG: mpas.py: remap() calling write_netcdf()
2024-03-13 18:06:25,508_508:INFO:remap:DBG: mpas.py: remap() calling write_netcdf()
2024-03-13 18:06:25,509 [INFO]: mpas.py(write_netcdf:358) >> DBG: mpas.py: write_netcdf() entered.
2024-03-13 18:06:25,509_509:INFO:write_netcdf:DBG: mpas.py: write_netcdf() entered.
2024-03-13 18:06:25,509 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable timeMonthly_avg_iceAreaCell.
2024-03-13 18:06:25,509_509:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable timeMonthly_avg_iceAreaCell.
2024-03-13 18:06:25,509 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,509_509:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable siv.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable siv.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable time_bnds.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable time_bnds.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable time.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable time.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,511 [INFO]: mpas.py(write_netcdf:374) >> DBG: mpas.py: write_netcdf() calling update_history().
2024-03-13 18:06:25,511_511:INFO:write_netcdf:DBG: mpas.py: write_netcdf() calling update_history().
2024-03-13 18:06:25,511 [INFO]: mpas.py(write_netcdf:377) >> DBG: mpas.py: write_netcdf() calling to_netcdf() (fileName = /p/user_pub/e3sm/bartoletti1/tmp/jkv4m51w/tmp2roov6aa, encodingDict = {'timeMonthly_avg_iceAreaCell': {'_FillValue': 9.969209968386869e+36}, 'siv': {'_FillValue': 9.969209968386869e+36}, 'time_bnds': {'_FillValue': 9.969209968386869e+36}, 'time': {'_FillValue': 9.969209968386869e+36}}, unlimited = time).
2024-03-13 18:06:25,511_511:INFO:write_netcdf:DBG: mpas.py: write_netcdf() calling to_netcdf() (fileName = /p/user_pub/e3sm/bartoletti1/tmp/jkv4m51w/tmp2roov6aa, encodingDict = {'timeMonthly_avg_iceAreaCell': {'_FillValue': 9.969209968386869e+36}, 'siv': {'_FillValue': 9.969209968386869e+36}, 'time_bnds': {'_FillValue': 9.969209968386869e+36}, 'time': {'_FillValue': 9.969209968386869e+36}}, unlimited = time).

Anything else we need to know?

The salient history of local discussion, oldest to newest:

Mar 13, 1:07 PM

Having instrumented remap, the last output lines on a hang are:

2024-03-12 23:46:18,545 [INFO]: siv.py(handle:48) >> Starting siv
2024-03-12 23:46:18,545_545:INFO:handle:Starting siv
2024-03-12 23:47:29,946 [INFO]: siv.py(handle:72) >> Calling mpas.remap for siv
2024-03-12 23:47:29,946_946:INFO:handle:Calling mpas.remap for siv
2024-03-12 23:47:29,947 [INFO]: mpas.py(remap:83) >> DBG: mpas.py: entered remap()
2024-03-12 23:47:29,947_947:INFO:remap:DBG: mpas.py: entered remap()
2024-03-12 23:47:29,952 [INFO]: mpas.py(remap:93) >> DBG: mpas.py: remap() calling write_netcdf()
2024-03-12 23:47:29,952_952:INFO:remap:DBG: mpas.py: remap() calling write_netcdf()

We never return from “write_netcdf()”.

Fully instrumenting “write_netcdf()” and incurring another hang, the tail of the output log reads:

2024-03-13 17:20:43,551 [INFO]: siv.py(handle:48) >> Starting siv
2024-03-13 17:20:43,551_551:INFO:handle:Starting siv
2024-03-13 17:21:55,026 [INFO]: siv.py(handle:72) >> Calling mpas.remap for siv
2024-03-13 17:21:55,026_026:INFO:handle:Calling mpas.remap for siv
2024-03-13 17:21:55,027 [INFO]: mpas.py(remap:83) >> DBG: mpas.py: entered remap()
2024-03-13 17:21:55,027_027:INFO:remap:DBG: mpas.py: entered remap()
2024-03-13 17:21:55,030 [INFO]: mpas.py(remap:93) >> DBG: mpas.py: remap() calling write_netcdf()
2024-03-13 17:21:55,030_030:INFO:remap:DBG: mpas.py: remap() calling write_netcdf()
2024-03-13 17:21:55,030 [INFO]: mpas.py(write_netcdf:358) >> DBG: mpas.py: write_netcdf() entered.
2024-03-13 17:21:55,030_030:INFO:write_netcdf:DBG: mpas.py: write_netcdf() entered.
2024-03-13 17:21:55,031 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable timeMonthly_avg_iceAreaCell.
2024-03-13 17:21:55,031_031:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable timeMonthly_avg_iceAreaCell.
2024-03-13 17:21:55,031 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable siv.
2024-03-13 17:21:55,031_031:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable siv.
2024-03-13 17:21:55,031 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable time_bnds.
2024-03-13 17:21:55,031_031:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable time_bnds.
2024-03-13 17:21:55,032 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable time.
2024-03-13 17:21:55,032_032:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable time.
2024-03-13 17:21:55,032 [INFO]: mpas.py(write_netcdf:373) >> DBG: mpas.py: write_netcdf() calling update_history().
2024-03-13 17:21:55,032_032:INFO:write_netcdf:DBG: mpas.py: write_netcdf() calling update_history().
2024-03-13 17:21:55,032 [INFO]: mpas.py(write_netcdf:376) >> DBG: mpas.py: write_netcdf() calling to_netcdf() (unlimited = time).
2024-03-13 17:21:55,032_032:INFO:write_netcdf:DBG: mpas.py: write_netcdf() calling to_netcdf() (unlimited = time).

So . . . we are hanging in “to_netcdf()”, a method of the xarray.Dataset object. Since I cannot dig any deeper there, I re-ran this (it hung at a different decade, but with the same behavior) and gleaned more information about the failed/hung call:

(duplicates manually elided for ease of reading):

2024-03-13 18:05:11,989 [INFO]: siv.py(handle:48) >> Starting siv
2024-03-13 18:05:11,989_989:INFO:handle:Starting siv
2024-03-13 18:06:25,504 [INFO]: mpas.py(remap:83) >> DBG: mpas.py: entered remap()
2024-03-13 18:06:25,504_504:INFO:remap:DBG: mpas.py: entered remap()
2024-03-13 18:06:25,508 [INFO]: mpas.py(remap:93) >> DBG: mpas.py: remap() calling write_netcdf()
2024-03-13 18:06:25,508_508:INFO:remap:DBG: mpas.py: remap() calling write_netcdf()
2024-03-13 18:06:25,509 [INFO]: mpas.py(write_netcdf:358) >> DBG: mpas.py: write_netcdf() entered.
2024-03-13 18:06:25,509_509:INFO:write_netcdf:DBG: mpas.py: write_netcdf() entered.
2024-03-13 18:06:25,509 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable timeMonthly_avg_iceAreaCell.
2024-03-13 18:06:25,509_509:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable timeMonthly_avg_iceAreaCell.
2024-03-13 18:06:25,509 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,509_509:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable siv.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable siv.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable time_bnds.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable time_bnds.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:362) >> DBG: mpas.py: write_netcdf() processing variable time.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() processing variable time.
2024-03-13 18:06:25,510 [INFO]: mpas.py(write_netcdf:364) >> DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,510_510:INFO:write_netcdf:DBG: mpas.py: write_netcdf() isNumeric = True.
2024-03-13 18:06:25,511 [INFO]: mpas.py(write_netcdf:374) >> DBG: mpas.py: write_netcdf() calling update_history().
2024-03-13 18:06:25,511_511:INFO:write_netcdf:DBG: mpas.py: write_netcdf() calling update_history().
2024-03-13 18:06:25,511 [INFO]: mpas.py(write_netcdf:377) >> DBG: mpas.py: write_netcdf() calling to_netcdf() (fileName = /p/user_pub/e3sm/bartoletti1/tmp/jkv4m51w/tmp2roov6aa, encodingDict = {'timeMonthly_avg_iceAreaCell': {'_FillValue': 9.969209968386869e+36}, 'siv': {'_FillValue': 9.969209968386869e+36}, 'time_bnds': {'_FillValue': 9.969209968386869e+36}, 'time': {'_FillValue': 9.969209968386869e+36}}, unlimited = time).
2024-03-13 18:06:25,511_511:INFO:write_netcdf:DBG: mpas.py: write_netcdf() calling to_netcdf() (fileName = /p/user_pub/e3sm/bartoletti1/tmp/jkv4m51w/tmp2roov6aa, encodingDict = {'timeMonthly_avg_iceAreaCell': {'_FillValue': 9.969209968386869e+36}, 'siv': {'_FillValue': 9.969209968386869e+36}, 'time_bnds': {'_FillValue': 9.969209968386869e+36}, 'time': {'_FillValue': 9.969209968386869e+36}}, unlimited = time).

Googling “xarray dataset to_netcdf hangs randomly” leads to 

https://github.com/pydata/xarray/issues/4710

“Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.”

and then:

https://github.com/pydata/xarray/issues/3961

# If you set lock=False then this runs fine every time.
# Setting lock=None causes it to intermittently hang on mfd.to_netcdf
with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
     p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
     print(f"Writing data to {p}")
     mfd.to_netcdf(p)
     print("complete")

If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However if I set lock=False then it runs fine every time.

It seems related to a discussion about whether HDF5 is thread-safe and, correspondingly, whether locking is necessary.

Many claim that explicitly setting “lock=False” will work. There may be an occasional error thrown (better than hanging forever), and some mitigate by adding a roughly one-second sleep somewhere (but that could add hours to the processing of each dataset).
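
For concreteness, those mitigations would look roughly like this in a bare xarray round trip (a sketch only, with made-up file names and a hypothetical “jobs” list; this is not code from e3sm_to_cmip):

import time
import xarray

# Hypothetical work list: (input files, output path) for each dataset.
jobs = [(["ts.0001.nc", "ts.0002.nc"], "out.0001-0002.nc")]

for fileNames, outPath in jobs:
    # lock=False disables the HDF5 lock that xarray would otherwise choose for
    # the active dask scheduler (the default, lock=None, picks one automatically).
    with xarray.open_mfdataset(fileNames, combine="by_coords", lock=False) as ds:
        ds.to_netcdf(outPath)
    time.sleep(1)  # the “sleep ~1 sec” workaround; over many datasets this adds up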

In any case, here are our calls to  “open_mfdataset”, which make no mention of “lock” at all:

cmor_handlers/mpas_vars/fsitherm.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/hfds.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/hfsifrazil.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/masscello.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/masso.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/mlotst.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/msftmz.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/pbo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/pbo.py:    with xarray.open_mfdataset(pslFileNames, concat_dim='time') as dsIn:
cmor_handlers/mpas_vars/pso.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/pso.py:    with xarray.open_mfdataset(pslFileNames, concat_dim='time') as dsIn:
cmor_handlers/mpas_vars/sfdsi.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/siconc.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/simass.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sisnmass.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sisnthick.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sitemptop.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sithick.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sitimefrac.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/siu.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/siv.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sob.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/soga.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/so.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sosga.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/sos.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/tauuo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/tauvo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/thetaoga.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/thetao.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/thkcello.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/tob.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/tosga.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/tos.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/uo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/volcello.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/volo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/vo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/wfo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/wo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/zhalfo.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:
cmor_handlers/mpas_vars/zos.py:    with mpas.open_mfdataset(timeSeriesFiles, variableList) as dsIn:

and in handler.py  (_get_mfdataset)

        ds = xc.open_mfdataset(
            all_filepaths,
            add_bounds=["X", "Y"],
            decode_times=False,
            combine="nested",
            data_vars="minimal",
            coords="minimal",
        )

and in mpas.py:

def open_mfdataset(
    fileNames, variableList=None, chunks={"nCells": 32768, "Time": 6}, daskThreads=6
):

    ds = xarray.open_mfdataset(
        fileNames,
        combine="nested",
        decode_cf=False,
        decode_times=False,
        concat_dim="Time",
        mask_and_scale=False,
        chunks=chunks,
    )

and in “remap_seaice_sgs()”:
    ds_out_all = xarray.open_mfdataset(
        f"{outFilePath}/temp_out*nc", data_vars="minimal", decode_times=False
    )

and in “compute_moc_streamfunction()”:
        dsIn = open_mfdataset(args.inFileNames, variableList)

I am just guessing that if “lock” is not mentioned, it defaults to “None” … 
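
Spelled out with those defaults made explicit (just for reference, not a proposed change), that call in mpas.py should be equivalent to:

    ds = xarray.open_mfdataset(
        fileNames,
        combine="nested",
        decode_cf=False,
        decode_times=False,
        concat_dim="Time",
        mask_and_scale=False,
        chunks=chunks,
        parallel=False,  # xarray default: open the input files serially
        lock=None,       # xarray default: choose a lock suited to the active dask scheduler
    )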

I still have some 6 (historical.r5) SImon datasets to run (and a few hung Omon).  I may edit them to add “lock=False” to their handlers and see what happens . . .
From: Jill Chengzhu Zhang <[zhang40@llnl.gov](mailto:zhang40@llnl.gov)>
Date: Wednesday, March 13, 2024 at 1:46 PM
To: Tony Bartoletti <[bartoletti1@llnl.gov](mailto:bartoletti1@llnl.gov)>, Tom Vo <[vo13@llnl.gov](mailto:vo13@llnl.gov)>
Subject: Re: v2_1 cmorizing issues with SImon (and Omon)

Hi Tony, 

Great work to get to the bottom of the issue. In the fix Tom implemented: https://github.com/E3SM-Project/e3sm_to_cmip/issues/246
With parallel=False in open_mfdataset(), this makes sure that only one thread is used for reading in the data files. However, this fix is only applied to atmosphere variables, because those handlers are opened with xcdat, which has parallel=True as the default. The individual ocean and sea-ice variable handlers, on the other hand, use xarray.open_mfdataset(), where the default for parallel is False, while the default for lock is None.

Tom, maybe you could help us understand this better… In any case, it seems worthwhile to try lock=False to see if it can solve the hang issue here.

Thanks,
Jill
Hi Tony and Jill,

RE: However, this fix is only applied to atmosphere variables because those handlers are opened with xcdat which has parallel = True as default.

xCDAT’s `open_mfdataset()` keeps the default Xarray value of `parallel=False`.

RE: In any case, it seems worthwhile to try lock=False to see if it can solve the hang issue here.

Sure, I can push a commit that updates `open_mfdataset()` with `lock=False`. I’ll let Tony know when this is done (maybe tomorrow). Then he can test using the latest version of e3sm_to_cmip `main`.

Best,
Tom
From: Vo, Tom <[vo13@llnl.gov](mailto:vo13@llnl.gov)> 
Sent: Wednesday, March 13, 2024 3:33 PM
To: Zhang, Chengzhu Jill <[zhang40@llnl.gov](mailto:zhang40@llnl.gov)>; Bartoletti, Tony <[bartoletti1@llnl.gov](mailto:bartoletti1@llnl.gov)>
Subject: Re: v2_1 cmorizing issues with SImon (and Omon)

I made a one-line change in this [PR](https://github.com/E3SM-Project/e3sm_to_cmip/pull/249) and merged it to e3sm_to_cmip’s `master` branch.

Tony – if you can `pip install .` this version into your datasm env to try it out, that would be great. If you confirm this fixes the netCDF write hanging, I will make another e3sm_to_cmip v1.11.2 release candidate (rc3).

Best,
Tom
From: "Bartoletti, Tony" <[bartoletti1@llnl.gov](mailto:bartoletti1@llnl.gov)>
Date: Thursday, March 14, 2024 at 9:12 AM
To: "Vo, Tom" <[vo13@llnl.gov](mailto:vo13@llnl.gov)>, "[zhang40@llnl.gov](mailto:zhang40@llnl.gov)" <[zhang40@llnl.gov](mailto:zhang40@llnl.gov)>
Subject: Open_mfdataset not good.

Hi Tom, and Jill,

I tried making that one change to def open_mfdataset() in mpas.py, adding “lock=False” to the xarray.open_mfdataset call:

    ds = xarray.open_mfdataset(
        fileNames,
        combine="nested",
        decode_cf=False,
        decode_times=False,
        concat_dim="Time",
        mask_and_scale=False,
        chunks=chunks,
        lock=False,     
    )

But this led to total destruction as soon as “to_netcdf” was called:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 1:
  #000: H5D.c line 611 in H5Dget_space(): unable to synchronously get dataspace
    major: Dataset
    minor: Can't get value
  #001: H5D.c line 571 in H5D__get_space_api_common(): invalid dataset identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 829, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/siv.py", line 71, in handle
    ds = mpas.remap(ds, 'mpasseaice', mappingFileName)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 94, in remap
    write_netcdf(ds, inFileName, unlimited="time")
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 380, in write_netcdf
    ds.to_netcdf(fileName, encoding=encodingDict, unlimited_dims=unlimited)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/dataset.py", line 2298, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/backends/api.py", line 1348, in to_netcdf
    writes = writer.sync(compute=compute)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/backends/common.py", line 297, in sync
    delayed_store = chunkmanager.store(
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py", line 249, in store
    return store(
           ^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/array/core.py", line 1236, in store
    compute_as_if_collection(Array, store_dsk, map_keys, **kwargs)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/base.py", line 402, in compute_as_if_collection
    return schedule(dsk2, keys, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/threaded.py", line 90, in get
    results = get_async(
              ^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/local.py", line 512, in get_async
    raise_exception(exc, tb)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/local.py", line 320, in reraise
    raise exc
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/local.py", line 225, in execute_task
    result = _execute_task(task, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/core.py", line 127, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/dask/array/core.py", line 125, in getter
    c = np.asarray(c)
        ^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/indexing.py", line 484, in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/indexing.py", line 487, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/indexing.py", line 664, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/indexing.py", line 551, in get_duck_array
    array = self.array[self.key]
            ~~~~~~~~~~^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 101, in __getitem__
    return indexing.explicit_indexing_adapter(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/indexing.py", line 858, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 114, in _getitem
    array = getitem(original_array, key)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/netCDF4/_netCDF4.pyx", line 4972, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 5930, in netCDF4._netCDF4.Variable._get
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
Resource temporarily unavailable
[-] Error running handlers: siv
Error 11: Resource temporarily unavailable

Note, however, that in “remap_seaice_sgs()”, we are calling xarray.open_mfdataset directly:

    ds_out_all = xarray.open_mfdataset(
        f"{outFilePath}/temp_out*nc", data_vars="minimal", decode_times=False
    )

Should this be “mpas.open_mfdataset”, or should this also get “lock=False” separately?
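
(If the latter, the change would presumably be the same one-liner, sketched here but untested:)

    ds_out_all = xarray.open_mfdataset(
        f"{outFilePath}/temp_out*nc", data_vars="minimal", decode_times=False,
        lock=False,
    )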

Also Note: in “cmor_handlers/handler.py”, we call

        ds = xc.open_mfdataset(
            all_filepaths,
            add_bounds=["X", "Y"],
            decode_times=False,
            combine="nested",
            data_vars="minimal",
            coords="minimal",
        )

where “xc” comes from “import xcdat as xc”.

So, we have 3 competing instances of “open_mfdataset”:

•   One raw call to xarray.open_mfdataset (in remap_seaice_sgs)
•   One inherited from xcdat (in handler.py)
•   All others inherit “lock=False” from mpas.py’s open_mfdataset()

That last one includes the individual variable handlers, each of which calls “mpas.open_mfdataset”.

Any ideas about this?  I can always experiment … 
From: Jill Chengzhu Zhang <[zhang40@llnl.gov](mailto:zhang40@llnl.gov)>
Date: Thursday, March 14, 2024 at 9:22 AM
To: Tony Bartoletti <[bartoletti1@llnl.gov](mailto:bartoletti1@llnl.gov)>, Tom Vo <[vo13@llnl.gov](mailto:vo13@llnl.gov)>
Subject: Re: Open_mfdataset not good.

Oh, I didn’t catch this. It seems like we should first try adding “lock=False” when using xarray.open_mfdataset() in remap_seaice_sgs().
And I don’t understand the error with “lock=False” in the mpas handlers…
from Tom, Mar 14 at 9:43 AM:

Can you open a GitHub issue related to this issue and move all the info from this email and the other email there? That way there is a single, traceable thread to make it easier to debug using GitHub. I’ll respond here for now.

This is the most relevant section of the stack trace related to code directly in e3sm_to_cmip:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 1:
  #000: H5D.c line 611 in H5Dget_space(): unable to synchronously get dataspace
    major: Dataset
    minor: Can't get value
  #001: H5D.c line 571 in H5D__get_space_api_common(): invalid dataset identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 829, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/[siv.py](https://github.com/E3SM-Project/e3sm_to_cmip/blob/0f91fb4abc839e2ec04fe8e770ad7308a41c6620/e3sm_to_cmip/cmor_handlers/mpas_vars/siv.py#L60-L71)", line 71, in handle
    ds = mpas.remap(ds, 'mpasseaice', mappingFileName)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/[mpas.py](https://github.com/E3SM-Project/e3sm_to_cmip/blob/0f91fb4abc839e2ec04fe8e770ad7308a41c6620/e3sm_to_cmip/mpas.py#L77-L87)", line 94, in remap
    write_netcdf(ds, inFileName, unlimited="time")
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 380, in write_netcdf
    ds.to_netcdf(fileName, encoding=encodingDict, unlimited_dims=unlimited)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2/lib/python3.12/site-packages/xarray/core/dataset.py", line 2298, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(

Notice the call is to mpas.remap() + write_netcdf(), not mpas.remap_seaice_sgs(). Here are the related lines of code in order of the stack trace:
1.  siv.py 
2.  mpas.py
3.  Then it breaks

Related GitHub issue (notice the similar HDF5-DIAG error). The `lock` argument was added through a PR stemming from this GitHub issue:
•   [File written by to_netcdf() not closed when Dataset is generated from dask delayed object using a dask Client() #2376](https://github.com/pydata/xarray/issues/2376) 

`lock` arg definition in xarray [docs](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset):
•   ‘lock’: resource lock to use when reading data from disk. Only relevant when using dask or another form of parallelism. By default, appropriate locks are chosen to safely read and write files with the currently active dask scheduler. Supported by “netcdf4”, “h5netcdf”, “scipy”, “pynio”.

Thanks,
Tom
Tony adds: Mar 14, 10:17 AM:

That last stack trace does not include "remap_seaice_sgs", true.  But (until the hang occurs) it is regularly called when processing sea-ice (SImon) variables:

2024-03-13 17:10:29,306 [INFO]: mpas.py(remap_seaice_sgs:45) >> DBG: mpas.py: entered remap_seaice_sgs
2024-03-13 17:10:29,306_306:INFO:remap_seaice_sgs:DBG: mpas.py: entered remap_seaice_sgs
2024-03-13 17:10:29,306 [INFO]: mpas.py(remap_seaice_sgs:47) >> DBG: mpas.py:   got os.environ.copy
2024-03-13 17:10:29,306_306:INFO:remap_seaice_sgs:DBG: mpas.py:   got os.environ.copy
2024-03-13 17:10:29,368 [INFO]: mpas.py(remap_seaice_sgs:51) >> DBG: mpas.py:   made dir: /p/user_pub/e3sm/bartoletti1/tmp/s3l8ndri/tmpxb7cz3uusub
2024-03-13 17:10:29,368_368:INFO:remap_seaice_sgs:DBG: mpas.py:   made dir: /p/user_pub/e3sm/bartoletti1/tmp/s3l8ndri/tmpxb7cz3uusub
2024-03-13 17:16:26,420 [INFO]: mpas.py(remap_seaice_sgs:45) >> DBG: mpas.py: entered remap_seaice_sgs
2024-03-13 17:16:26,420_420:INFO:remap_seaice_sgs:DBG: mpas.py: entered remap_seaice_sgs
2024-03-13 17:16:26,420 [INFO]: mpas.py(remap_seaice_sgs:47) >> DBG: mpas.py:   got os.environ.copy
2024-03-13 17:16:26,420_420:INFO:remap_seaice_sgs:DBG: mpas.py:   got os.environ.copy
2024-03-13 17:16:26,470 [INFO]: mpas.py(remap_seaice_sgs:51) >> DBG: mpas.py:   made dir: /p/user_pub/e3sm/bartoletti1/tmp/f_kl_ie3/tmpq0x2znvcsub
2024-03-13 17:16:26,470_470:INFO:remap_seaice_sgs:DBG: mpas.py:   made dir: /p/user_pub/e3sm/bartoletti1/tmp/f_kl_ie3/tmpq0x2znvcsub

etc. So I was concerned about the variability in “open_mfdataset” calls in general. To reiterate:

we have 3 competing instances of “open_mfdataset”:

•   One raw call to xarray.open_mfdataset (in remap_seaice_sgs)
•   One inherited from xcdat (in handler.py)
•   All others inherit “lock=False” from mpas.py’s open_mfdataset()

That last one includes the individual variable handlers, each of which calls “mpas.open_mfdataset”.

Granted, the precise calling sequence among these is still a mystery to me . . .

Environment

      mamba version : 1.5.7
 active environment : dsm_prod_e2c_1.11.2rc2
active env location : /home/bartoletti1/mambaforge/envs/dsm_prod_e2c_1.11.2rc2
        shell level : 2
   user config file : /home/bartoletti1/.condarc

populated config files : /home/bartoletti1/mambaforge/.condarc
         conda version : 24.1.2
   conda-build version : not installed
        python version : 3.10.6.final.0
                solver : libmamba (default)
      virtual packages : archspec=1=broadwell
                         conda=24.1.2=0
                         glibc=2.17=0
                         linux=3.10.0=0
                         __unix=0=0
      base environment : /home/bartoletti1/mambaforge (writable)
     conda av data dir : /home/bartoletti1/mambaforge/etc/conda
 conda av metadata url : None
          channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                         https://conda.anaconda.org/conda-forge/noarch
         package cache : /home/bartoletti1/mambaforge/pkgs
                         /home/bartoletti1/.conda/pkgs
      envs directories : /home/bartoletti1/mambaforge/envs
                         /home/bartoletti1/.conda/envs
              platform : linux-64
            user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.6 Linux/3.10.0-1160.108.1.el7.x86_64 rhel/7.9 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
               UID:GID : 61843:4061
            netrc file : None
          offline mode : False

tomvothecoder commented 8 months ago

Things to try next:

TonyB9000 commented 7 months ago

@tomvothecoder

Given a file containing one or more CMIP6 dataset_ids, the script:

/home/bartoletti1/gitrepo/datasm/datasm/scripts/dsm_generate_CMIP6_dsid_list_2.sh

will take the positional parameters "WORK ", and for each given dataset_id, will produce a local subordinate script

tmp/<case_id>/scripts/<dataset_id>-Generate_CMIP6.sh

that is the "minimum script" to conduct end-to-end ([NCO] + e3sm_to_cmip) processing.

NOTE: The parent script (dsm_generate_CMIP6_dsid_list_2.sh) requires that you have

export DSM_GETPATH=/p/user_pub/e3sm/staging/Relocation/.dsm_get_root_path.sh

in your .bashrc, so that the various datasm tools can garner the dataset_id-specific parameters (data location, proper mapfile, etc.). But the resulting subordinate script contains the fully-expressed command lines (with ALL paths fully-qualified) and can be run from anywhere. The results will be placed into various subdirectories under tmp/<case_id>.

ADDENDUM: If you substitute "TEST" for "WORK" as the first parameter, only 1 year of data will be processed, and the resulting output files will not be moved into the warehouse (they will remain in the tmp//rgr and product directories).

Near the top of the script, you can set "dryrun=1", in which case the generated subordinate scripts will not be run, only produced and ready to run.