E3SM-Project / e3sm_to_cmip

Tools to CMORize E3SM output
https://e3sm-to-cmip.readthedocs.io/en/latest/
MIT License

[Bug]: e2c command for SImon process leads to xarray fault. #255

Open TonyB9000 opened 7 months ago

TonyB9000 commented 7 months ago

What happened?

The e3sm_to_cmip command leads to a cascade of xarray errors, beginning with the ds.compute() call in cmor_handlers/mpas_vars/siv.py.

What did you expect to happen? Are there any possible answers you came across?

n/a

Minimal Complete Verifiable Example (MVCE)

CLI command

e3sm_to_cmip -v siv -u /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/metadata/historical_r4i1p1f1.json -t /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables -o /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/product -i /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data -s --realm SImon --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc >> Local_log 2>&1 &

Python script

from e3sm_to_cmip.__main__ import E3SMtoCMIP

args = [
    "-v",
    "siv",
    "-u",
    "/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/metadata/historical_r4i1p1f1.json",
    "-t",
    "/p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables",
    "-o",
    "../qa/tmp",
    "-i",
    "/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data",
    "-s",
    "--realm",
    "SImon",
    "--map",
    "/p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc",
    "--serial"
]

run = E3SMtoCMIP(args)
run.run()

Relevant log output

2024-04-08 21:28:24,229 [INFO]: siv.py(handle:48) >> Starting siv
2024-04-08 21:28:24,229_229:INFO:handle:Starting siv
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 834, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/siv.py", line 67, in handle
    ds.compute()
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/dataset.py", line 1013, in compute
    return new.load(**kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/dataset.py", line 845, in load
    evaluated_data: tuple[np.ndarray[Any, Any], ...] = chunkmanager.compute(
                                                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py", line 86, in compute
    return compute(*data, **kwargs)  # type: ignore[no-untyped-call, no-any-return]
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/indexing.py", line 484, in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/indexing.py", line 487, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/indexing.py", line 664, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/indexing.py", line 551, in get_duck_array
    array = self.array[self.key]
            ~~~~~~~~~~^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 101, in __getitem__
    return indexing.explicit_indexing_adapter(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/core/indexing.py", line 858, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 114, in _getitem
    array = getitem(original_array, key)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/netCDF4/_netCDF4.pyx", line 4972, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 5930, in netCDF4._netCDF4.Variable._get
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
Resource temporarily unavailable
[-] Error running handlers: siv
Error 11: Resource temporarily unavailable
  0%|          | 0/1 [16:16<?, ?it/s]

Anything else we need to know?

The input directory (/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data) contains symlinks to the actual native data, as well as to the region file:

EC30to60E2r2_mocBasinsAndTransects20210623.nc (/p/user_pub/e3sm/staging/resource/maps/EC30to60E2r2_mocBasinsAndTransects20210623.nc)

and to the namefile:

mpassi_in (/p/user_pub/e3sm/warehouse/E3SM/2_1/historical/LR/sea-ice/native/namefile/fixed/ens4/v0/mpassi_in)

as is required for MPAS cmorizing.

Environment

 active environment : dsm_prod_inst_e2c_local
active env location : /home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local
        shell level : 4
   user config file : /home/bartoletti1/.condarc

populated config files : /home/bartoletti1/mambaforge/.condarc
         conda version : 24.1.2
   conda-build version : not installed
        python version : 3.10.6.final.0
                solver : libmamba (default)
      virtual packages : archspec=1=broadwell
                         conda=24.1.2=0
                         glibc=2.17=0
                         linux=3.10.0=0
                         __unix=0=0
      base environment : /home/bartoletti1/mambaforge (writable)
     conda av data dir : /home/bartoletti1/mambaforge/etc/conda
 conda av metadata url : None
          channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                         https://conda.anaconda.org/conda-forge/noarch
         package cache : /home/bartoletti1/mambaforge/pkgs
                         /home/bartoletti1/.conda/pkgs
      envs directories : /home/bartoletti1/mambaforge/envs
                         /home/bartoletti1/.conda/envs
              platform : linux-64
            user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.6 Linux/3.10.0-1160.108.1.el7.x86_64 rhel/7.9 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
               UID:GID : 61843:4061
            netrc file : None
          offline mode : False

TonyB9000 commented 7 months ago

All MPAS processing is generating these errors (Omon and SImon). A summary:

Compressed stack traces:

    Code path prefix: /home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/

TEST: CMIP6.CMIP.E3SM-Project.E3SM-2-1.piControl.r1i1p1f1.SImon.siv.gr

    2024-04-16 20:26:22,381_381:INFO:handle:Starting siv
e3sm_to_cmip/__main__.py", line 834, in _run_serial      [ name = handler_method( ]
e3sm_to_cmip/cmor_handlers/mpas_vars/siv.py", line 65, in handle [ ds = mpas.add_time(ds, dsIn) ]
e3sm_to_cmip/mpas.py", line 161, in add_time             [ xtimeEnd = ["".join(x.astype("U")).strip() for x in xtimeEnd.values] ]
xarray/core/dataarray.py", line 770, in values           [ return self.variable.values ]
xarray/core/variable.py", line 508, in values            [ return _as_array_or_item(self._data) ]
xarray/core/variable.py", line 306, in _as_array_or_item [ data = np.asarray(data) ]
dask/array/core.py", line 1700, in __array__             [ x = self.compute() ]
dask/base.py", line 375, in compute                      [ (result,) = compute(self, traverse=False, **kwargs) ]
dask/base.py", line 661, in compute                      [ results = schedule(dsk, keys, **kwargs) ]
xarray/core/indexing.py", line 484, in __array__         [ return np.asarray(self.get_duck_array(), dtype=dtype) ]
xarray/core/indexing.py", line 487, in get_duck_array    [ return self.array.get_duck_array() ]
xarray/core/indexing.py", line 664, in get_duck_array    [ return self.array.get_duck_array() ]
xarray/core/indexing.py", line 551, in get_duck_array    [ array = self.array[self.key] ]
xarray/backends/netCDF4_.py", line 101, in __getitem__   [ return indexing.explicit_indexing_adapter( ]
xarray/core/indexing.py", line 858, in explicit_indexing_adapter [ result = raw_indexing_method(raw_key.tuple) ]
xarray/backends/netCDF4_.py", line 114, in _getitem      [ array = getitem(original_array, key) ]

    (FOLLOWED BY)

  File "src/netCDF4/_netCDF4.pyx", line 4948, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4551, in netCDF4._netCDF4.Variable.shape.__get__
  File "src/netCDF4/_netCDF4.pyx", line 4499, in netCDF4._netCDF4.Variable._getdims
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
NetCDF: Not a valid ID
[-] Error running handlers: siv
  0%|          | 0/1 [38:48<?, ?it/s]

TEST: CMIP6.CMIP.E3SM-Project.E3SM-2-1.piControl.r1i1p1f1.Omon.thetao.gr.log

    (SAME stack trace as above, then …)
    (FOLLOWED BY)

  File "src/netCDF4/_netCDF4.pyx", line 4972, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 5930, in netCDF4._netCDF4.Variable._get
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
Resource temporarily unavailable
[-] Error running handlers: thetao
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
  0%|          | 0/1 [00:05<?, ?it/s]
chengzhuzhang commented 7 months ago

@TonyB9000 the Minimal Complete Verifiable Example (MVCE) no longer works: when I check, I only find eam data in /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data. When creating this kind of example, you may want to isolate the test data so others can run it. But for this issue, I can test with some v2 data available on acme1.

chengzhuzhang commented 7 months ago

Alright, I can reproduce this with 12 months of data using v1.11.2:

 e3sm_to_cmip -s --realm SImon --var-list siv --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc --input-path /p/user_pub/e3sm/zhang40/e2c_tony/e2c_test_data/v2.mpassi_input/ --output-path /p/user_pub/e3sm/zhang40/e2c_tony/tests --user-metadata /p/user_pub/e3sm/zhang40/e2c_tony/e2c_test_data/holodeck/input/historical_r1i1p1f1.json --tables-path /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables
chengzhuzhang commented 7 months ago

I suspect that the "Resource temporarily unavailable" error comes from the use of lock=False in https://github.com/E3SM-Project/e3sm_to_cmip/commit/0f91fb4abc839e2ec04fe8e770ad7308a41c6620. When testing v1.11.1, things work okay.

With lock=True, large operations sometimes hang; with lock=False, some variables fail with "Resource temporarily unavailable".

chengzhuzhang commented 7 months ago

@tomvothecoder I don't know if this is specific to acme1's file system, but should we consider reverting to v1.11.1 for the upcoming e3sm_unified deployment? Let's decide after a more thorough investigation.

tomvothecoder commented 7 months ago

I suspect that the "Resource temporarily unavailable" error comes from the use of lock=False 0f91fb4. When testing v1.11.1, things work okay.

With lock=True, large operations sometimes hang; with lock=False, some variables fail with "Resource temporarily unavailable".

I think you might be right about lock=False causing "Resource temporarily unavailable", possibly due to resource contention with multiprocessing.
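
For context, here is a minimal sketch of the call pattern being discussed (not the exact e3sm_to_cmip code; the glob and concat settings are assumptions). With the netcdf4 engine, xarray forwards the lock keyword to the backend, and lock=False disables per-file locking:

import xarray as xr

# lock=False turns off the per-file HDF5/netCDF lock -- the setting suspected of
# producing EAGAIN ("Resource temporarily unavailable") under concurrent reads.
# Omitting the argument lets xarray choose its default lock.
ds = xr.open_mfdataset(
    "native_data/*.mpassi.hist.am.timeSeriesStatsMonthly.*.nc",  # hypothetical glob
    combine="nested",
    concat_dim="Time",  # MPAS native time dimension (assumption)
    # lock=False,
)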

Related issues

@tomvothecoder I don't know if this is specific to acme1's file system, but should we consider reverting to v1.11.1 for the upcoming e3sm_unified deployment? Let's decide after a more thorough investigation.

I agree that more investigation is needed before deciding what to do next. v1.11.2 includes another fix for parallel=False to avoid NetCDF: Not a Valid ID with open_mfdataset(), which we need to keep too.

My understanding is that Tony is still using acme1 and datasm with e3sm_to_cmip. Is e3sm_to_cmip used by anybody else in E3SM Unified, besides zppy? If zppy tests pass, then we have the option of keeping v1.11.2 for now with the risk of this bug coming up. It sounds like Xylar is pressing to move forward with E3SM Unified v1.10.0, so I don't know how much time we have.

tomvothecoder commented 7 months ago

Minimal Complete Verifiable Example (MVCE)

Command line issued (directly at console):

e3sm_to_cmip -v siv -u /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/metadata/historical_r4i1p1f1.json -t /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables -o /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/product -i /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data -s --realm SImon --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc >> Local_log 2>&1 &

For future reference, you can use a Python script to invoke e3sm_to_cmip. This makes debugging much easier by enabling the use of the Python debugger. I included examples in the documentation here: https://e3sm-to-cmip.readthedocs.io/en/latest/development-guide.html

I would assume this issue occurs with multiprocessing (serial=False), but I'm not 100% sure. I just noticed the log output says that it is serial.

chengzhuzhang commented 7 months ago

I suspect that the "Resource temporarily unavailable" error comes from the use of lock=False 0f91fb4. When testing v1.11.1, things work okay. With lock=True, large operations sometimes hang; with lock=False, some variables fail with "Resource temporarily unavailable".

I think you might be right about lock=False causing "Resource temporarily unavailable", possibly due to resource contention with multiprocessing.

Related issues

* [NETCDF - Resource temporarily unavailable  dask/dask#2657](https://github.com/dask/dask/issues/2657)

* [netCDF4 Resource unavailable error pydata/xarray#4304](https://github.com/pydata/xarray/issues/4304)

@tomvothecoder I don't know if this is specific to acme1's file system, but should we consider reverting to v1.11.1 for the upcoming e3sm_unified deployment? Let's decide after a more thorough investigation.

I agree that more investigation is needed before deciding what to do next. v1.11.2 includes another fix for parallel=False to avoid NetCDF: Not a Valid ID with open_mfdataset(), which we need to keep too.

My understanding is that Tony is still using acme1 and datasm with e3sm_to_cmip. Is e3sm_to_cmip used by anybody else in E3SM Unified, besides zppy? If zppy tests pass, then we have the option of keeping v1.11.2 for now with the risk of this bug coming up. It sounds like Xylar is pressing to move forward with E3SM Unified v1.10.0, so I don't know how much time we have.

Thanks for your insight. Based on Tony's recent operation, I think the zppy application, which only involves processing atmosphere and land variables, is not impacted. So perhaps we can take the time to resolve this ocean/sea-ice related error on the side.

tomvothecoder commented 7 months ago

Thanks for your insight. Based on Tony's recent operation, I think the zppy application, which only involves processing atmosphere and land variables, is not impacted. So perhaps we can take the time to resolve this ocean/sea-ice related error on the side.

Sounds good. I will convert Tony's command into a Python script and try to do some debugging today.

TonyB9000 commented 7 months ago

@tomvothecoder @chengzhuzhang Just a side note - more than half of the v2_1 MPAS jobs (run with datasm back in February) completed successfully. So, I was going to examine the generation dates to see if the first appearance of failures corresponds to any changes I (or others) may have made to e2c.

I am interested in the python-driver you (Tom) mentioned, and will investigate.

TonyB9000 commented 7 months ago

@chengzhuzhang Jill wrote "When creating this kind of example, you may want to isolate the test data so others can run it."

Sorry about that. Yes, I neglected to isolate that data for retesting.

TonyB9000 commented 7 months ago

@tomvothecoder @chengzhuzhang Just for the record, we are running into the same errors on Chrysalis - so this is not limited to acme1.

KEY FACT: Circa February 10, both Omon and SImon jobs (v2_1 data) were completing successfully (via datasm postprocess). Soon thereafter, sporadic failures became the norm. I have maintained all of the runlogs for the 296 (v2_1) MPAS jobs, of which 26 are consistently failing. I will seek to characterize the failure dates and types, to see if I can isolate a date upon which to focus.

chengzhuzhang commented 7 months ago

I would suggest reverting the change that sets lock=False. If I understand correctly, this change causes all MPAS variables to fail, whether on acme1 or Chrysalis.

TonyB9000 commented 7 months ago

Turns out, lock=False was set on both acme1 and chrysalis. Setting it to "True" on chrysalis led to a small change at the end of the stack trace:


  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 112, in _getitem
    with self.datastore.lock:
  0%|          | 0/1 [00:00<?, ?it/s]
'bool' object does not support the context manager protocol
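
A hedged reading of that trace: the netcdf4 backend stores whatever value it is handed and later enters "with self.datastore.lock:", so it needs an actual lock object (or None/False, which xarray maps to a default or dummy lock), not the boolean True. A sketch with a placeholder file name:

import threading
import xarray as xr

# An explicit lock object satisfies the context-manager use inside the backend;
# omitting the argument lets xarray pick its own default lock.
ds = xr.open_dataset("some_mpas_file.nc", lock=threading.Lock())  # placeholder path
# ds = xr.open_dataset("some_mpas_file.nc")                       # default locking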
TonyB9000 commented 6 months ago

With "lock" parameter elided, the v2_1 1pctCO2 Omon.sosga completes successfully, as it had in February. But Omon.so fails. The stack-trace:

2024-04-29 15:13:07,374 [INFO]: so.py(handle:47) >> Starting so
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 831, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/so.py", line 68, in handle
    ds = mpas.remap(ds, 'mpasocean', mappingFileName)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 87, in remap
    write_netcdf(ds, inFileName, unlimited="time")
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 366, in write_netcdf
    ds.to_netcdf(fileName, encoding=encodingDict, unlimited_dims=unlimited)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/core/dataset.py", line 2298, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/api.py", line 1339, in to_netcdf
    dump_to_store(
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/api.py", line 1386, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/common.py", line 393, in store
    variables, attributes = self.encode(variables, attributes)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/common.py", line 482, in encode
    variables, attributes = cf_encoder(variables, attributes)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/conventions.py", line 795, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/conventions.py", line 196, in encode_cf_variable
    var = coder.encode(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/coding/variables.py", line 332, in encode
    fill_value = pop_to(encoding, attrs, "_FillValue", name=name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/coding/variables.py", line 219, in pop_to
    safe_setitem(dest, key, value, name=name)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/coding/variables.py", line 200, in safe_setitem
    raise ValueError(
  0%|          | 0/1 [00:23<?, ?it/s]
failed to prevent overwriting existing key _FillValue in attrs on variable 'so'. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
[-] Error running handlers: so
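
This failure can be reproduced outside e3sm_to_cmip. A minimal sketch with synthetic data (not E3SM output): when a variable ends up with _FillValue in both .attrs and .encoding, xarray's CF encoder raises this exact ValueError during to_netcdf().

import numpy as np
import xarray as xr

# _FillValue present in both attrs and encoding triggers the conflict on write.
da = xr.DataArray(np.ones(3, dtype="float32"), dims="x", name="so")
da.attrs["_FillValue"] = np.float32(9.96921e36)
da.encoding["_FillValue"] = np.float32(9.96921e36)
da.to_dataset().to_netcdf("fillvalue_conflict_demo.nc")
# ValueError: failed to prevent overwriting existing key _FillValue in attrs on variable 'so' ...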
TonyB9000 commented 6 months ago

I ran the same tests on Chrysalis (v2_1 data, 1pctCO2, Omon.sosga and Omon.so), and both failed - but for different reasons. sosga (which passed on acme1) failed with:

2024-04-30 10:43:23,692 [ERROR]: mpas.py(write_cmor:451) >> Error in cmor.write for sosga
Traceback (most recent call last):
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 823, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
TypeError: handle() takes 3 positional arguments but 5 were given

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 444, in write_cmor
    cmor.write(
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/cmor/pywrapper.py", line 834, in write
    return _cmor.write(var_id, data, data_type, file_suffix, ntimes_passed,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_cmor.CMORError: Problem with 'cmor.write'. Please check the logfile (if defined).
2024-04-30 10:43:23,692 [ERROR]: mpas.py(write_cmor:451) >> Error in cmor.write for sosga
Traceback (most recent call last):
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 823, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
TypeError: handle() takes 3 positional arguments but 5 were given

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 444, in write_cmor
    cmor.write(
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/cmor/pywrapper.py", line 834, in write
    return _cmor.write(var_id, data, data_type, file_suffix, ntimes_passed,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_cmor.CMORError: Problem with 'cmor.write'. Please check the logfile (if defined).
100%|██████████| 1/1 [00:19<00:00, 19.45s/it]
Converting sosga to float32

The test with "so" failed as well, but appears identical to the acme1 failure ("failed to prevent overwriting existing key _FillValue in attrs on variable 'so'").

TonyB9000 commented 6 months ago

Here is the cmor_logs logfile output for the failed sosga on chrysalis. FWIW - I harbor high enmity for this logfile format.

CMOR_LOG_output

TonyB9000 commented 6 months ago

As I wrote in slack, I will ensure that the installed cmor tables are up-to-date on both acme1 and chrysalis, then rerun both of the v2_1 Omon.sosga and Omon.so tests on each. I will also make sure that the tables-path is reiterated in the main log list of parameters applied.

chengzhuzhang commented 6 months ago

Hi @TonyB9000, for further troubleshooting, do you have intermediate files saved (i.e., before cmorization) for the variables that have this _FillValue issue, for example so from v2_1.LR.1pctCO2_0101.mpaso? Do you have an e3sm_to_cmip command line for reproducing this issue?

chengzhuzhang commented 6 months ago

Also, please submit a PR that omits "lock=False". I believe this fixes the concurrency issue but leaves the _FillValue issue.

TonyB9000 commented 6 months ago

Did the PR, and the merge (tests successful - whatever they are).

I'll run a test of the failed _FillValue issue - but e2c "hides" the interim regridded stuff in volatile "temp" directories. I'll see if they can be retained.

TonyB9000 commented 6 months ago

@chengzhuzhang I just ran:

dsm_generate_CMIP6_dsid_list_2.sh TEST to_test_cmip6_v2_1_1pctCO2_Omon_so

which creates a script that issues:

e3sm_to_cmip -v so -u /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/metadata/1pctCO2_r1i1p1f1.json -t /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables -o /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/product -i /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data -s --realm Omon --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc >> $the_sublog 2>&1

The content of the "sublog" (/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/caselogs/20240515_195553_552629-CMIP6.CMIP.E3SM-Project.E3SM-2-1.1pctCO2.r1i1p1f1.Omon.so.gr.sublog) reads:

20240515_195605_598:INFO:e3sm_to_cmip.util:find_mpas_files:find_mpas_files: component = mpaso, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpaso, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpaso, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
20240515_195605_598:INFO:e3sm_to_cmip.util:find_mpas_files:results found: 12 items
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:507) >> results found: 12 items
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:507) >> results found: 12 items
20240515_195605_599:INFO:e3sm_to_cmip.util:find_mpas_files:find_mpas_files: component = mpas_mesh, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_mesh, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_mesh, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
20240515_195605_599:INFO:e3sm_to_cmip.util:find_mpas_files:component mpas_mesh found: v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:547) >> component mpas_mesh found: v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:547) >> component mpas_mesh found: v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc
20240515_195605_600:INFO:e3sm_to_cmip.util:find_mpas_files:find_mpas_files: component = mpas_map, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,600 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_map, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,600 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_map, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
20240515_195605_615:INFO:e3sm_to_cmip.util:find_mpas_files:component mpas_map found: /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc
2024-05-15 19:56:05,615 [INFO]: util.py(find_mpas_files:556) >> component mpas_map found: /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc
2024-05-15 19:56:05,615 [INFO]: util.py(find_mpas_files:556) >> component mpas_map found: /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc
2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so
2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 836, in _run_serial
    name = handler_method(
           ^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/so.py", line 69, in handle
    ds = mpas.remap(ds, 'mpasocean', mappingFileName)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 93, in remap
    write_netcdf(ds, inFileName, unlimited="time")
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 373, in write_netcdf
    ds.to_netcdf(fileName, encoding=encodingDict, unlimited_dims=unlimited)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/core/dataset.py", line 2298, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/api.py", line 1339, in to_netcdf
    dump_to_store(
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/api.py", line 1386, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/common.py", line 393, in store
    variables, attributes = self.encode(variables, attributes)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/common.py", line 482, in encode
    variables, attributes = cf_encoder(variables, attributes)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/conventions.py", line 795, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/conventions.py", line 196, in encode_cf_variable
    var = coder.encode(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/coding/variables.py", line 332, in encode
    fill_value = pop_to(encoding, attrs, "_FillValue", name=name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/coding/variables.py", line 219, in pop_to
    safe_setitem(dest, key, value, name=name)
  File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/coding/variables.py", line 200, in safe_setitem
    raise ValueError(
  0%|          | 0/1 [00:19<?, ?it/s]
failed to prevent overwriting existing key _FillValue in attrs on variable 'so'. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
[-] Error running handlers: so

The most recent elements of my "TMPDIR" (/p/user_pub/e3sm/bartoletti1/tmp) are:

-rw-------.  1 bartoletti1 publishers        27493 May  9 05:45 tmpfhs53hel
-rw-------.  1 bartoletti1 publishers            0 May 15 12:56 tmpz89h3sws
-rw-------.  1 bartoletti1 publishers          239 May 15 12:56 tmpzgznny1v

and ncdump /p/user_pub/e3sm/bartoletti1/tmp/tmpzgznny1v gives netcdf tmpzgznny1v { }

TonyB9000 commented 6 months ago

@chengzhuzhang Presently, the only variables exhibiting these errors are a subset of Omon (so, sos, thetao, thkcello, tos, volcello), and ONLY for 1pctCO2 and abrupt-4xCO2.

The "job directory" created for this job (all 1pctCO2 vars), when listed chronologically, show that only the last 3 were touched:

(see /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101)

drwxrwxr-x. 2 bartoletti1 publishers   4096 Apr  3 12:41 rgr_fixed_vars
drwxrwxr-x. 2 bartoletti1 publishers   4096 Apr  3 12:41 rgr_vert
drwxrwxr-x. 2 bartoletti1 publishers   4096 Apr  3 12:41 metadata
drwxrwxr-x. 3 bartoletti1 publishers   4096 Apr  3 16:57 zhold_product_20240403
-rw-rw-r--. 1 bartoletti1 publishers  24818 Apr  4 10:01 My_own_log
drwxrwxr-x. 3 bartoletti1 publishers   4096 Apr  4 17:10 zhold_product_20240404
-rw-rw-r--. 1 bartoletti1 publishers      0 Apr  5 11:03 try_log
drwxrwxr-x. 2 bartoletti1 publishers   4096 Apr  5 11:03 logs
-rw-rw-r--. 1 bartoletti1 publishers    151 Apr  9 14:13 findlog
drwxrwxr-x. 2 bartoletti1 publishers   4096 Apr 12 10:28 native_out
drwxrwxr-x. 2 bartoletti1 publishers   4096 Apr 12 10:28 rgr
drwxrwxr-x. 3 bartoletti1 publishers   4096 May 10 12:31 product
drwxrwxr-x. 2 bartoletti1 publishers 262144 May 15 12:55 native_data
drwxrwxr-x. 2 bartoletti1 publishers 262144 May 15 13:04 caselogs
drwxrwxr-x. 4 bartoletti1 publishers   4096 May 15 13:13 scripts
chengzhuzhang commented 6 months ago

Thank you @TonyB9000! Could you open permissions for:

-rw-------. 1 bartoletti1 publishers 27493 May  9 05:45 tmpfhs53hel
-rw-------. 1 bartoletti1 publishers     0 May 15 12:56 tmpz89h3sws
-rw-------. 1 bartoletti1 publishers   239 May 15 12:56 tmpzgznny1v

TonyB9000 commented 6 months ago

Everything in /p/user_pub/e3sm/bartoletti1/tmp should be accessible now.

chengzhuzhang commented 6 months ago

@chengzhuzhang Presently, the only variables exhibiting these errors are a subset of Omon (so, sos, thetao, thkcello, tos, volcello), and ONLY for 1pctCO2 and abrupt-4xCO2.

I'm trying to visualize native MPAS output through uxarray, but it doesn't seem to work out of the box with their docs. @xylar We ran into the _FillValue issue only with two experiments. I can't help wondering whether it is related to the recent coupled-group work on the model crashing in a warmer climate. Do you have a script I could use to take a quick look at MPAS data?

chengzhuzhang commented 6 months ago

@xylar Never mind. I got the script working with uxarray. Checking native data now...

chengzhuzhang commented 6 months ago

@TonyB9000 Definitely, we should consider breaking large 3D variables into yearly chunks... I see a 316 GB file with 6000 time steps, so the tmp files at this point aren't really helpful. Could you try testing with the variable tos (a 2D variable), running e3sm_to_cmip every 10 years (e.g., 1-10, 11-20, ...), to see when the crash happens?

TonyB9000 commented 6 months ago

@chengzhuzhang The last run (depicted above with "2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so") was only given 1 year of data to run with. (That is the ONLY way I know of to limit the years and file sizes. e3sm_to_cmip has no command-line option to limit the years. You must create external code that calls e2c sequentially on segments of data; that is what the CWL workflows were doing...)
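
To make that concrete, here is a hedged sketch (all paths, the file-name pattern, and the 120-files-per-decade count are assumptions, not from this thread) of the kind of external driver described above: stage one decade of monthly native files at a time into a scratch input directory and run e3sm_to_cmip on each segment, so a crash can be narrowed to a year range. The mesh/restart file, namelist, and region-mask links that MPAS cmorizing needs would also have to be present in each segment directory.

import glob
import os
import subprocess

NATIVE = "/path/to/native_data"        # placeholder for the real native_data directory
SCRATCH = "/path/to/segment_input"     # per-segment staging directory (placeholder)

monthly = sorted(glob.glob(os.path.join(NATIVE, "*.mpaso.hist.am.timeSeriesStatsMonthly.*.nc")))
os.makedirs(SCRATCH, exist_ok=True)

for start in range(0, len(monthly), 120):              # 120 monthly files ~ one decade
    # clear the previous segment's links, then stage the next decade
    for old in os.listdir(SCRATCH):
        os.remove(os.path.join(SCRATCH, old))
    for src in monthly[start:start + 120]:
        os.symlink(src, os.path.join(SCRATCH, os.path.basename(src)))
    # (the mesh/restart, namelist, and region-mask links would need to be re-staged here too)
    subprocess.run(
        ["e3sm_to_cmip", "-s", "--realm", "Omon", "-v", "tos",
         "-i", SCRATCH,
         "-o", "/path/to/product",                      # placeholder
         "-u", "/path/to/metadata.json",                # placeholder
         "-t", "/path/to/cmip6-cmor-tables/Tables",     # placeholder
         "--map", "/path/to/mapfile.nc"],               # placeholder
        check=False,
    )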

TonyB9000 commented 6 months ago

@chengzhuzhang I am doing another run, hopefully with more log results.

chengzhuzhang commented 6 months ago

@TonyB9000 I think I found the problem... @xylar it looks like I need to bug you after all. I think the hiccup in e3sm_to_cmip that Tony has been experiencing is caused by an additional _FillValue attribute that is present ONLY in the 1pctCO2 and abrupt-4xCO2 v2.1 simulations. Listing below:

timeMonthly_avg_layerThickness:_FillValue = 9.96921e+36f
timeMonthly_avg_normalVelocity:_FillValue = 9.96921e+36f
timeMonthly_avg_activeTracers_temperature:_FillValue = 9.96921e+36f
timeMonthly_avg_activeTracers_salinity:_FillValue = 9.96921e+36f 

I think these are the native variables used to derive the Omon variables (so, sos, thetao, thkcello, tos, volcello). So the question is: any clue why we suddenly have _FillValue in these 2 experiments? I think originally we just used the cell mask; with _FillValue present, how should we modify e2c?
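
One quick way to confirm this (a sketch; the file-name pattern is an assumption) is to list which variables in a native mpaso monthly file carry a _FillValue attribute. decode_cf=False keeps the attribute visible in .attrs instead of moving it into .encoding:

import glob
import xarray as xr

# Open one native monthly file without CF decoding and report the variables
# that declare a _FillValue attribute.
path = sorted(glob.glob("native_data/*.mpaso.hist.am.timeSeriesStatsMonthly.*.nc"))[0]
ds = xr.open_dataset(path, decode_cf=False)
print([name for name, var in ds.variables.items() if "_FillValue" in var.attrs])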

TonyB9000 commented 6 months ago

@chengzhuzhang @xylar Long Long Ago, in a Galaxy Far Far Away . . . we addressed a similar issue by editing the native data with ncatted:

(in /p/user_pub/e3sm/bartoletti1/Operations/8_Corrections/history/FIX_FillValue/remove_FillValue_attribute.sh)


# $filelist lists the native files to fix; $target_path is the directory containing them
for afile in `cat $filelist`; do

    ncatted -a _FillValue,,d,,, $target_path/$afile
    echo "cleaned $target_path/$afile"

done

But I don't know if this past solution applies (I think that was a "_FillValue=NaNf" issue)

xylar commented 6 months ago

@chengzhuzhang and @TonyB9000, we added fill values to as many variables as we could to increase CF compliance about a year and a half ago: https://github.com/E3SM-Project/E3SM/pull/5325 These fill values should be present in all E3SM v3 simulations, but I guess you haven't had to post-process any of those yet? I'm not sure why 4xCO2 and 1pctCO2 are the only v2.1 simulations to show this behavior; I would expect it in all v2.1 and later simulations. Are these the first v2.1 simulations you have processed?

Anyway, I'm happy to look into fixing this. I suspect the fix is simply to drop the attribute if it is present, as the error message above suggests. I would do this in the handlers rather than with ncatted because that saves an extra step and some headache.
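
For illustration, one possible shape of that fix (a sketch, not the actual handler change): strip _FillValue from the attrs of every variable after opening the native data, so that only xarray's own encoding controls the fill value when the intermediate file is written.

import xarray as xr

def drop_fill_value_attrs(ds: xr.Dataset) -> xr.Dataset:
    # Remove _FillValue from variable attrs; xarray's encoder will still apply
    # whatever fill value is set in each variable's .encoding on write.
    for name in ds.variables:
        ds[name].attrs.pop("_FillValue", None)
    return ds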

chengzhuzhang commented 6 months ago

@xylar Thank you for clarifying and for the quick fix. We have not started processing v3, and were able to finish other v2.1 processing, including historical, piControl, and amip. I'm still curious why ONLY 1pctCO2 and abrupt-4xCO2 have the _FillValue added.

xylar commented 6 months ago

I have no idea. I would expect all v2.1 runs with ocean and sea-ice output to have this issue. The feature was merged before the v2.1 tag was created.